Re: Patch: self-contained HTML using Data URI

2014-06-25 Thread Nick Burch
On Wed, 25 Jun 2014, Andrew Skiba wrote: Let me check I understand you right. WordExtractor will continue to create Yes, as will (should..) the other parsers which find embedded resources and call the ImageParser once for every file name. No. It'll call your code, as you'll have registered

Re: Metadata at e.g. textfiles

2014-07-10 Thread Nick Burch
On Thu, 10 Jul 2014, Kai-Uwe Schmidt wrote: is there a way to get e.g. creator and or creation date into the metadata dictionary? Only for file formats which store this information When I extract from a file I get the following: Content-Encoding: ISO-8859-1 Content-Length: 9 Content-Type: te

Re: Wrong parsing of XML

2014-07-11 Thread Nick Burch
On Fri, 11 Jul 2014, Avi Hayun wrote: 1. I use tika-core in my app 2. I use the following to detect the stream's media type: byte[] bytes = IOUtils.toByteArray(new URL("http://www.amazon.com/sitemap_video.xml")); That file doesn't have an xml header on the front, which is probably why it isn

Re: Patch: self-contained HTML using Data URI

2014-07-14 Thread Nick Burch
On Thu, 10 Jul 2014, Andrew Skiba wrote: Took some time, but I glued it all together, so now it works without modifying Tika sources, only by using custom handler, extractor and parser. It works with WordExtractor, although it is looking as a dirty hack. As I could not override the behavior of

Java code layout settings - do we have them documented somewhere?

2014-07-14 Thread Nick Burch
Hi All Recently when reviewing some patches from new contributors, we've had some confusion over spaces and tabs. For TIKA-1361, I've hit another one - explicit imports vs package wide wildcard ones. Currently, we don't seem to have anything listed on the Contributors page to tell people wha

Re: Java code layout settings - do we have them documented somewhere?

2014-07-14 Thread Nick Burch
On Mon, 14 Jul 2014, Matthias Krueger wrote: a guideline on how to handle existing code that does not adhere to 4 spaces would be helpful. TIKA-44 says all code was converted, so clearly this can't happen... ;-) When submitting a patch/pull request should I include a commit reformatting the r

Re: [DISCUSS] 1.6 Release?

2014-07-17 Thread Nick Burch
On Thu, 17 Jul 2014, Tyler Palsulich wrote: I just resolved all but TIKA-1324 (and 1367). Nick, what's the difference between the release notes and changelog? Want to make sure the right files are updated. I'm not the one to ask, as I've never played Release Manager for Tika. (Can't spot any

Re: Miredot License Key for Apache Tika Project

2014-07-21 Thread Nick Burch
On Sat, 19 Jul 2014, Tom Barber wrote: Jumping on this thread very late so please excuse me if this has been covered. Has anyone contemplated Enunciate for REST API documentation? If you look on the list in about April, you should see a patch I posted which turned on Enunciate support,

Re: How should video files with audio be handled by parsers?

2014-07-22 Thread Nick Burch
On Tue, 22 Jul 2014, Ray Gauss wrote: This is a few months old but I've been looking at this recently and since we're unlikely to move to a structured metadata store in the short term I've come up with what I think is an interim solution [1] that essentially allows nesting through XPath-like sy

Re: How should video files with audio be handled by parsers?

2014-07-23 Thread Nick Burch
On Tue, 22 Jul 2014, Ray Gauss wrote: The info on what the streams are and how they relate can be conveyed via PBCore, i.e.: pbcore:instantiationTracks=1 video track, English and Spanish audio, Director's commentary audio Ah, that's good. Looks a sensible enough and easy to follow standard t

Re: How should video files with audio be handled by parsers?

2014-07-24 Thread Nick Burch
On Wed, 23 Jul 2014, Ray Gauss wrote: 2) There are several PBCore instantiation properties that apply to the entire file like duration and tracks that we'd want prefixed with pbcore so I think it would be odd to see: pbcore:instantiationDuration=00:00:05.20 stream[0]/pbcore:essenceTrac

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch
On Mon, 28 Jul 2014, Sergey Beryozkin wrote: This is not an issue that should block the release, I was careful not to vote with a minus one. I've become a bit impatient, but no one really blocks me from completing this pure documentation effort myself, I was hoping that someone would do it firs

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch
On Mon, 28 Jul 2014, Allison, Timothy B. wrote: There was one regression: http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx Stacktrace: Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454 at java.lang.String.checkBounds(String.java

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-30 Thread Nick Burch
On Mon, 28 Jul 2014, Mattmann, Chris A (3980) wrote: A candidate for the Tika 1.6 release is available at: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/ Should original-tika-app-1.6.jar be in there? IIRC we decided in the 1.5 release that it shouldn't be Please vote on releasing

Re: Compress algorithm 'implode' not parsed.

2014-07-31 Thread Nick Burch
On Thu, 31 Jul 2014, kevin slote wrote: Point being, Tika-1.5 uses apache-commons-compress 1-5. According to the Apache compress jira ticket below, Apache compress can Trunk currently uses Commons Compress 1.8, can you try with that? (Tika 1.6 should be out within about a week, based on trunk)

Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Nick Burch
Another quick thought on the artifacts in http://people.apache.org/~mattmann/apache-tika-1.6/rc1/ - as well as needing to ditch original-tika-app.jar, shouldn't we have the Tika Server standalone jar in there too as another released + easily downloadable jar? Thanks Nick On 28/07/14 05:22,

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-31 Thread Nick Burch
On Thu, 31 Jul 2014, Allison, Timothy B. wrote: On a related note, I did some digging on the one regression I found in the pptx, and that will be solved if we wait for POI 3.11 beta 1. I haven't yet had a chance to rerun on the random sample with the updated POI... I'm currently on a train

Re: New API wiki page

2014-08-04 Thread Nick Burch
On Mon, 4 Aug 2014, Tyler Palsulich wrote: Thanks, Chris! But, it looks like the page is immutable for me. What's your Wiki username? You'll probably need adding to the ContributorsGroup - by default most pages require a listed username to prevent spam Nick

Re: New API wiki page

2014-08-04 Thread Nick Burch
On Mon, 4 Aug 2014, Tyler Palsulich wrote: What's your Wiki username? TylerPalsulich Added, enjoy! Nick

Re: svn commit: r1616295 [1/2] - in /tika/trunk: ./ tika-app/src/main/java/org/apache/tika/cli/ tika-app/src/test/java/org/apache/tika/cli/ tika-core/src/main/java/org/apache/tika/detect/ tika-core/sr

2014-08-07 Thread Nick Burch
On 06/08/14 19:16, tpalsul...@apache.org wrote: Author: tpalsulich Date: Wed Aug 6 18:16:27 2014 New Revision: 1616295 URL: http://svn.apache.org/r1616295 Log: Fix for TIKA-1387 (thanks Uwe Schindler). Adding the Maven forbidden-apis plugin and fixing identified errors. Minor thing, but any

Re: How should video files with audio be handled by parsers?

2014-08-07 Thread Nick Burch
On Wed, 6 Aug 2014, Ray Gauss wrote: I've updated tika-ffmpeg with a new file with 2 audio tracks and a subtitle track and added a test.  The metadata looks as follows: pbcore:instantiationDataRate=3511 kb/s pbcore:instantiationDuration=00:00:01.03 pbcore:instantiationEssenceTrack[0]/pbcore:ess

Re: [DISCUSS] Give examples of Parser, Detector, and Translator usage

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Tyler Palsulich wrote: I think we should add some consolidated documentation on how to use Tika's Java API. It would be very helpful if we had short snippets of code that showed how exactly you can use Parser.parse(), for example. I think I remember a thread about testing exam

Re: [DISCUSS] Give examples of Parser, Detector, and Translator usage

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Tyler Palsulich wrote: This needs to pull from the examples in svn, so we make sure it compiles and stays working. See above! Ahh. Thank you for the link. So, first, create the tika-example module and some examples. Yup Then integrate Apache CMS into the website (never don

Re: [DISCUSS] Give examples of Parser, Detector, and Translator usage

2014-08-07 Thread Nick Burch
On Thu, 7 Aug 2014, Tyler Palsulich wrote: Sounds like the new module is a good idea. So, let's jump on it! I will create a new 'example' JIRA tag and create issues for creating the module and adding Parse, Detect, and Translate examples. Others should add issues/desired examples as they see fi

RE: [DISCUSS] Give examples of Parser, Detector, and Translator usage

2014-08-11 Thread Nick Burch
On Mon, 11 Aug 2014, Allison, Timothy B. wrote: For development on TIKA-1302, I've been using a modified version of the recursive parser wrapper that I submitted (well, plagiarized from Jukka+Nick's code on the wiki site above) as TIKA-1329. For Tika 1.7, I'd like to add this to tika-app and t

Re: [DISCUSS] Apache Tika 1.6 RC #2..today?

2014-08-19 Thread Nick Burch
On Tue, 19 Aug 2014, Mattmann, Chris A (3980) wrote: OK I've been watching dev fly by and I think it's time for RC #2 of 1.6, which I'll spin today. If possible, I'd say wait another day. There's one more issue (see private@) spotted in POI since 3.11 beta 1 that should be included in Tika 1.

Re: How should video files with audio be handled by parsers?

2014-08-20 Thread Nick Burch
-ffmpeg/blob/master/src/main/java/org/apache/tika/metadata/PBCore.java On August 7, 2014 at 6:21:37 AM, Nick Burch (apa...@gagravarr.org) wrote: On Wed, 6 Aug 2014, Ray Gauss wrote: I've updated tika-ffmpeg with a new file with 2 audio tracks and a subtitle track and added a test. The met

Re: [DISCUSS] Apache Tika 1.6 RC #2..today?

2014-08-20 Thread Nick Burch
On Tue, 19 Aug 2014, Chris Mattmann wrote: OK, will roll the RC in a day. Update has been committed to trunk and merged to the 1.6 branch As soon as your normal maven mirror sees the 3.11 beta 2 release artifacts, you should be good to roll the RC! (All tests pass for me after the upgrade an

Re: How should video files with audio be handled by parsers?

2014-08-22 Thread Nick Burch
On Wed, 20 Aug 2014, Ray Gauss wrote: Are these the droids I'm looking for?    https://github.com/Gagravarr/VorbisJava/tree/master/tika/src/main/java/org/gagravarr/tika Yup! To find out about the relations between streams, you'll need to use the org.gagravarr.skeleton classes to decode the S

Trunk broken - forbidden API check failing

2014-08-27 Thread Nick Burch
Hi All Anyone know about this, which is causing trunk to not build, triggering during creating the bundle: [ERROR] BUILD FAILURE [INFO] [INFO] Check for forbidden API calls failed: java.lang.ClassNotFoundException: Load

RE: Trunk broken - forbidden API check failing

2014-08-28 Thread Nick Burch
On Thu, 28 Aug 2014, Uwe Schindler wrote: Patch, as the attachment got lost (I have no commit rights): Ta, committed in r1621077 (along with a comment on why we do it) (I'll leave commenting on why the bundler works like that, and why the examples isn't pulling in the asserts from the Tika Core

RE: Trunk broken - forbidden API check failing

2014-08-28 Thread Nick Burch
On Thu, 28 Aug 2014, Tyler Palsulich wrote: Quick unrelated question: to resolve that issue, we're using Locale.ROOT (from Java 1.6). Our default Maven target is 1.6. But, on the website, we list Java 1.5 as the minimum. What exactly is the minimum? I believe we switched to 1.6 a year or two b
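
For context on the Locale.ROOT point above, here is a minimal sketch of the kind of change a forbidden-API check typically prompts: replacing default-locale case conversion with an explicit Locale.ROOT. The class and method names are hypothetical and are not taken from the Tika source.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Tika).
class LocaleSafeCase {
    // Lower-cases with an explicit, locale-neutral rule set.
    // The no-argument String.toLowerCase() uses the JVM default locale;
    // under a Turkish default locale, "TITLE" would come back with a dotless i.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

Calling LocaleSafeCase.lower("TITLE") gives the same result on every machine, which is the property the check is after.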

Re: TIKA - how to read chunks at a time from a very large file?

2014-08-28 Thread Nick Burch
On Thu, 28 Aug 2014, ruby wrote: Since the files contain over 5GB data, the content string here will end up as too much data in memory. I want to avoid this and want to read a chunk at a time. You'll probably need your own custom ContentHandler, which detects when there's too much data, and flushes
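
The custom ContentHandler idea in the reply above might look roughly like the following. This is only a sketch under assumptions: the class name ChunkingContentHandler, the fixed chunk size, and the Consumer-based sink are all hypothetical, and the reply (which is cut off here) may have had a different flushing strategy in mind. It relies only on the SAX classes shipped with the JDK.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text events and hands them on in
// fixed-size chunks, so the full document text is never held in memory.
class ChunkingContentHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink;   // receives each flushed chunk
    private final StringBuilder buffer = new StringBuilder();

    ChunkingContentHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Flush as many complete chunks as the buffer currently holds.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once parsing finishes.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

A handler like this would be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext), with the sink writing each chunk to disk or an index rather than collecting it in a single String.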

Re: [VOTE] Release Apache Tika 1.6 RC #2

2014-09-02 Thread Nick Burch
On Mon, 1 Sep 2014, Mattmann, Chris A (3980) wrote: A candidate for the Tika 1.6 release is available at: http://people.apache.org/~mattmann/apache-tika-1.6/rc2/ I see that the pesky original-tika-app jar has gone, and we have the tika-server jar in its place, thanks for fixing that :) O

Re: Please add me to authorized wiki editors

2014-09-03 Thread Nick Burch
On Wed, 3 Sep 2014, Allison, Timothy B. wrote: TimothyAllison I’d like to start documenting tika-batch. Done! Nick

Re: NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Nick Burch
On Thu, 11 Sep 2014, Tyler Palsulich wrote: BTW, we don't have any x.y.z releases yet - should we just call this 1.7? That's probably just as easy? Both sound good to me. We don't want to run out of numbers < 2.0, though. ;) As long as we don't hit 1.FF, we're probably OK, it is 2 digit non

Re: NPE on all *.odt, odp, .ods documents

2014-09-12 Thread Nick Burch
On Fri, 12 Sep 2014, Andrzej Bialecki wrote: Nick and all, anything else? I'd lean towards setting a deadline of a week, then roll 1.7 from trunk then. That would give people a few days to get any last fixes that want in I'd quite like it if people could look at http://tika.apache.org/1.7/e

Re: NPE on all *.odt, odp, .ods documents

2014-09-12 Thread Nick Burch
On Fri, 12 Sep 2014, Mattmann, Chris A (3980) wrote: One thing about the examples - I had to do a hack and export tika-examples into the Maven doxia path to get the site to build since the APT pages reference the tika-examples src - can someone look into that? Hmm, builds cleanly for me... I

Re: NPE on all *.odt, odp, .ods documents

2014-09-12 Thread Nick Burch
On Fri, 12 Sep 2014, Mattmann, Chris A (3980) wrote: But, before the site built fine from anywhere, so maybe we should add something to the tika-site build that checks out tika-examples or something..that's what I was thinking just updating the ant-run for that. How does it work for building

Re: NPE on all *.odt, odp, .ods documents

2014-09-12 Thread Nick Burch
On Fri, 12 Sep 2014, Mattmann, Chris A (3980) wrote: Nah, I just download the full tika package separately, run mvn javadoc:aggregate and then just copy it to the publish directory. So that's a hack too and not automatic. :/ Ah, ok. I've had a try with a svn:externals, does that work well for

Tika at ApacheCon Europe - 2 months time!

2014-09-22 Thread Nick Burch
Hi All It's only 2 months to go until ApacheCon Europe in Budapest. I'm simultaneously excited by all the great Tika stuff going on, and worried by how many talks I need to finish writing... As usual for an ApacheCon, we've a number of talks about Tika going on, and almost certainly a hacka

Re: Apache Tika 1.6 Fails SHA1 and Key Checks

2014-10-07 Thread Nick Burch
On Tue, 7 Oct 2014, Shannon Brown wrote: I tried to download Tika 1.6 today 7 Oct 2014. Both the SHA1 and PGP checks failed. Also, the PGP keys link is broken on the Tika downloads site. See https://tika.apache.org/download.html Which mirror did you download from? (You don't directly download

Re: import (re)ordering?

2014-10-21 Thread Nick Burch
On Tue, 21 Oct 2014, Allison, Timothy B. wrote: I have Intellij set to order imports by javax, java, then other. I think this is the most common pattern in Tika. Is it ok if I make these (meaningless/formatting) changes when I commit other changes? The only downside of this is that the top o

RE: import (re)ordering?

2014-10-24 Thread Nick Burch
On Fri, 24 Oct 2014, Allison, Timothy B. wrote: Y, I'll try to be more careful about separating out formatting from content in the future (apologies for TIKA-1451). What I didn't want to do was start an IDE war if others have different settings that will order imports in a different way. I'd

PDF test failing on trunk

2014-10-29 Thread Nick Burch
Hi All Just tried to build trunk, and got a test failure: Tests in error: testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content Tests run: 547, Failures: 0, Errors: 1, Skipped: 7 The exception in the log is: Caused by: java.io.IOException: javax.cr

RE: PDF test failing on trunk

2014-10-30 Thread Nick Burch
On Thu, 30 Oct 2014, Allison, Timothy B. wrote: The build is working for me on linux and Windows with Java 1.7. Can you tell which file is causing the problem? I wonder if the upgrade to PDFBox 1.8.7 caused the issue? I've just tried with Java 7, and that passes! The JVM it's failing on is:

RE: PDF test failing on trunk

2014-10-30 Thread Nick Burch
On Thu, 30 Oct 2014, Allison, Timothy B. wrote: Ha. Works with an older version of 1.6: java version "1.6.0_30" OpenJDK Runtime Environment (IcedTea6 1.13.1) (rhel-3.1.13.1.el6_5-x86_64) OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode) Joy. Full stacktrace below, maybe one that needs r

RE: PDF test failing on trunk

2014-10-30 Thread Nick Burch
On Thu, 30 Oct 2014, Allison, Timothy B. wrote: I think so. Would you like the honors? You're more of a pdf expert than I am, so maybe you'd be best :) Nick

Re: Move definitively from SVN to Git ?

2014-11-17 Thread Nick Burch
On Mon, 17 Nov 2014, Hong-Thai Nguyen wrote: Git is used everywhere and offers many new features. Should we abandon the SVN repo and move to Git permanently to make it easier to apply fixes and contributions? We already have a git mirror - http://git.apache.org/tika.git/ - and a GitHub mirror which acc

Re: Move definitively from SVN to Git ?

2014-11-17 Thread Nick Burch
On Mon, 17 Nov 2014, Hong-Thai Nguyen wrote: I didn't realize that we could commit/push directly into git repo. Could we ? Master source is still SVN. However, committers can (and at least some do) work on a clone of the Git repo, and use GitSVN to push their changes to the SVN repo as commit

Re: Move definitively from SVN to Git ?

2014-11-17 Thread Nick Burch
On Mon, 17 Nov 2014, Hong-Thai Nguyen wrote: Yes, that's exactly what I'm doing. If we move to Git, we'll avoid all SVN stuff. Anyway, this concerns committers only. If we move to git, people who currently use SVN have to change though! Given that non-committers can already work with Git, could you

Re: svn commit: r1640535 - /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java

2014-11-19 Thread Nick Burch
On Wed, 19 Nov 2014, Tyler Palsulich wrote: It looks like imports are being reordered here. I think we decided (can't find an archive link right now) on java and javax imports before others. Everything we wrote down is here: http://tika.apache.org/contribute.html#Code_Formatting Nothing there

Subsets of tika parsers redux

2014-11-23 Thread Nick Burch
Hi All During ApacheCon, I had a chance to chat with Sergey about the "subset of Tika Parsers" issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :) As is shown on our list from time to

Re: Subsets of tika parsers redux

2014-11-25 Thread Nick Burch
On Mon, 24 Nov 2014, Sergey Beryozkin wrote: It is an interesting idea, one that can lead to introducing finer-grained bundles but also providing a mechanism for the (auto-)generation of the import metadata required by each of the parser modules. Besides, introducing several smaller bundles tha

Re: Using Tika to compile glossaries in commercial software

2014-12-06 Thread Nick Burch
On Sat, 6 Dec 2014, Emmanuel Ichbiah wrote: What do we need to distribute besides the jar file to be compliant with your licence agreement? The Apache Software License is fairly easy to read as a non-lawyer, so your best answer is likely to come from reading that! The main section of inte

One for our XMP experts - Property with indexed closed choice?

2014-12-14 Thread Nick Burch
Hi All I'm trying to add photoshop:ColorMode as a new Metadata Property. It's on page 32 of the XMP spec part 2: http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/cc-201306/XMPSpecificationPart2.pdf photoshop:ColorMode * Closed Choice of Integer * The colour mode. One of: 0 = B

Re: Subsets of tika parsers redux

2014-12-15 Thread Nick Burch
On Mon, 15 Dec 2014, Sergey Beryozkin wrote: As a first step, I thought we'd still keep the same tika-parser jar, the only difference would be what dependencies ended up in the bundle. If the tika-bundle-pdf has no POI jars included in it, then the Microsoft Office related parsers shouldn't regis

Re: Subsets of tika parsers redux

2014-12-15 Thread Nick Burch
On Mon, 15 Dec 2014, Sergey Beryozkin wrote: OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf, or tika + tika-parsers-pdf + tika-parsers-mp3 if they want OSGi is nicely contained, and fairly easy to unit test, so let's use that to test out the idea! That also solves the CXF

Re: Subsets of tika parsers redux

2014-12-15 Thread Nick Burch
On Mon, 15 Dec 2014, Sergey Beryozkin wrote: I'm not proposing to split tika-parsers in a way that would affect the users, tika-parsers would still be there, except that it would strongly depend on tika-pdf and perhaps, when it is being built, it can have its dependencies like tika-pdf shaded i

Re: 1.7 release? | potential blocker?

2015-01-05 Thread Nick Burch
On Mon, 5 Jan 2015, Tyler Palsulich wrote: Works for me. I got stalled midway through the process of getting RC#1 out (authentication issues). But, going to try to finish it right now (best way to upload to dist.apache.org? That's a svn checkout For the RC, assuming it's the same process as fo

Re: [VOTE] Apache Tika 1.7 Release

2015-01-06 Thread Nick Burch
On Tue, 6 Jan 2015, Tyler Palsulich wrote: A candidate for the Tika 1.7 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/ The SHA1 checksum of the archive is

Re: [VOTE] Apache Tika 1.7 Release

2015-01-14 Thread Nick Burch
On Fri, 9 Jan 2015, Tyler Palsulich wrote: A candidate for the Tika 1.7 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.7-rc3/ All looks good to me (signatures, has

Re: [VOTE] Apache Tika 1.7 Release

2015-01-14 Thread Nick Burch
On Wed, 14 Jan 2015, Tyler Palsulich wrote: Nick, thanks for building the site! We still need to rebuild the index, right? You'll need to build the 1.7 index page (based on the changelog), then update the download page + homepage + menu, and finally rebuild the site (All I did was finish off

Re: Tika Server docker image

2015-01-19 Thread Nick Burch
On Mon, 19 Jan 2015, Konstantin Gribov wrote: There's no Apache docker registry (see INFRA-9035 and INFRA-8441). There's no docker hub integration with apache repos, as far as I know. So there's no way to create some official docker build currently. Your best bet is probably to hop in the inf

Re: Tika 1.14?

2016-08-12 Thread Nick Burch
On Thu, 11 Aug 2016, Bob Paulin wrote: I know it's been a little bit since we talked about 2.0. We had discussed holding off while some API changes that were under consideration. Has any progress been made on this? I think we're still trying to come up with a plan for how to allow multiple

RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-13 Thread Nick Burch
On Tue, 13 Sep 2016, John Dougrez-Lewis wrote: Surely the security vulnerability could have been fixed by disallowing "file://" variants in the URL rather than removing the feature altogether? Or were there other implementation issues relating to the fileUrl feature that meant it was best remove

Re: A new Tika App in 2.0?

2016-09-13 Thread Nick Burch
On Sun, 11 Sep 2016, Bob Paulin wrote: I'd like to propose a new Tika App for the 2.0 branch. One of the reasons we broke apart the Tika parsers into modules was due to the complexity of having to deal with all the parser dependencies and transitive dependencies. Now developers can use just t

RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-14 Thread Nick Burch
On Wed, 14 Sep 2016, Allison, Timothy B. wrote: Would it be as much of a disaster to require the user to allow the fileUrl capability on the commandline at server startup? We could add some menacing "all bets are off, we hope you know what you're doing" warning. With a special switch, and a

Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Nick Burch
On Mon, 19 Sep 2016, Bob Paulin wrote: I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? I think the big one we need to crack is allowing multiple parsers to run against a file. OCR is

Re: tika-2.x-windows - Build # 60 - Still Failing

2016-10-05 Thread Nick Burch
On Wed, 5 Oct 2016, Apache Jenkins Server wrote: The Apache Jenkins build system has built tika-2.x-windows (build #60) Check console output at https://builds.apache.org/job/tika-2.x-windows/60/ to view the results. Anyone with Jenkins-foo able to fix our Windows Jenkins builds? This failed d

Re: tika-2.x - Build # 156 - Failure

2016-10-05 Thread Nick Burch
On Wed, 5 Oct 2016, Apache Jenkins Server wrote: The Apache Jenkins build system has built tika-2.x (build #156) Check console output at https://builds.apache.org/job/tika-2.x/156/ to view the results. Another one for our Jenkins experts. Looks like it needs a bit more memory for the job, as

Re: Tika parsers 1.14-SNAPSHOT parses empty content depending to Apache POI 3.15

2016-10-12 Thread Nick Burch
On Wed, 12 Oct 2016, Simone Tripodi wrote: while upgrading the system where I've been working on, I updated Apache POI to version 3.15, then Tika (currently tika-parsers-1.7, I am testing tika-parsers-1.14-SNAPSHOT) You can't just upgrade one jar. You need to use all of the POI jars together f

Re: FW: ApacheCon Miami is coming in May.

2016-11-30 Thread Nick Burch
On Wed, 30 Nov 2016, Allison, Timothy B. wrote: ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, Florida, May 16-18, 2017 I plan to attend. Who's in? Any idea if there will be another "content" track like we had in Austin? If we want a Content track, then we'd h

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Nick Burch
On Tue, 7 Mar 2017, Thejan Wijesinghe wrote: I have already used the Tess4j API to rewrite the TesseractOCRParser class. Although it successfully extracts content from most of the file types, it fails some particular unit tests in the TesseractOCRParserTest class. I can solve that. However, I want

Re: Require guidance from where to start contributing in Apache Tika

2017-03-08 Thread Nick Burch
On Thu, 9 Mar 2017, Avtar Singh Mehra wrote: I am new to Apache Tika but have plenty of experience with other Apache Softwares like Apache Solr, Apache Lucene, Apache Velocity etc. I would like to start contributing to Apache Tika community. It would be great help if someone could guide me regard

Re: [Q] reason for tika-parser-*-bundle to be separated from corresponding parser modules in 2.x

2017-03-29 Thread Nick Burch
On Wed, 29 Mar 2017, Konstantin Gribov wrote: I've been surprised by such separation, what was the reason to separate them? I think partly history (we split in 1.x), partly how the split was done (osgi folks amongst the most keen), and partly a desire not to have non-OSGi users getting a load

Tika talk next week - help needed!

2017-05-14 Thread Nick Burch
Hi All Last year in Seville, I gave a talk on Tika entitled "Apache Tika - What’s new with 2.0?". For ApacheCon Miami next week, I've been roped into giving an updated version... https://apachecon2017.sched.com/event/9zvD/apache-tika-whats-new-with-20-nick-burch-apache-software

Re: Tika talk next week - help needed!

2017-05-16 Thread Nick Burch
On Tue, 16 May 2017, Eric Pugh wrote: It was great to read through http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika_1.pdf… Wow there is a lot in Tika. And I think that might be the one challenge with the talk structure, there is SOO much information. The pl

Tika App, Extract (-z) and Inline PDF Images?

2017-05-18 Thread Nick Burch
Hi All I've just been caught out by the Tika App's -z on a PDF not extracting the embedded images. I think we probably shouldn't tweak the default config for the other Tika App modes, but what about extract? Any reason why we shouldn't turn on the PDF Parser option "extractInlineImages" when -

RE: Tika 1.15

2017-05-22 Thread Nick Burch
On Mon, 22 May 2017, Allison, Timothy B. wrote: Last I remember, Tyler had some detailed notes...anyone remember where those are? https://wiki.apache.org/tika/ReleaseProcess Nick

Re: Tika App, Extract (-z) and Inline PDF Images?

2017-05-22 Thread Nick Burch
On 2017-05-18 17:02 (-0400), Nick Burch wrote: Hi All I've just been caught out by the Tika App's -z on a PDF not extracting the embedded images. I think we probably shouldn't tweak the default config for the other Tika App modes, but what about extract? Any reason w

Re: documenting configuration

2017-07-03 Thread Nick Burch
On Mon, 3 Jul 2017, Allison, Timothy B. wrote: To help a user configure a parameter in the PDFParser, I just started: https://wiki.apache.org/tika/TikaConfig. I realize, though, that I probably should update: https://tika.apache.org/1.15/configuring.html instead. Preferences, recommendations

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Nick Burch
On Thu, 28 Sep 2017, Giuseppe Totaro wrote: if I am not wrong, currently you cannot configure a specific ContentHandler while using tika-server. I mean that you can configure your own parser [0] but you cannot control which ContentHandler the parser leverages to extract text and metadata (e.g., y

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-29 Thread Nick Burch
On Fri, 29 Sep 2017, Giuseppe Totaro wrote: To sum up, I would like to quickly discuss the following aspects: - As you all mentioned, the HTTP headers for configuring the ContentHandler to be used are better suited for the dynamic cases. Specifically, a ContentHandler can be given through a

Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
Hi All Based on the plan on the wiki, we still have a major breaking change or two planned for Tika 2 that we haven't yet "broken". (There's also removing some deprecated stuff etc) As I unde

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
On Thu, 26 Oct 2017, Chris Mattmann wrote: Why don’t we just store N copies of the stream, and parse it twice? I'm not sure that's the challenge though? Using TikaInputStream we can buffer to a temp file if needed to re-read the input Of course that’s the ugly way, but currently the way I’ve

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch
On Thu, 26 Oct 2017, Chris Mattmann wrote: My general approach to conflicting metadata is simply to define precedence orders. For example here is one documented from OODT: https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence We can do similar things with

Re: Not-yet-broken breaking changes for Tika 2?

2018-01-02 Thread Nick Burch
ser for a file? Thanks Nick On 10/26/17, 9:43 AM, "Nick Burch" wrote: On Thu, 26 Oct 2017, Chris Mattmann wrote: > My general approach to conflicting metadata is simply to define > precedence orders. > > For example here is one documented from OODT: >

Re: relying on a non-Maven central repo?

2018-02-05 Thread Nick Burch
On Mon, 5 Feb 2018, Allison, Timothy B. wrote: Sorry for the duplication, but I wanted to check on this and didn't want it to get lost in a github comment. Fellow devs on Apache Tika, are we ok with relying on a non-Maven central repo? Nope. ASF policy is that we can only rely on maven centr

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Nick Burch
Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick Burch wrote: On Thu, 26 Oct 2017, Chris Mattmann wrote: On collision, the precedence order defines what key takes precedence and _overwrites_ the other. Overwrite
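
The overwrite-on-collision rule quoted above can be sketched in a few lines of plain Java. This is only an illustration of the general idea, not the code proposed for Tika: the class name `MetadataPrecedence`, the `merge` method, and the sample keys and values are all hypothetical, and ordinary maps stand in for Tika's `Metadata` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Tika): merges metadata by precedence order. */
public class MetadataPrecedence {

    /**
     * Merges the given maps, which are listed from lowest to highest precedence.
     * When two sources supply the same key, the value from the
     * higher-precedence source overwrites the earlier one.
     */
    public static Map<String, String> merge(List<Map<String, String>> lowestToHighest) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : lowestToHighest) {
            // putAll replaces values for keys that are already present
            merged.putAll(source);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration only
        Map<String, String> fallback = new LinkedHashMap<>();
        fallback.put("Content-Type", "text/plain");
        fallback.put("Content-Length", "9");

        Map<String, String> preferred = new LinkedHashMap<>();
        preferred.put("Content-Type", "text/html");
        preferred.put("dc:title", "Example");

        List<Map<String, String>> sources = new ArrayList<>();
        sources.add(fallback);   // lowest precedence
        sources.add(preferred);  // highest precedence

        System.out.println(merge(sources));
    }
}
```

With this ordering, keys that only the lower-precedence source provides survive the merge, while a colliding key ends up with the higher-precedence value.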

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Nick Burch
5/18, 8:37 AM, "Nick Burch" wrote: Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick Burch wrote: > On Thu, 26 Oct 2017, Chris Mattmann wrote: >> On collision, the precedence

Re: Unnecessary WARNING Logging?

2018-02-28 Thread Nick Burch
On Tue, 27 Feb 2018, lewis john mcgibbney wrote: I don't know when it was introduced, but I see the following, rather annoying WARNING messages in many logs now. IIRC we're changing those to ignore in Tika 2.x, but as we always warned for missing parsers / missing parser classes in 1.x we can't

Re: Tika 1.18?

2018-03-01 Thread Nick Burch
On Thu, 1 Mar 2018, Allison, Timothy B. wrote: There have been some important bug fixes, a few new capabilities, and the upgrading of dependencies because of CVEs. There are a bunch of mime tickets from Andreas Meier that I’d like to get into 1.18. Is there anything else that is critical? I

Re: Tika 1.18?

2018-03-02 Thread Nick Burch
On Fri, 2 Mar 2018, Luís Filipe Nassif wrote: If I make no progress on TIKA-1466 until 3/9, you can start the release process without it. But do you devs agree with the proposed change: allow overriding of glob patterns in custom-mimetypes.xml? What happens if you have two different custom file

RE: Tika 1.18?

2018-03-12 Thread Nick Burch
On Mon, 12 Mar 2018, Allison, Timothy B. wrote: Anyone have anything they'd like to get in before I run the regression tests? I can certainly put it off a few days. I've made some progress on the metadata-only fallback/merge multiple parser work from https://wiki.apache.org/tika/CompositePars

TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-03-14 Thread Nick Burch
Hi All As promised, I've finally had a go to try and implement my ideas for TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion / breaking 2.x parser change My work so far is in this github branch, and is ready for review! https://github.com/apache/tika/tree/multiple-parsers

Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Nick Burch
On Wed, 28 Mar 2018, Allison, Timothy B. wrote: With the new mime patterns, we've gotten quite a few changes of message/news being identified as message/rfc822. An example is: http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5

Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-04-08 Thread Nick Burch
finitely try this week as well. Thank you! Sincerely, Chris On 3/18/18, 2:47 PM, "David Meikle" wrote: Nice one Nick! Will take a look this week. Cheers, Dave On 14 March 2018 at 17:38, Nick Burch wrote: > Hi All > > As promised, I've finall

RE: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-04-09 Thread Nick Burch
On Tue, 10 Apr 2018, Allison, Timothy B. wrote: It looks like you merged to master, which, I think is the base for 2.0.0-SNAPSHOT. I've been treating branch_1x as the master for 1.x.[1] Ah, I'd thought that the 2.x branch (with the tika-parser-bundles / tika-parser-modules folders) was the on

Re: Build with Java 10, but target 8 in Tika 2.0?

2018-06-19 Thread Nick Burch
On 19/06/18 20:46, Tim Allison wrote: What would you think of requiring Java 10 to build Tika 2.0 but still setting 8 as the target? This would allow us to bake modularity in now. Given that I haven't actually tried modularizing/jigsawizing Tika yet, this could be a complete disaster, of course.
