[jira] [Commented] (TIKA-1334) Add presentation layer for results of each run
[ https://issues.apache.org/jira/browse/TIKA-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994941#comment-15994941 ]

Tyler Palsulich commented on TIKA-1334:
---------------------------------------

The format should probably be in the form:

{noformat}
[
  { "mime-type": "something", "count": 1234, "version": "a" },
  { "mime-type": "something", "count": 4321, "version": "b" },
  ...
]
{noformat}

> Add presentation layer for results of each run
> ----------------------------------------------
>
>                 Key: TIKA-1334
>                 URL: https://issues.apache.org/jira/browse/TIKA-1334
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>         Attachments: static_stats.zip
>
>
> If I'm doing this, it'll probably be vintage mid-90s html. If someone with
> some .js kung-fu wants to take this, please do.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
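For illustration only, the layout proposed in the comment above could be produced along these lines. This is a minimal sketch, not Tika code: `RunStats`, `Entry`, and `toJson` are hypothetical names, and the serialization is hand-rolled purely to keep the example free of dependencies (a real implementation would more likely use a JSON library).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: renders per-run MIME type counts
// in the array-of-objects layout suggested in the comment.
final class RunStats {

    /** One row: a MIME type, how often it was seen, and the label of the run ("version"). */
    static final class Entry {
        final String mimeType;
        final long count;
        final String version;

        Entry(String mimeType, long count, String version) {
            this.mimeType = mimeType;
            this.count = count;
            this.version = version;
        }
    }

    /** Minimal JSON string quoting: escapes only backslashes and double quotes. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Serializes the rows as a JSON array with one object per line. */
    static String toJson(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("[\n");
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            sb.append("  { \"mime-type\": ").append(quote(e.mimeType))
              .append(", \"count\": ").append(e.count)
              .append(", \"version\": ").append(quote(e.version))
              .append(" }");
            if (i < entries.size() - 1) {
                sb.append(",");
            }
            sb.append("\n");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        List<Entry> rows = new ArrayList<>();
        // Sample values are made up for the demonstration.
        rows.add(new Entry("application/pdf", 1234, "a"));
        rows.add(new Entry("application/pdf", 4321, "b"));
        System.out.println(toJson(rows));
    }
}
```

Keeping one object per (MIME type, version) pair, as suggested, means a presentation layer can group by either field without knowing in advance how many runs are being compared.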
Re: Squashing GitHub pull requests while merging
A contributor should be able to squash the commits in the pull request before we merge into Tika. So, we don't need to mess up Tika's history. Right?

Tyler

On May 6, 2016 8:41 PM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov> wrote:

> Squashing messes up history and atm requires infra intervention, so I would
> suggest we stay away from it for now
>
> Sent from my iPhone
>
> > On May 6, 2016, at 2:20 PM, Ken Krugler wrote:
> >
> > I was perusing https://wiki.apache.org/tika/UsingGit, and noticed that it doesn’t talk
> > about squashing a pull request’s commits while merging.
> >
> > This is described at https://mahout.apache.org/developers/github.html
> >
> > Isn't this something we’d want to do as well?
> >
> > Thanks,
> >
> > -- Ken
> >
> > --
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
Re: JIRA issue?
Hi Ben,

Sorry for the inconvenience. The infrastructure team had to disable the create and comment features of JIRA for many projects to mitigate spam. Hopefully everything will be back up and running again soon.

Thanks for emailing.

Tyler

> Hi,
>
> I'd like to create an issue on the JIRA. When I visit
> https://issues.apache.org/jira/browse/TIKA/ and hit Create I don't see Tika
> as an option. I can only create issues for Zookeeper and other projects
>
> Thanks,
> Ben
>
> --
> about.me/benmccann
Re: [VOTE] Apache Tika 1.12 Release Candidate #1
A bit late to the party, but +1 from me. Tyler On Thu, Feb 4, 2016 at 1:44 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Chris, > +1 to release this release candidate > Thanks > Lewis > > On Tue, Feb 2, 2016 at 4:24 PM, Lewis John Mcgibbney < > lewis.mcgibb...@gmail.com> wrote: > > > Hi Chris, > > > > Signatures all good. Verified using the scripts apachestuff. > > mvn install and all tests pass fine on MacOSX 10.9.5 > > Ran DRAT from master branch with following output > > > > Notes Binaries Archives Standards Apache Generated Unknown > > 0 2 0 868 836 0 32 > > Issue filed in Jira to address and resolve the unknown's > > > > https://issues.apache.org/jira/browse/TIKA-1848 > > > > On Thu, Jan 28, 2016 at 12:01 AM,> wrote: > > > >> > >> A first candidate for the Tika 1.12 release is available at: > >> > >> https://dist.apache.org/repos/dist/dev/tika/ > >> > >> The release candidate is a zip archive of the sources in: > >> > >> > https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db24 > >> 27f9e84bc4ff31e569ae661c > >> < > https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c > > > >> > >> > >> The SHA1 checksum of the archive is: > >> 30e64645af643959841ac3bb3c41f7e64eba7e5f > >> > >> In addition, a staged maven repository is available here: > >> > >> https://repository.apache.org/content/repositories/orgapachetika-1015/ > >> > >> > >> Please vote on releasing this package as Apache Tika 1.12. > >> The vote is open for the next 72 hours and passes if a majority of at > >> least three +1 Tika PMC votes are cast. > >> > >> [ ] +1 Release this package as Apache Tika 1.12 > >> [ ] -1 Do not release this package because… > >> > >> Cheers, > >> Chris > >> > >> P.S. Of course here is my +1. > >> > >> > > > -- > *Lewis* >
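The vote message quoted above publishes a SHA1 checksum for the release archive. As a hedged sketch of how a voter might check a downloaded file against such a value using only the JDK, the class below hashes the bytes and compares hex strings. `Sha1Check` and its method names are hypothetical and are not part of any Tika release tooling; the archive path and expected checksum are taken from the command line rather than assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for checking a release archive against a published SHA1 value.
final class Sha1Check {

    /** Returns the SHA-1 digest of the given bytes as lowercase hex. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the computed digest with the published one, ignoring case and surrounding whitespace. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Sha1Check <archive> <expected-sha1>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for very large archives.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(data, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

A checksum match only shows the download is intact; it says nothing about who produced the archive, which is what the signature check mentioned in the thread is for.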
Re: [VOTE] Moving SCM to Git
Hi, Just reiterating my +1 for the move. A huge benefit in my eyes is a reduced barrier to entry for new developers and contributors. Tyler On Jan 2, 2016 4:34 PM, "Mattmann, Chris A (3980)" < chris.a.mattm...@jpl.nasa.gov> wrote: > One final note - this isn't a vote to make GitHub the canonical repo. In > the future if Whimsy goes well I'd like to explore that but here I am > simply proposing to use the ASF writeable Git repos (which happen to be > mirrored to GH). > > Cheers, > Chris > > Sent from my iPhone > > > On Jan 2, 2016, at 4:31 PM, Mattmann, Chris A (3980) < > chris.a.mattm...@jpl.nasa.gov> wrote: > > > > Hey Ken, > > > > Projects have been using writeable git repos at the ASF since 2009-2010. > The recent conversation at the foundation level was - should we allow > GitHub as a canonical external repo and more broadly - is this possible in > general? The Whimsy project is currently undergoing that experiment and > it's going well but nothing official to report yet. > > > > Beyond that - projects can release from and use writeable Git repos. > Some projects were getting around history by squashing commits ahead of the > repo and getting around infra's checks on master (aka trunk) by using > different main branch names but we're not in that boat. > > > > Cheers, > > Chris > > > > > > Sent from my iPhone > > > >> On Jan 2, 2016, at 3:47 PM, Ken Krugler <kkrugler_li...@transpac.com> > wrote: > >> > >> Hi Chris, > >> > >> I'd be +1, but I don't have the essence of the "Re: git (Was: > ASF/GitHub Findings of Fact / Statements of Principles)" thread on the > Apache members list clearly in my mind. > >> > >> Specifically, while that thread was spinning merrily away, there were > concerns about immutability when using git. > >> > >> E.g. one comment was... > >> > >>> releases must correspond to an immutable tag in a repository on ASF > hardware. 
> >>> > >>> "Canonical" is needed for releases, and for IP provenance, so I'd > augment the above with a second requirement: for each release tag, we must > be able to establish the provenance of all files referenced by that tag. > >>> > >>> I believe that is the essence of the Foundation's requirements for > version control. Both can be satisfied via svn or git. Git may require > external sources to satisfy one or both of those requirements. svn > inherently has the first nailed, and is much easier for provenance (there > may be edge cases I'm missing offhand, but we know the ICLA/grant > associated with each change leading up to the tagged release). > >> > >> Did it wind up as "projects can experiment with using git for official > releases"? > >> > >> Thanks, > >> > >> -- Ken > >> > >>> From: Mattmann, Chris A (3980) > >>> Sent: January 1, 2016 8:30:16pm PST > >>> To: dev@tika.apache.org > >>> Subject: [VOTE] Moving SCM to Git > >>> > >>> Hi Everyone, > >>> > >>> DISCUSS thread here: http://s.apache.org/wVE > >>> > >>> Time to officially VOTE on moving Tika to Git. I’ve made a wiki > >>> page for our SCM explaining how to use Git at Apache, and how to > >>> use it with Github, and how to use it even in a traditional SVN > >>> sense. The page is here: > >>> > >>> https://wiki.apache.org/tika/UsingGit > >>> > >>> > >>> I’ve also linked it from the main wiki page. 
I took the liberty > >>> of updating the only other 2 pages on the wiki that referenced > >>> SCM with (pending) Git instructions as well: > >>> > >>> https://wiki.apache.org/tika/DeveloperResources > >>> https://wiki.apache.org/tika/ReleaseProcess > >>> > >>> From the DISCUSS thread it would seem the following members of > >>> the community support this move: > >>> > >>> Chris Mattmann > >>> Tyler Palsulich > >>> Bob Paulin > >>> Hong-Thai Nguyen > >>> > >>> Oleg Tikhonov > >>> David Meikle > >>> > >>> > >>> Given the above I’m going to count the above people as +1 in > >>> this VOTE if I don’t hear otherwise. > >>> > >>> Nick Burch said he would be more supportive if there was a guide, > >>> so I made one and updated the other wiki docs
RE: NER Parser tests behind proxy?
Apologies if i missed a discussion about this earlier, but should we be downloading a model by default? Tyler On Nov 23, 2015 8:03 AM, "Allison, Timothy B."wrote: > The problem comes down to: ModelGetter.groovy which is trying to grab: > ${basedir}/src/test/resources/org/apache/tika/parser/ner/opennlp/ner-person.bin > > If we could build a small model (and I mean really small) and package it > with Tika, we wouldn't have to worry about http connectivity outside of the > usual maven stuff. > > -Original Message- > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] > Sent: Monday, November 23, 2015 10:52 AM > To: dev@tika.apache.org > Cc: ThammeGowda Narayanaswamy > Subject: Re: NER Parser tests behind proxy? > > Hey Tim, > > I’m not seeing these of course b/c I’m not behind a proxy. Thamme, any > ideas? > > Cheers, > Chris > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) NASA Jet > Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department University of > Southern California, Los Angeles, CA 90089 USA > ++ > > > > > > -Original Message- > From: "Allison, Timothy B." > Reply-To: "dev@tika.apache.org" > Date: Thursday, November 19, 2015 at 5:36 PM > To: "dev@tika.apache.org" > Subject: NER Parser tests behind proxy? > > >My proxy is configured for git/maven/etc, but how do I configure it > >within the test so that I don't get this? > > > >GET : http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin -> > >tika-parsers\src\test\resources\org\apache\tika\parser\ner\opennlp\ner- > >per > >son.bin > >[INFO] > >--- > >- > >[INFO] Reactor Summary: > >[INFO] > >[INFO] Apache Tika parent SUCCESS > >[3.264s] [INFO] Apache Tika core .. > >SUCCESS [44.470s] [INFO] Apache Tika parsers > >... 
FAILURE [1:56.462s] [INFO] Apache Tika > >XMP ... SKIPPED [INFO] Apache Tika > >serialization . SKIPPED [INFO] Apache Tika > >batch . SKIPPED [INFO] Apache Tika > >application ... SKIPPED [INFO] Apache Tika OSGi > >bundle ... SKIPPED [INFO] Apache Tika translate > >. SKIPPED [INFO] Apache Tika server > > SKIPPED [INFO] Apache Tika examples > >.. SKIPPED [INFO] Apache Tika Java-7 > >Components . SKIPPED [INFO] Apache Tika > >... SKIPPED [INFO] > >--- > >- > >[INFO] BUILD FAILURE > >[INFO] > >--- > >- > >[INFO] Total time: 2:45.245s > >[INFO] Finished at: Thu Nov 19 20:29:34 EST 2015 [INFO] Final Memory: > >52M/482M [INFO] > >--- > >- > >[ERROR] Failed to execute goal > >org.codehaus.groovy.maven:gmaven-plugin:1.0:execute (testSetup) on > >project tika-parsers: java.net.ConnectException: Connection refused: > >connect -> [Help 1] > >org.apache.maven.lifecycle.LifecycleExecutionException: Failed to > >execute goal org.codehaus.groovy.maven:gmaven-plugin:1.0:execute > >(testSetup) on project tika-parsers: java.net.ConnectException: > Connection refused: > >connect > > at > >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.j > >ava > >:217) > > at > >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.j > >ava > >:153) > > at > >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.j > >ava > >:145) > > at > >org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > >(Li > >fecycleModuleBuilder.java:84) > > at > >org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > >(Li > >fecycleModuleBuilder.java:59) > > at > >org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuil > >d(L > >ifecycleStarter.java:183) > > at > >org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleS > >tar > >ter.java:161) > > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320) > > at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156) > > at
Re: [DISCUSS] Moving to Git
+1 from me. Tyler On Nov 18, 2015 6:46 AM, "Mattmann, Chris A (3980)" < chris.a.mattm...@jpl.nasa.gov> wrote: > Hey Team, > > I propose we move to writeable git repos for Tika for our repository. > I mostly interact with Git & Github nowadays even with Tika using the > mirroring and PR interaction support. > > Thoughts? > > Cheers, > Chris > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ > > > >
Re: Named Entity Recognition support in trunk
That's awesome! Great work. Have we tried running any benchmarks? Tyler On Nov 18, 2015 6:42 AM, "Mattmann, Chris A (3980)" < chris.a.mattm...@jpl.nasa.gov> wrote: > Hey Folks, > > With the commit of TIKA-1787/GH-61 in trunk we now have full integration > of Named Entity Recognition with Stanford NER/NLP and Apache OpenNLP. > Will also look to see if we can integrate NLTK too. This is a *big > deal* since NER is something we’ve always wanted to pull into Tika. > > Woot! > > Cheers, > Chris > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ > > > >
Re: [VOTE] Apache Tika 1.11 Release Candidate #1
+1 from me -- builds, tests pass, sanity check files parse, and sums look good. But, I get a warning that the signature is not certified with a trusted signature. Tyler On Wed, Oct 21, 2015 at 6:43 AM Allison, Timothy B.wrote: > +0 (some regressions in ppt content) > > I just finished the batch comparison run on ~1.8 million files in our > govdocs1 and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1. As a > caveat, the eval code is still in development and there may be bugs in the > reports. > > Results are here: > https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_10_vs_1_11-rc1.zip > > Key reports: > contents/content_diffs.csv (file had one corrupt row when viewing in > Excel...manually deleted offending content) > exceptions/newExceptionsInBByMimeTypeByStackTrace.csv (small handful) > exceptions/fixedExceptionsInBByMimeType.csv (none!) > mimes/mime_diffs_A_to_B.csv > > On the positive side: > From "mime_diffs_A_to_B.csv", it looks like we are catching more pdfs as > pdfs (that text/xhtml) than we were...great! We're identifying more files > as images (jpeg, pict) than as xhtml, and, from a quick look, this appears > to be an improvement. We have at least 9 new x-hwp-v5 (great!). > > On the negative side: > > 1) We have a few regressions in ppt exceptions (six of the same aioobe). > 2) We have regressions in ppt content (it looks like we're not adding a > new line/word break where we need to). The regressions are small per file, > but they affect ~220 ppts out of ~1500 (~15%). > > Other than the regressions in ppt content, I'd be +1, but I don't think > this is severe enough to warrant a re-spin. Happy to look into a fix, > though, if we want a re-spin...and even if we don't, I'll start looking > into this asap. 
> > -Original Message- > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] > Sent: Monday, October 19, 2015 10:23 AM > To: dev@tika.apache.org > Cc: u...@tika.apache.org > Subject: [VOTE] Apache Tika 1.11 Release Candidate #1 > > Hi Folks, > > A first candidate for the Tika 1.11 release is available at: > > https://dist.apache.org/repos/dist/dev/tika/ > > The release candidate is a zip archive of the sources in: > http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/ > > The SHA1 checksum of the archive is > d0dde7b3a4f1a2fb6ccd741552ea180dddab630a > > In addition, a staged maven repository is available here: > > https://repository.apache.org/content/repositories/orgapachetika-1014/ > > > Please vote on releasing this package as Apache Tika 1.11. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.11 [ ] -1 Do not release this > package because… > > Cheers, > Chris > > P.S. Of course here is my +1. > > > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) NASA Jet > Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department University of > Southern California, Los Angeles, CA 90089 USA > ++ > > > >
Re: Tika Tesseract configuration
Hi Aditya,

The wiki (https://wiki.apache.org/tika/TikaOCR) also has some good information about setting up and configuring Tesseract. Let me know if you have any questions.

Thanks,
Tyler

On Wed, Oct 14, 2015, 6:59 AM Aditya Dhulipala wrote:

> Hi Tika devs,
>
> Scratch that previous email.
>
> I found the TesseractOCRConfig .properties file
>
> I was looking for it in the wrong location.
>
> Sorry for the confusion.
>
> Thanks!
> --
> Aditya
>
> adi
>
> On Wed, Oct 14, 2015 at 9:52 AM, Aditya Dhulipala wrote:
>
>> Tika Devs!
>>
>> I'm trying to run Tika with Tesseract.
>> I finished installing tesseract and confirmed that it's working correctly.
>>
>> I ran an image against Tika server expecting that tesseractOCR would be
>> enabled by default.
>>
>> But I noticed that the extracted metadata didn't have OCR output.
>>
>> Is this because tesseract is disabled by default?
>>
>> Should there be a TesseractConfig.properties file somewhere? (I read
>> about this in the TesseractOCRParser source. But I didn't find this file
>> anywhere)
>>
>> Thanks!
>> --
>> Aditya
Re: svn commit: r1706077 - /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
Hi Chris,

It looks like these two lines are equivalent (assert not null versus assert true not null). Right?

Tyler

On Wed, Sep 30, 2015, 9:45 AM <mattm...@apache.org> wrote:

> Author: mattmann
> Date: Wed Sep 30 16:45:32 2015
> New Revision: 1706077
>
> URL: http://svn.apache.org/viewvc?rev=1706077&view=rev
> Log:
> - Files isn't always present (just found test case on older version of GDAL)
>
> Modified:
>     tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
>
> Modified: tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
> URL: http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java?rev=1706077&r1=1706076&r2=1706077&view=diff
> ==============================================================================
> --- tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java (original)
> +++ tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java Wed Sep 30 16:45:32 2015
> @@ -69,7 +69,7 @@ public class TestGDALParser extends Tika
>          assertNotNull(met);
>          assertNotNull(met.get("Driver"));
>          assertEquals(expectedDriver, met.get("Driver"));
> -        assertNotNull(met.get("Files"));
> +        assumeTrue(met.get("Files") != null);
>          assertNotNull(met.get("Coordinate System"));
>          assertEquals(expectedCoordinateSystem, met.get("Coordinate System"));
>          assertNotNull(met.get("Size"));
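The one-line change in the commit swaps an assertion for an assumption, and the two are easy to misread as the same thing. In JUnit 4, a failed assertion fails the test, whereas a failed assumption makes the runner treat the test as skipped. The sketch below imitates that behaviour without depending on JUnit: `assertNotNull`, `assumeTrue`, `AssumptionViolated`, and `run` are local stand-ins written for this illustration, and the metadata values are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Dependency-free imitation of JUnit 4 semantics; all helpers here are local stand-ins.
final class AssumeVsAssert {

    /** Stand-in for the exception JUnit raises when an assumption does not hold. */
    static final class AssumptionViolated extends RuntimeException {
        AssumptionViolated() {
            super("assumption not met");
        }
    }

    enum Outcome { PASSED, FAILED, SKIPPED }

    /** Fails the test when the value is missing. */
    static void assertNotNull(Object value) {
        if (value == null) {
            throw new AssertionError("expected a non-null value");
        }
    }

    /** Abandons the test, without failing it, when the precondition is false. */
    static void assumeTrue(boolean condition) {
        if (!condition) {
            throw new AssumptionViolated();
        }
    }

    /** Tiny runner: maps the exception type to an outcome, roughly as a JUnit runner would. */
    static Outcome run(Runnable test) {
        try {
            test.run();
            return Outcome.PASSED;
        } catch (AssumptionViolated e) {
            return Outcome.SKIPPED;
        } catch (AssertionError e) {
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        // Simulates metadata from a GDAL build that does not report "Files".
        Map<String, String> met = new HashMap<>();
        met.put("Driver", "netCDF"); // made-up value

        System.out.println(run(() -> assertNotNull(met.get("Files"))));      // FAILED
        System.out.println(run(() -> assumeTrue(met.get("Files") != null))); // SKIPPED
    }
}
```

One consequence worth noting: once the assumption fails, the remaining assertions in the same test method are not evaluated, so the rest of that test is skipped along with the "Files" check.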
Re: svn commit: r1706077 - /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
Hi Chris, Ah, got it. I misread assume as assert. Doh! Tyler On Thu, Oct 1, 2015, 6:45 AM Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hey Tyler, > > assertNotNull returns void whereas I needed something testable for > assumeTrue (since apparently gdal doesn’t always print out the > Files output on all systems and versions which I found out yesterday). > > Make sense? > > Cheers, > Chris > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++++++ > > > > > > -Original Message- > From: Tyler Palsulich <tpalsul...@gmail.com> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org> > Date: Thursday, October 1, 2015 at 6:39 AM > To: "dev@tika.apache.org" <dev@tika.apache.org>, "comm...@tika.apache.org" > <comm...@tika.apache.org> > Subject: Re: svn commit: r1706077 - > /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDAL > Parser.java > > >Hi Chris, > > > >It looks like these two lines are equivalent (assert not null versus > >assert > >true not null). Right? 
> > > >Tyler > > > >On Wed, Sep 30, 2015, 9:45 AM <mattm...@apache.org> wrote: > > > >> Author: mattmann > >> Date: Wed Sep 30 16:45:32 2015 > >> New Revision: 1706077 > >> > >> URL: http://svn.apache.org/viewvc?rev=1706077=rev > >> Log: > >> - Files isn't always present (just found test case on older version of > >> GDAL) > >> > >> Modified: > >> > >> > >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA > >>LParser.java > >> > >> Modified: > >> > >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA > >>LParser.java > >> URL: > >> > >> > http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/ap > >>ache/tika/parser/gdal/TestGDALParser.java?rev=1706077=1706076=17060 > >>77=diff > >> > >> > >>= > >>= > >> --- > >> > >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA > >>LParser.java > >> (original) > >> +++ > >> > >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA > >>LParser.java > >> Wed Sep 30 16:45:32 2015 > >> @@ -69,7 +69,7 @@ public class TestGDALParser extends Tika > >> assertNotNull(met); > >> assertNotNull(met.get("Driver")); > >> assertEquals(expectedDriver, met.get("Driver")); > >> -assertNotNull(met.get("Files")); > >> +assumeTrue(met.get("Files") != null); > >> assertNotNull(met.get("Coordinate System")); > >> assertEquals(expectedCoordinateSystem, met.get("Coordinate > >> System")); > >> assertNotNull(met.get("Size")); > >> > >> > >> > >
[jira] [Commented] (TIKA-1743) NetworkParser can create Unbounded Number of Threads
[ https://issues.apache.org/jira/browse/TIKA-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903878#comment-14903878 ]

Tyler Palsulich commented on TIKA-1743:
---------------------------------------

[Copied from the list]

This sounds like a great idea! We should make the size of the pool configurable with TikaConfig.

> NetworkParser can create Unbounded Number of Threads
> ----------------------------------------------------
>
>                 Key: TIKA-1743
>                 URL: https://issues.apache.org/jira/browse/TIKA-1743
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Bob Paulin
>
> The current NetworkParser class creates new instances of the Thread class
> with each call to parse. This could lead to an unbounded number of threads
> created by this class. I'd suggest replacing this logic with a
> ThreadPoolExecutor and a configurable number of threads. This will help
> prevent creating an unbounded number of threads and allow the user to tune
> performance to the hardware.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Re: [jira] [Created] (TIKA-1743) NetworkParser can create Unbounded Number of Threads
This sounds like a great idea! We should make the size of the pool configurable with TikaConfig.

On Tue, Sep 22, 2015, 3:04 PM Bob Paulin (JIRA) wrote:

> Bob Paulin created TIKA-1743:
> --------------------------------
>
>              Summary: NetworkParser can create Unbounded Number of Threads
>                  Key: TIKA-1743
>                  URL: https://issues.apache.org/jira/browse/TIKA-1743
>              Project: Tika
>           Issue Type: Bug
>             Reporter: Bob Paulin
>
>
> The current NetworkParser class creates new instances of the Thread class
> with each call to parse. This could lead to an unbounded number of threads
> created by this class. I'd suggest replacing this logic with a
> ThreadPoolExecutor and a configurable number of threads. This will help
> prevent creating an unbounded number of threads and allow the user to tune
> performance to the hardware.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
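To make the suggestion in the issue concrete, here is a small, hedged sketch of the bounded-pool idea using the JDK's `Executors.newFixedThreadPool` (which is backed by a `ThreadPoolExecutor`). It is not the NetworkParser code and not a proposed patch: `BoundedParsePool`, `peakConcurrency`, and the `poolSize` parameter are hypothetical, and the sleeping task merely stands in for a parse call. How the size would actually be read from TikaConfig is left open, as it is in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of bounding worker threads with a fixed-size pool.
final class BoundedParsePool {

    /**
     * Submits {@code tasks} simulated parse jobs to a pool of {@code poolSize}
     * threads and returns the highest number of jobs observed running at once.
     */
    static int peakConcurrency(int tasks, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(10); // stand-in for the real parsing work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    running.decrementAndGet();
                }));
            }
            // Wait for every job; extra jobs queue up instead of spawning new threads.
            for (Future<?> f : futures) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // With 4 worker threads, 20 submitted jobs never run more than 4 at a time.
        System.out.println("peak concurrency: " + peakConcurrency(20, 4));
    }
}
```

The contrast with creating a new `Thread` per call is that the number of live worker threads is capped by the pool size no matter how many parse requests arrive; the cost is that requests beyond the cap wait in the executor's queue.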
Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member
Welcome!

On Wed, Sep 16, 2015, 6:37 PM Allison, Timothy B. wrote:

> Welcome! Great to have you on board!
>
> Cheers,
>
> Tim
>
> -----Original Message-----
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Wednesday, September 16, 2015 9:16 PM
> To: dev@tika.apache.org
> Subject: Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member
>
> Hi Tika Community,
>
> I'm an independent developer [1], speaker [2], podcaster [3], and Java User
> Group [4] leader from Chicago. I specialize in modular development with
> OSGi and commit code to Apache Felix. For fun I coach football and
> robotics. I have 3 kids and 1 very understanding wife. Excited to be a
> part of the Tika Community!
>
> - Bob Paulin
> [1] https://github.com/bobpaulin
> [2] http://www.slideshare.net/bobpaulin
> [3] http://www.javaoffheap.com/
> [4] http://www.meetup.com/ChicagoJUG/
>
> On 9/16/2015 7:05 PM, David Meikle wrote:
> > Hello All,
> >
> > Please welcome Bob Paulin as he joins us as the latest Tika committer
> > and PMC Member.
> >
> > Bob, please feel free to say a bit about yourself as an introduction to
> > the group.
> >
> > Welcome aboard,
> > Dave
[jira] [Commented] (TIKA-1672) Integrate tika-java7 component
[ https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14722705#comment-14722705 ]

Tyler Palsulich commented on TIKA-1672:
---------------------------------------

Hmm. Maybe we should rename the module? Right now, it doesn't make sense to have a java7 component when the entire project depends on Java 7.

> Integrate tika-java7 component
> ------------------------------
>
>                 Key: TIKA-1672
>                 URL: https://issues.apache.org/jira/browse/TIKA-1672
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tyler Palsulich
>             Fix For: 1.11
>
> Code requiring Java 7 doesn't need to be in a separate module now that
> TIKA-1536 (upgrade to Java 7) is done.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Re: [ANNOUNCE] Apache Tika 1.10 release
Thanks, Dave!

On Sat, Aug 8, 2015, 7:01 AM David Meikle <dmei...@apache.org> wrote:

> The Apache Tika project is pleased to announce the release of Apache Tika
> 1.10. The release contents have been pushed out to the main Apache release
> site and to the Central sync, so the releases should be available as soon
> as the mirrors get the syncs.
>
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
>
> Apache Tika 1.10 contains a number of improvements and bug fixes. Details
> can be found in the changes file:
> http://www.apache.org/dist/tika/CHANGES-1.10.txt
>
> Apache Tika is available in source form from the following download page:
> http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip
>
> Apache Tika is also available in binary form or for use using Maven 2 from
> the Central Repository:
> http://repo1.maven.org/maven2/org/apache/tika/
>
> In the initial 48 hours, the release may not be available on all mirrors.
> When downloading from a mirror site, please remember to verify the
> downloads using signatures found on the Apache site:
> https://people.apache.org/keys/group/tika.asc
>
> For more information on Apache Tika, visit the project home page:
> http://tika.apache.org/
>
> -- David Meikle, on behalf of the Apache Tika community
Re: [VOTE] Apache Tika 1.10 Release Candidate #1
Everything looks good to me! +1

Thanks, Dave!

Tyler

On Tue, Aug 4, 2015, 6:48 AM Ken Krugler <kkrugler_li...@transpac.com> wrote:

> +1
>
> Built on Mac, tested with Bixo.
>
> -- Ken
>
> > From: David Meikle
> > Sent: August 2, 2015 12:15:24am PDT
> > To: dev@tika.apache.org; u...@tika.apache.org
> > Subject: [VOTE] Apache Tika 1.10 Release Candidate #1
> >
> > Hi Everyone,
> >
> > A candidate for the Apache Tika 1.10 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> > http://svn.apache.org/repos/asf/tika/tags/1.10-rc1/
> >
> > The SHA1 checksum of the archive is
> > b1573adcb194e2c09b77eccc3b1edd16bd4ac67d.
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachetika-1013
> >
> > Please vote on releasing this package as Apache Tika 1.10.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.10
> > [ ] -1 Do not release this package because...
> >
> > Here is my +1!
> >
> > Cheers,
> > Dave
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623246#comment-14623246 ]

Tyler Palsulich commented on TIKA-1362:
---------------------------------------

If you have a pressing need for better configuration abilities for the Google Translator, feel free to open up a new issue and upload a patch! :) We'd be happy to help you get started. Check out the [contributing page|https://tika.apache.org/contribute.html] for some general information.

> Add GoogleTranslate implementation of Translation API
> -----------------------------------------------------
>
>                 Key: TIKA-1362
>                 URL: https://issues.apache.org/jira/browse/TIKA-1362
>             Project: Tika
>          Issue Type: Bug
>          Components: translation
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.6
>
> Add an implementation of the Translation API that uses the Google
> Translate v2 API and Apache CXF:
> https://www.googleapis.com/language/translate/v2

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (TIKA-1672) Integrate tika-java7 component
Tyler Palsulich created TIKA-1672:
-------------------------------------

             Summary: Integrate tika-java7 component
                 Key: TIKA-1672
                 URL: https://issues.apache.org/jira/browse/TIKA-1672
             Project: Tika
          Issue Type: Improvement
            Reporter: Tyler Palsulich
             Fix For: 1.10

Code requiring Java 7 doesn't need to be in a separate module now that TIKA-1536 (upgrade to Java 7) is done.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (TIKA-1536) Upgrade compiler definition in pom's to Java 7
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-1536.
-----------------------------------
    Resolution: Fixed

Upgraded in r1688779. Thanks, all. Will open a new issue regarding integrating tika-java7.

> Upgrade compiler definition in pom's to Java 7
> ----------------------------------------------
>
>                 Key: TIKA-1536
>                 URL: https://issues.apache.org/jira/browse/TIKA-1536
>             Project: Tika
>          Issue Type: Improvement
>          Components: packaging
>    Affects Versions: 1.7
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.10
>
>         Attachments: TIKA-1536.patch
>
> Since we committed TIKA-1423 it would appear through
> [mailing list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html]
> commentary that there is a willingness to drop support for Java 1.6 in
> favour of >= Java 1.7. This issue simply addresses this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605772#comment-14605772 ] Tyler Palsulich commented on TIKA-1536: --- Yep, see http://apache.markmail.org/thread/7oubuh4hp6rdlbch. Upgrade compiler definition in pom's to Java 7 -- Key: TIKA-1536 URL: https://issues.apache.org/jira/browse/TIKA-1536 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.7 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: TIKA-1536.patch Since we committed TIKA-1423 it would appear through [mailing list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] commentary that there is a willingness to drop support for Java 1.6 in favour of >= Java 1.7. This issue simply addresses this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1481) TikaJAXRS get metadata calls give different results
[ https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1481. - Resolution: Not A Problem Hi [~arbuzovada]. Sorry for the trouble! Did you make sure to respond to the automated response, confirming your subscription? I'm closing this issue as not a problem. But, don't hesitate to let us know if you have any more issues. TikaJAXRS get metadata calls give different results --- Key: TIKA-1481 URL: https://issues.apache.org/jira/browse/TIKA-1481 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.6 Environment: Windows 8, JDK 1.8 Reporter: Darya Arbuzova Priority: Minor Attachments: sample.csv Hello! I'm trying to use Tika in server mode. I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/. I have tried to get file metadata in 2 different ways (as explained here: http://wiki.apache.org/tika/TikaJAXRS ):
{{ curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
{{Content-Encoding,windows-1252}}
{{Content-Type,text/plain; charset=windows-1252}}
and
{{ curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
{{Content-Encoding,ISO-8859-1}}
{{Content-Type,text/plain; charset=ISO-8859-1}}
How come they give different encoding results if I call the same {{http://localhost:9998/meta}}? What other differences could appear, and which is the preferable way to get metadata? Many thanks! Best regards, Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
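One possible contributor to the differing results, offered here as an assumption rather than a confirmed diagnosis of this issue: curl's `-d @file` strips carriage returns and newlines from the file before sending it, while `-T` uploads the bytes unchanged (`--data-binary @file` also preserves them). The server would then see different bytes for the same file, which could plausibly shift charset detection. A minimal sketch of that difference; the class and method names are illustrative only:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CurlDataSketch {
    /** Mimics what curl's -d @file does to a payload: drop CR and LF bytes. */
    static byte[] stripNewlines(byte[] raw) {
        byte[] out = new byte[raw.length];
        int n = 0;
        for (byte b : raw) {
            if (b != '\r' && b != '\n') {
                out[n++] = b;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // A tiny CSV with Windows line endings, standing in for sample.csv.
        byte[] raw = "a,b\r\n1,2\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] sent = stripNewlines(raw);
        System.out.println("curl -T would send " + raw.length + " bytes");
        System.out.println("curl -d would send " + sent.length + " bytes");
    }
}
```

If this is the cause, repeating the second call with `--data-binary @sample.csv` should bring the two results in line.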
[jira] [Resolved] (TIKA-756) XMP output from Tika CLI
[ https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-756. -- Resolution: Fixed Marking this as Fixed, since there are a few more references to tika-parser components (see TikaToXMP). Feel free to reopen if you disagree. XMP output from Tika CLI Key: TIKA-756 URL: https://issues.apache.org/jira/browse/TIKA-756 Project: Tika Issue Type: New Feature Components: cli, metadata Reporter: Jukka Zitting Assignee: Jörg Ehrlich Labels: metadata, xmp Attachments: tika-xmp.patch, tika-xmp_styleAndHeader.patch It would be great if the Tika CLI could output metadata also in the XMP format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1429) Unable to View a 9mb file even after setting a large Heap Size of 3GB while TIKA GUI
[ https://issues.apache.org/jira/browse/TIKA-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1429. - Resolution: Not A Problem Closing this as not a problem. The file needs to be kept in memory for the GUI to work. So, the problem should be fixed with a higher limit. Unable to View a 9mb file even after setting a large Heap Size of 3GB while TIKA GUI --- Key: TIKA-1429 URL: https://issues.apache.org/jira/browse/TIKA-1429 Project: Tika Issue Type: Bug Components: gui Affects Versions: 1.6 Environment: Windows 8 Reporter: Gautham Gowrishankar Priority: Minor We seem to have found an issue while running the Tika 1.6 jar as a GUI (-g option). It seems to work for smaller .tsv files, but we are running into a GC overhead exception while running it on one of the files in your DataSet. Strangely, it seems to work with the -x option. There might be an issue at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284). Just bringing it to your notice. Below are the logs.
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at java.lang.StringBuilder.toString(Unknown Source)
    at java.lang.StackTraceElement.toString(Unknown Source)
    at java.lang.String.valueOf(Unknown Source)
    at java.lang.StringBuilder.append(Unknown Source)
    at java.lang.Throwable.printStackTrace(Unknown Source)
    at java.lang.Throwable.printStackTrace(Unknown Source)
    at org.apache.tika.gui.TikaGUI.handleError(TikaGUI.java:351)
    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284)
    at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
    at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
    at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
    at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
    at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
    at javax.swing.AbstractButton.doClick(Unknown Source)
    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
    at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
    at java.awt.Component.processMouseEvent(Unknown Source)
    at javax.swing.JComponent.processMouseEvent(Unknown Source)
    at java.awt.Component.processEvent(Unknown Source)
    at java.awt.Container.processEvent(Unknown Source)
    at java.awt.Component.dispatchEventImpl(Unknown Source)
    at java.awt.Container.dispatchEventImpl(Unknown Source)
    at java.awt.Component.dispatchEvent(Unknown Source)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
    at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
    at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
    at java.awt.Container.dispatchEventImpl(Unknown Source)
    at java.awt.Window.dispatchEventImpl(Unknown Source)
    at java.awt.Component.dispatchEvent(Unknown Source)
    at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.awt.EventQueue$4.run(Unknown Source)
    at java.awt.EventQueue$4.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.run(Unknown Source)
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.StringBuilder.toString(Unknown Source)
    at com.sun.java.swing.plaf.windows.TMSchema$Part.getControlName(Unknown Source)
    at com.sun.java.swing.plaf.windows.XPStyle.isSkinDefined(Unknown Source
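When raising the heap limit, it can help to confirm what the JVM actually received, since a `-Xmx` flag placed after `-jar tika-app.jar` is handed to the application rather than to the JVM. A small sketch for checking the effective limit; the class name is illustrative and not part of Tika:

```java
public class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same JVM options used for the GUI (for example `java -Xmx3g HeapCheck`) shows whether the 3GB setting took effect.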
[jira] [Commented] (TIKA-1493) Update for JAXRS page with details on passing password
[ https://issues.apache.org/jira/browse/TIKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605292#comment-14605292 ] Tyler Palsulich commented on TIKA-1493: --- Can someone familiar with the latest in passing a password to Tika server update the wiki page? Or, is setting the environment variable enough? Update for JAXRS page with details on passing password -- Key: TIKA-1493 URL: https://issues.apache.org/jira/browse/TIKA-1493 Project: Tika Issue Type: Improvement Components: documentation Reporter: Peter Bowyer Priority: Minor Labels: documentation, newbie I signed up for a wiki account to make the edit, but the page is immutable :( It would be really helpful to put on https://wiki.apache.org/tika/TikaJAXRS information about passing the password for encrypted PDFs into TikaJAXRS. In Changelog.txt I discovered the TIKA_PASSWORD environment variable which has worked for me, and it'd be nice to save others having to hunt around. I'd also like to know if there's a way to pass it in per-request (a HTTP header? Useful when many different passwords) - not found anything in the source code for that though. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1552) Pdf document parser
[ https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1552. - Resolution: Not A Problem Marking this as not a problem, since Adobe Reader also adds white space. Pdf document parser --- Key: TIKA-1552 URL: https://issues.apache.org/jira/browse/TIKA-1552 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Konstantin Attachments: 2014_US_Federal_Budget.pdf, issue.jpg Hello, We found that when a pdf document has marked text inside frame (table) then after parsing Tika insert tabs between words. Original text from attached file: Provides $17.7 billion in discretionary funding for the National Aeronautics and Space Parsed text (jira removed tabs, so i will add - symbols instead): •Provides - $17.7 - billion-in-discretionary-funding-for-the-National-Aeronautics-and-Space Please take a look in attached screenshot. On the left side is the parsed text in text editor Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted
[ https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1452. - Resolution: Not A Problem I'm closing this as not a problem. But, please feel free to reopen if you're still having this issue! parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted -- Key: TIKA-1452 URL: https://issues.apache.org/jira/browse/TIKA-1452 Project: Tika Issue Type: Bug Components: detector, metadata, parser Affects Versions: 1.6 Environment: jre6 Reporter: Abhishek I am passing a file as an input stream to the parser.parse() method while using the Apache Tika library to convert a file to text. The method throws an exception (displayed below), but the input stream is closed successfully in the finally block. Then, while renaming the file, the File.renameTo method from java.io returns false. I am not able to rename/delete/move the file despite successfully closing the inputStream. I am afraid another instance of the file is created while the parser.parse() method processes the file, and that it doesn't get closed by the time the exception is thrown. Is that possible? If so, what should I do to rename or delete the file? The Exception thrown while checking the content type is
java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
    at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
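A sketch of one way to make sure the caller's own stream is released before the rename, assuming the failure happens inside the parse call. The `parse` method below is a hypothetical stand-in for `parser.parse()` that always throws `NoClassDefFoundError`; it is not Tika's API. try-with-resources closes the stream even when an `Error` (rather than an `Exception`) propagates, and `Files.move` throws an exception describing the failure where `File.renameTo` only returns false.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameAfterFailure {
    // Hypothetical stand-in for parser.parse(); always fails like the report.
    static void parse(InputStream in) {
        throw new NoClassDefFoundError("simulated parser failure");
    }

    /** Tries to parse, then moves the file; returns true if parsing succeeded. */
    static boolean processAndMove(Path source, Path target) throws IOException {
        boolean parsed = false;
        try (InputStream in = Files.newInputStream(source)) {
            parse(in);
            parsed = true;
        } catch (LinkageError e) {
            // NoClassDefFoundError is an Error, so catch (Exception e) would miss it.
            System.err.println("Parse failed: " + e);
        }
        // The stream is closed by now, so our own handle cannot block the move.
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return parsed;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("tika-sketch", ".jpg");
        Path target = source.resolveSibling(source.getFileName() + ".done");
        boolean parsed = processAndMove(source, target);
        System.out.println("parsed=" + parsed + ", moved=" + Files.exists(target));
        Files.deleteIfExists(target);
    }
}
```

If the move still fails with this structure, the exception message from `Files.move` is a better starting point for diagnosis than the bare false from `renameTo`.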
[jira] [Closed] (TIKA-1439) PDF embeded with document can not parse.
[ https://issues.apache.org/jira/browse/TIKA-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1439. - Resolution: Duplicate PDF embeded with document can not parse. Key: TIKA-1439 URL: https://issues.apache.org/jira/browse/TIKA-1439 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Environment: windows7 Reporter: sunxingzhe Labels: pdfbox Attachments: PDF2XHTML.java_diff.html I insert a Excel file into the pdf file. But can not extracte embedded excel resources. The attachment file PDF2XHTML.java_diff.html is the diff file. Please confirm it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1233: -- Fix Version/s: (was: 1.6) 1.10 PDFBox can throw StringIndexOutOfBoundsException on some dates -- Key: TIKA-1233 URL: https://issues.apache.org/jira/browse/TIKA-1233 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tim Allison Priority: Trivial Labels: easyfix Fix For: 1.10 PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date string for parsing is empty or contains only spaces. A few of my test pdfs have this feature. Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in TIKA
{noformat}
@@ -171,6 +171,9 @@
             addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
         try {
             Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
             addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
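The pattern in the patch, shown as a self-contained sketch. `parseDate` is a hypothetical stand-in for the PDFBox date parser, written to fail on blank input in the way the report describes; the real parser's internals are assumed, not reproduced.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SafeDateSketch {
    // Hypothetical parser: like the reported bug, it indexes into the string
    // without first checking for empty input.
    static String parseDate(String raw) throws IOException {
        String trimmed = raw.trim();
        if (trimmed.charAt(0) != 'D') { // StringIndexOutOfBoundsException on ""
            throw new IOException("Unrecognised date: " + raw);
        }
        return trimmed.substring(2);
    }

    /** Adds the date to the metadata map, ignoring values that cannot be parsed. */
    static void addCreated(Map<String, String> metadata, String raw) {
        try {
            metadata.put("created", parseDate(raw));
        } catch (IOException e) {
            // Invalid date format, just ignore
        } catch (StringIndexOutOfBoundsException e) {
            // Empty or blank date string, just ignore
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        addCreated(metadata, "   ");        // blank: skipped instead of crashing
        addCreated(metadata, "D:20140101"); // well-formed: stored
        System.out.println(metadata);
    }
}
```

Catching the unchecked exception at the call site keeps one malformed date from aborting extraction of the rest of the document's metadata.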
[jira] [Resolved] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1585. --- Resolution: Fixed Good idea, [~lewismc]. I added it to http://people.apache.org/~tpalsulich/tika.html. The server is down right now. If/when another one is started, we'll need to start it with the right CORS argument (http://people.apache.org) and I'll update the page with the right IP address. Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
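For context on the CORS point: a browser only exposes a cross-origin AJAX response to the page if the response carries an `Access-Control-Allow-Origin` header that matches the page's origin (or is `*`). A minimal sketch of that server-side decision; the class, method and configured value are illustrative assumptions, not Tika Server's implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CorsSketch {
    /** Headers to add to a response for a request carrying the given Origin. */
    static Map<String, String> corsHeaders(String requestOrigin, String allowedOrigin) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (requestOrigin == null) {
            return headers; // no Origin header: not a cross-origin browser request
        }
        if ("*".equals(allowedOrigin)) {
            headers.put("Access-Control-Allow-Origin", "*");
        } else if (requestOrigin.equals(allowedOrigin)) {
            headers.put("Access-Control-Allow-Origin", requestOrigin);
            headers.put("Vary", "Origin"); // response differs per origin, so caches must key on it
        }
        return headers;
    }

    public static void main(String[] args) {
        String allowed = "http://people.apache.org"; // the origin named in the comment above
        System.out.println(corsHeaders("http://people.apache.org", allowed));
        System.out.println(corsHeaders("http://example.invalid", allowed));
    }
}
```

A request from any other origin gets no CORS header, so the browser blocks the page from reading the response.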
[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605300#comment-14605300 ] Tyler Palsulich commented on TIKA-1536: --- Now that 1.9 is released, are there any blockers for upgrading to Java 1.7? Upgrade compiler definition in pom's to Java 7 -- Key: TIKA-1536 URL: https://issues.apache.org/jira/browse/TIKA-1536 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.7 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: TIKA-1536.patch Since we committed TIKA-1423 it would appear through [mailing list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] commentary that there is a willingness to drop support for Java 1.6 in favour of >= Java 1.7. This issue simply addresses this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Troubleshooting guide
Looks good! Thanks, Nick. Tyler On Wed, Jun 24, 2015 at 2:42 PM Nick Burch apa...@gagravarr.org wrote: Hi All I've had a go at writing up a troubleshooting guide on the wiki, hopefully covering the main problems people face (content detected wrong, parser missing etc). It's linked from the front page and at https://wiki.apache.org/tika/Troubleshooting%20Tika Please expand and correct it as needed! Thanks Nick
Re: Configuring parsers and translators
It seems like there are two goals here, both aiming to centralize configuration:

1. Provide an easy mechanism to configure which parsers to use when (TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.

Tyler

On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. talli...@mitre.org wrote:

Tyler, I see your devil's advocate point. I strongly agree with Chris about the benefit of centralizing configuration and making it easy to dump and modify the TikaConfig file. Even though the TikaConfig file might get ugly, it would be far better to have everything nailed down there than searching through service loaders...IMHO. I opened TIKA-1508 a while ago and haven't had any time to work on it...this just deals with simple parameter settings for parsers, not the far more difficult/interesting stuff that we've discussed with composite parsers.

My main worry with putting it all into config xml is that we accidentally end up re-inventing spring badly...

Yeah, or re-inventing Solr's parameter loading as my example does... :( I think that basic parameter setting should at least be fairly trivial to code...time allowing...argh.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Saturday, June 06, 2015 7:01 PM
To: dev@tika.apache.org
Subject: Re: Configuring parsers and translators

Hey Tyler, I hear you, but balance that against all the hidden things here and there, and everywhere, that I constantly keep discovering and having to pore through lines of TikaConfig - service loaders, class loaders. When things work right - no problem. When something goes wrong; HUGE waste of time.

Cheers, Chris

++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++

-----Original Message-----
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 3:59 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: Configuring parsers and translators

(Devil's advocate hat slightly on.) My one hesitation about putting it all into tika-config is that the default might get to be a monstrosity -- difficult for new users to use.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

I think it would be great to have all this in the Tika Config. The one thing then is to provide an example default config and to make it *hugely* clear rather than all the levels of indirection that we currently have going on which makes it super hard when there is a config error (SPI, swallowing print messages, etc.)

-----Original Message-----
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Saturday, June 6, 2015 at 3:45 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: Configuring parsers and translators

Hi Nick, I've been mulling this over since you sent the first message. But, I'm afraid I don't have a good solution or developed ideas. I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate? Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have free form properties and configuration files.

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote:

Anyone have any thoughts on this? On Fri, 8 May
[jira] [Closed] (TIKA-1199) Tika extracts weird signs instead of text
[ https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1199. - Resolution: Not A Problem Tika extracts weird signs instead of text - Key: TIKA-1199 URL: https://issues.apache.org/jira/browse/TIKA-1199 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: MacOSX, Linux Reporter: Marc Teutelink Attachments: gaat fout.pdf, plain_text_tika_output_from_gaat_fout_pdf.txt, structured_text_tika_output_from_gaat_fout_pdf.xml Tika extracts complete bogus text from the attached document. I have attached the .PDF in question and also added the plain and structured text output from Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1630) Mention APK support in List of Supported Formats
[ https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1630. --- Resolution: Fixed Fix Version/s: 1.9 Assignee: Tyler Palsulich Bolded the "Please note" for version 1.9. Hopefully that will help clear things up. [~flowlo], thank you for reporting this! Please let us know if you run into any other issues or have any other suggested improvements. Mention APK support in List of Supported Formats Key: TIKA-1630 URL: https://issues.apache.org/jira/browse/TIKA-1630 Project: Tika Issue Type: Improvement Components: documentation Affects Versions: 1.8 Reporter: Lorenz Leutgeb Assignee: Tyler Palsulich Priority: Trivial Fix For: 1.9 http://tika.apache.org/1.8/formats.html claims to offer a full list of supported formats but does not mention support for APK files at all. I trusted that source and only found that Tika supports APK files and their respective MIME types from looking at Tika's codebase, which is suboptimal. Please add APK files to that list as appropriate (at least include the MIME type Tika understands). Consider reevaluating the list to find out whether other formats are missing (this is not covered by this ticket). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [VOTE] Release Apache Tika 1.9 Candidate #2
+1 from me. Thanks for running this, Chris! Tyler On Mon, Jun 8, 2015 at 1:11 PM Allison, Timothy B. talli...@mitre.org wrote: +1 Built in Windows and Linux. Works on problems (that I caused!) in rc1. Let's make sure to include last Java 1.6 version in the release notes, if that's what we've decided. Thank you, Chris! Best, Tim -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Saturday, June 06, 2015 9:47 PM To: dev@tika.apache.org Cc: u...@tika.apache.org Subject: [VOTE] Release Apache Tika 1.9 Candidate #2 Hi Folks, A second candidate for the Tika 1.9 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/ The SHA1 checksum of the archive is 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1011/ Please vote on releasing this package as Apache Tika 1.9. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.9 [ ] -1 Do not release this package because… Cheers, Chris P.S. Of course here is my +1. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
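Voters would normally compare the downloaded archive against the published checksum before casting a +1. A sketch of computing a SHA-1 digest with the JDK alone; the class name is illustrative, and the file path and expected value are supplied by the reader on the command line:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {
    /** Lower-case hex SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java Sha1Check <archive> <expected-sha1>");
            return;
        }
        // Reads the whole archive into memory; fine for a source zip of modest size.
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
    }
}
```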
Re: Configuring parsers and translators
Hi Nick, I've been mulling this over since you sent the first message. But, I'm afraid I don't have a good solution or developed ideas. I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate? Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have free form properties and configuration files. Tyler On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote: Anyone have any thoughts on this? On Fri, 8 May 2015, Nick Burch wrote: Hi All This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the Parse context and it's used when processing your document To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config xml which sets their parser preferences or overrides certain bits of the default. 
The code option works for people who want to declare certain specific things, the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and java code. Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on definitions in there What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination? Nick
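The two styles Nick describes, per-document settings placed on a context object and per-JVM defaults read from a properties file, can be combined with a simple fallback. The sketch below models that idea with plain JDK classes only; `Context`, `OcrConfig` and the `language` key are hypothetical and only loosely mirror ParseContext and the *Config objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ConfigFallbackSketch {
    /** Hypothetical per-document config object with a setter, like a *Config. */
    static class OcrConfig {
        private String language;
        OcrConfig(String language) { this.language = language; }
        String getLanguage() { return language; }
        void setLanguage(String language) { this.language = language; }
    }

    /** Hypothetical type-keyed context: at most one value per class. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key, T defaultValue) {
            T value = key.cast(entries.get(key)); // cast(null) yields null
            return value != null ? value : defaultValue;
        }
    }

    /** Per-document setting wins; otherwise fall back to the per-JVM default. */
    static String languageFor(Context context, Properties jvmDefaults) {
        OcrConfig fallback = new OcrConfig(jvmDefaults.getProperty("language", "eng"));
        return context.get(OcrConfig.class, fallback).getLanguage();
    }

    public static void main(String[] args) {
        Properties jvmDefaults = new Properties();
        jvmDefaults.setProperty("language", "deu");

        Context plain = new Context();
        Context perDocument = new Context();
        perDocument.set(OcrConfig.class, new OcrConfig("fra"));

        System.out.println(languageFor(plain, jvmDefaults));
        System.out.println(languageFor(perDocument, jvmDefaults));
    }
}
```

A single Tika Config file, as discussed in this thread, would take the place of the `Properties` object here while leaving the per-document override path unchanged.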
[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
[ https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575986#comment-14575986 ] Tyler Palsulich commented on TIKA-1652: --- I think this is a duplicate of TIKA-1426? Tika Server should allow config file override from the command line like Tika App - Key: TIKA-1652 URL: https://issues.apache.org/jira/browse/TIKA-1652 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.9 Tika-app's TikaCLI allows a command line parameter, --config, to override the Tika config at the command line. For whatever reason, Tika-server doesn't; it should, since it causes a different control flow for things to get created. I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Configuring parsers and translators
(Devil's advocate hat slightly on.) My one hesitation about putting it all into tika-config is that the default might get to be a monstrosity -- difficult for new users to use. Tyler On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: I think it would be great to have all this in the Tika Config. The one thing then is to provide an example default config and to make it *hugely* clear rather than all the levels of indirection that we currently have going on which makes it super hard when there is a config error (SPI, swallowing print messages, etc.) ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, June 6, 2015 at 3:45 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: Configuring parsers and translators Hi Nick, I've been mulling this over since you sent the first message. But, I'm afraid I don't have a good solution or developed ideas. I agree, it would be very nice to consolidate all configuration for all parsers in the server and app. Is it feasible to put everything into tika-config? Then Parser implementations would read the config to pull out their own configuration. Or, would it be better to keep some configuration separate? Documentation would be an issue if every parser defines its own metadata keys... But, it might be an improvement since we don't have free form properties and configuration files. Tyler On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote: Anyone have any thoughts on this? 
On Fri, 8 May 2015, Nick Burch wrote: Hi All This came up in TIKA-1623, but I thought it might be better brought out to the list for discussion To configure parsers on a per-document basis, such as setting PDF spacing tolerances, or telling Tesseract what language it should be OCRing for, we have the *Config objects. You create one of these, use the setters to configure it for your document, pop it onto the Parse context and it's used when processing your document To configure parsers and translators on a per-JVM basis, to apply to all documents processed, it's a bit less consistent. At least some look for a properties file with a specific name, usually in the tika namespace, and grab their settings / keys / etc out of that. At least some expect to find a *Config with their program path on it, even though that remains constant between documents. None of them support getting their settings from the Tika Config As part of our evolution of parser preferences, we're moving towards people either being able to set their preferences in code, or being able to supply a Tika Config xml which sets their parser preferences or overrides certain bits of the default. The code option works for people who want to declare certain specific things, the Tika Config one gives the same functionality but allows a consistent and clean way to set it between Tika App, Tika Server and java code. Another related example is the External Parser support. Because you can have multiple External Parser instances in your setup, one per format / program, we look for all the org/apache/tika/parser/external/tika-external-parsers.xml files on the classpath, and create parser instances based on definitions in there What do we think about setting executable paths and keys/logins for parsers like OCR, Strings, Translators etc? Always on ParseContext? Properties? Custom xml config? Tika config xml? Other? Combination? Nick
Re: [DISCUSS] Thinking about completely refactoring the ExternalParser and using commons-exec
On Mon, May 25, 2015 at 4:05 PM, Nick Burch apa...@gagravarr.org wrote: On Mon, 25 May 2015, Mattmann, Chris A (3980) wrote: ExternalParser is way broke. I have some patches that somewhat fix it, but in doing so, I realized, why not just use commons-exec? I realize that this is another dependency into core, but commons-exec simplifies a lot of the stuff that's broke with ExternalParser (reading its streams, for one). Maybe we could push some or all of external parser into the tika-parsers module, so we don't have to add more dependencies into core? What is the argument for having ExternalParser in core? Provide an easy-to-extend class for downstream users to create their own external parser? Tyler
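As a rough illustration of the stream-reading problem Chris mentions, here is a JDK-only sketch that drains a stream on a background thread, which is the kind of plumbing a library such as commons-exec is meant to take off our hands. The class and method names are hypothetical; this is neither the ExternalParser code nor the commons-exec API, and a ByteArrayInputStream stands in for a child process's stdout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StreamDrainSketch {
    // Reads the stream until end-of-file and returns everything that was read.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Drains the stream on a pool thread so the caller can keep working
    // (for example, draining stderr or waiting for the process) in parallel.
    static Future<byte[]> drainAsync(ExecutorService pool, final InputStream in) {
        return pool.submit(new Callable<byte[]>() {
            @Override
            public byte[] call() throws IOException {
                return readFully(in);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Stand-in for process.getInputStream()
            InputStream fakeStdout =
                    new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
            Future<byte[]> stdout = drainAsync(pool, fakeStdout);
            System.out.println(new String(stdout.get(), StandardCharsets.UTF_8));
        } finally {
            pool.shutdown();
        }
    }
}
```

With a real child process, stdout and stderr would each get their own drain task, since reading them one after the other on a single thread can stall if the other pipe fills up.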
Re: Any reason we removed the links to other downstream Tika APIs off the main web site?
Hi Chris, I may have botched the version of the index on the site (see the other thread with Nick's comments.) I'll investigate more tonight or tomorrow, if you don't beat me to it. Tyler On May 20, 2015 4:39 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hey Folks, Before, we had links in the description of Tika that Tyler put in that included links to e.g., Tika Python and other downstream APIs. Would there be objection to putting those links back up, they seemed to have been removed? I created a wiki page on our Tika wiki with links to downstream API bindings. I would like to add the text back in, and then e.g., link to that wiki page. That OK? If I don’t hear objections in the next day or so I will add the link back in. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553281#comment-14553281 ] Tyler Palsulich commented on TIKA-1624: --- Thanks, Ken. I published the file a few minutes ago. Syntax error in DOAP file release section - Key: TIKA-1624 URL: https://issues.apache.org/jira/browse/TIKA-1624 Project: Tika Issue Type: Bug Environment: http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf Reporter: Sebb Assignee: Ken Krugler DOAP files can contain details of multiple release Versions, however each must be listed in a separate release section, for example: <release> <Version> <name>Apache XYZ</name> <created>2015-02-16</created> <revision>1.6.2</revision> </Version> </release> <release> <Version> <name>Apache XYZ</name> <created>2014-09-24</created> <revision>1.6.1</revision> </Version> </release> Please can the project DOAP be corrected accordingly? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Any reason we removed the links to other downstream Tika APIs off the main web site?
Hi Chris, I just looked again. I don't think this was a versioning issue -- I intentionally removed the links. I think the best place to add them would be on the Getting Started page [0] (at the bottom). But, it might be better to link directly to the wiki and make the link more prominent (not at the very bottom)? That way, we reduce the amount of duplicated information. On the other hand, I think it would be good to mention (on the front page) the top level ways you can use Tika: Java, command line, server, GUI, and wrappers in Python, Julia, and more. Apologies for the confusion. I believe the versioning issues from the other thread have been resolved. Tyler On Wed, May 20, 2015 at 5:54 PM, Tyler Palsulich tpalsul...@gmail.com wrote: Hi Chris, I may have botched the version of the index on the site (see the other thread with Nick's comments.) I'll investigate more tonight or tomorrow, if you don't beat me to it. Tyler On May 20, 2015 4:39 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hey Folks, Before, we had links in the description of Tika that Tyler put in that included links to e.g., Tika Python and other downstream APIs. Would there be objection to putting those links back up, they seemed to have been removed? I created a wiki page on our Tika wiki with links to downstream API bindings. I would like to add the text back in, and then e.g., link to that wiki page. That OK? If I don’t hear objections in the next day or so I will add the link back in. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats
[ https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553272#comment-14553272 ] Tyler Palsulich commented on TIKA-1630: --- That is a very good point. There is a paragraph on the formats page which explains in a little bit more detail: bq. (Please note that Apache Tika is able to detect a much wider range of formats than those listed below, this page only documents those formats from which Tika is able to extract metadata and/or textual content) Would it help if we included a link to the mimetypes file (which has all filetypes Tika can detect)? Mention APK support in List of Supported Formats Key: TIKA-1630 URL: https://issues.apache.org/jira/browse/TIKA-1630 Project: Tika Issue Type: Improvement Components: documentation Affects Versions: 1.8 Reporter: Lorenz Leutgeb Priority: Trivial http://tika.apache.org/1.8/formats.html claims to offer a full list of supported formats but does not mention support for APK files at all. I trusted that source and only found that Tika supports APK files and their respective MIME types from looking at Tika's codebase, which is suboptimal. Please add APK files to that list as appropriate (at least include the MIME type Tika understands). Consider reevaluating the list to find out whether other formats are missing (this is not covered by this ticket). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats
[ https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544104#comment-14544104 ] Tyler Palsulich commented on TIKA-1630: --- Hi. Thanks for reporting this! Can you be a little more specific about which file is supported? What in the Tika codebase indicates support for APK formats? Also, just to be clear, are you referring to Android application packages? Mention APK support in List of Supported Formats Key: TIKA-1630 URL: https://issues.apache.org/jira/browse/TIKA-1630 Project: Tika Issue Type: Improvement Components: documentation Affects Versions: 1.8 Reporter: Lorenz Leutgeb Priority: Trivial http://tika.apache.org/1.8/formats.html claims to offer a full list of supported formats but does not mention support for APK files at all. I trusted that source and only found that Tika supports APK files and their respective MIME types from looking at Tika's codebase, which is suboptimal. Please add APK files to that list as appropriate (at least include the MIME type Tika understands). Consider reevaluating the list to find out whether other formats are missing (this is not covered by this ticket). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Published Site Changes
Hi Everyone, I was about to update the site for TIKA-1619 (checksums wrong on the site), but found unpublished changes in the site. This is the status after checking out the repo and running `mvn install`: ➜ site svn status M publish/1.7/examples.html M publish/1.8/examples.html M publish/1.8/index.html M publish/1.9/examples.html M publish/doap.rdf M publish/plugin-management.html X src/examples-src Not all of the changes are correct (e.g. make the list of contributors for 1.8 point to the list for 1.7). So, I don't want to commit all of the changes. Maybe someone (probably me) didn't add site/src when committing to site/publish? I think the doap.rdf change was from r1678405 http://svn.apache.org/viewvc?view=revision&revision=1678405. But, I don't know about the others. Anyone have any ideas/clean solutions before I check each page by hand and redo any necessary 1.7/8/9 changes? Thanks, Tyler
[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544150#comment-14544150 ] Tyler Palsulich commented on TIKA-1624: --- [~kkrugler], yes. I just updated the release instructions. Syntax error in DOAP file release section - Key: TIKA-1624 URL: https://issues.apache.org/jira/browse/TIKA-1624 Project: Tika Issue Type: Bug Environment: http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf Reporter: Sebb Assignee: Ken Krugler DOAP files can contain details of multiple release Versions, however each must be listed in a separate release section, for example: <release> <Version> <name>Apache XYZ</name> <created>2015-02-16</created> <revision>1.6.2</revision> </Version> </release> <release> <Version> <name>Apache XYZ</name> <created>2014-09-24</created> <revision>1.6.1</revision> </Version> </release> Please can the project DOAP be corrected accordingly? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Translation API question
Hi Sergey, Unfortunately, not yet. See TIKA-1328. Tyler On Tue, May 5, 2015 at 4:51 PM, Sergey Beryozkin sberyoz...@gmail.com wrote: Hi All Is it possible to submit a document to the Translation API and get the translated words as a sequence of events? For example, with the regular Tika API it is possible to submit a document and get the metadata and the data, and this data can be indexed, etc. What about submitting a document (for example, in French) to the translation API and getting a list of the words in English, so that they can be indexed? I'm thinking that one could then use a query to find all the documents in French that contain a given word as it reads in English. Example: find a French doc containing "thanks", etc... Not sure how much sense it makes though :-) Cheers, Sergey
Re: Java 1.6 support for Tika 1.9?
I should have included the fact this is the last release planned to support Java 1.6 in the announcement (as we talked about a while back). But, since that has passed, should we just update the announcement on the website, wait another release, or just drop Java 1.6 support when we release 1.9? I could be persuaded to do any of the above. Tyler On Mon, Apr 27, 2015 at 1:30 PM, Konstantin Gribov gros...@gmail.com wrote: As I remember, we thought about announcing some release last java 6 compatible one and give Tika users some time to migrate. E. g., we can announce 1.10 last java 6 release when releasing 1.9. IMHO, in such case it wouldn't be a sudden change for downstream project developers and Tika users. -- Best regards, Konstantin Gribov пн, 27 апр. 2015 г. в 20:09, Allison, Timothy B. talli...@mitre.org: Hi All, I can't remember where we are on this. Are we dropping support for Java 1.6 in Tika 1.9? If so, should we open an issue to integrate tika-java7 into core, add diamond operators, catching multiple exceptions... anything else...? Or, do we want to wait for Tika 2.0 or Tika 1.10? Best, Tim
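For anyone who has not used the Java 7 features Tim mentions, here is a small illustration of the diamond operator and multi-catch. The class and method are made up for the example and are not Tika code.

```java
import java.util.ArrayList;
import java.util.List;

class Java7Sketch {
    // Parses each token as an int, skipping tokens that are null or not numeric.
    static List<Integer> parseAll(String[] tokens) {
        // Diamond operator: the type argument is inferred from the declaration.
        List<Integer> values = new ArrayList<>();
        for (String token : tokens) {
            try {
                values.add(Integer.parseInt(token.trim()));
            } catch (NumberFormatException | NullPointerException e) {
                // Multi-catch: one handler for both failure modes; the token is skipped.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(new String[] {"1", null, "x", " 2 "}));
    }
}
```

Neither construct compiles with a Java 6 source level, which is why adopting them is tied to dropping 1.6 support.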
Re: comparing Tika's file detect with other tools?
Hi Tim, I don't know whether there would be licensing concerns. But, we do have TIKA-289 to track merging magic bytes from `file` into Tika. Tyler On Wed, Apr 22, 2015 at 10:40 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Tim, I don't believe there's any issue with comparing results. If you were looking at the source for file, then it gets more gray, but I think even that would be OK as long as you weren't copying code or directly re-implementing algorithms. -- Ken From: Allison, Timothy B. Sent: April 22, 2015 5:47:17am PDT To: dev@tika.apache.org Subject: comparing Tika's file detect with other tools? Would it be frowned upon to compare Tika's file detection with other tools, like file? Any concerns about effectively reverse engineering (when we find that Tika is wrong) from a non-Apache project? Any other sensitivities I should be aware of? Best, Tim -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507259#comment-14507259 ] Tyler Palsulich commented on TIKA-1585: --- Is there an Apache hosted location we'd like to stand this up? If not, I'll close this issue off. http://tpalsulich.github.io/TikaExamples/ Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: NUTCH-1994 and UCAR Dependencies
Hi Lewis, I also tried upgrading Tika in Nutch, but ran into the same issue (although udunits is found, as expected): [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: edu.ucar#jj2000;5.2: not found [ivy:retrieve] :: org.itadaki#bzip2;0.9.1: not found [ivy:retrieve] :: Thanks for pushing the dependencies out. Tyler On Tue, Apr 21, 2015 at 1:50 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, Whilst addressing NUTCH-1994, I've experienced a dependency problem (related to unpublished artifacts on Maven Central) which I am working through right now. When making the upgrade in Nutch, I get the following [ivy:resolve] -- artifact edu.ucar#udunits;4.5.5!udunits.jar: [ivy:resolve] http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar [ivy:resolve] :: [ivy:resolve] :: UNRESOLVED DEPENDENCIES :: [ivy:resolve] :: [ivy:resolve] :: edu.ucar#jj2000;5.2: not found [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found [ivy:resolve] :: [ivy:resolve] [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /usr/local/trunk_clean/build.xml:112: The following error occurred while executing this line: /usr/local/trunk_clean/src/plugin/build.xml:60: The following error occurred while executing this line: /usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to resolve dependencies: resolve failed - see output for details Total time: 17 seconds I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts so they will be available imminently. The remaining artifact at edu.ucar#jj2000;5.2 has a corrupted POM, which means that OSS Nexus will not accept it. I'll send a pull request further upstream for that ASAP. Finally, the BZIP dependency is a 3rd party dependency from another org, licensed under the MIT license. 
So I will register interest to publish this dependency, push it, then we will be good to go. Lewis -- *Lewis*
[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503778#comment-14503778 ] Tyler Palsulich commented on TIKA-1607: --- Good idea! What if you created a subclass of {{Metadata}} ({{ExtendedMetadata}}?) which supports mapping to a {{List<Map<String, Object>>}}? Then, when populating the metadata with a phone number, you can check if {{metadata instanceof ExtendedMetadata}} and respond accordingly. Any drastic changes would be a good candidate for Tika 2.0. Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.9 I am currently working on implementing more comprehensive extraction and enhancement of the Tika support for phone number extraction and metadata modeling. Right now we utilize the String[] multivalued support available within Tika to persist phone numbers as {code} Metadata: String: String[] Metadata: phonenumbers: number1, number2, number3, ... {code} I would like to propose we extend multi-valued support outside of the String[] paradigm by implementing a more abstract Collection of Objects such that we could consider and implement the phone number use case as follows {code} Metadata: String: List<HashMap<String,String>> {code} Where Object could be a Collection<HashMap<String/Property, String/int/long>> e.g. {code} Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), (etc)] {code} There are obvious backwards compatibility issues with this approach... additionally it is a fundamental change to the core Metadata API. 
I hope that the String, Object Mapping however is flexible enough to allow me to model Tika Metadata the way I want. Any comments folks? Thanks Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
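The subclass idea from the comment above could look roughly like this. It is a self-contained sketch under stated assumptions: SketchMetadata and SketchExtendedMetadata are hypothetical stand-ins rather than Tika's Metadata class, the "number" key and the sample phone number are invented placeholders, and only the "phonenumbers" and "LibPN-CountryCode" names come from the issue text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the existing String -> String[] style metadata.
class SketchMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    void add(String name, String value) {
        List<String> list = values.get(name);
        if (list == null) {
            list = new ArrayList<>();
            values.put(name, list);
        }
        list.add(value);
    }

    String[] getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }
}

// Hypothetical subclass that additionally keeps a List<Map<String, Object>> per name.
class SketchExtendedMetadata extends SketchMetadata {
    private final Map<String, List<Map<String, Object>>> structured = new HashMap<>();

    void addStructured(String name, Map<String, Object> entry) {
        List<Map<String, Object>> list = structured.get(name);
        if (list == null) {
            list = new ArrayList<>();
            structured.put(name, list);
        }
        list.add(entry);
    }

    List<Map<String, Object>> getStructured(String name) {
        List<Map<String, Object>> list = structured.get(name);
        return list == null ? new ArrayList<Map<String, Object>>() : list;
    }
}

class PhoneNumberSketch {
    // Always writes the plain value, so existing callers keep working; adds the
    // richer entry only when the caller passed in the extended subclass.
    static void record(SketchMetadata metadata, String number, String countryCode) {
        metadata.add("phonenumbers", number);
        if (metadata instanceof SketchExtendedMetadata) {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("number", number);
            entry.put("LibPN-CountryCode", countryCode);
            ((SketchExtendedMetadata) metadata).addStructured("phonenumbers", entry);
        }
    }

    public static void main(String[] args) {
        SketchExtendedMetadata extended = new SketchExtendedMetadata();
        record(extended, "+15550100", "US"); // placeholder number
        System.out.println(extended.getValues("phonenumbers").length);
        System.out.println(extended.getStructured("phonenumbers"));
    }
}
```

The appeal of this arrangement is that the String[] view stays intact for backwards compatibility, while code that knows about the subclass can read the structured entries.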
[ANNOUNCE] Apache Tika 1.8 Released
The Apache Tika project is pleased to announce the release of Apache Tika 1.8. The release contents have been pushed out to the main Apache release site and to the Maven Central sync, so the releases should be available as soon as the mirrors get the syncs. Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 1.8 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/tika/CHANGES-1.8.txt Apache Tika is available in source form from the following download page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/ In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: https://people.apache.org/keys/group/tika.asc For more information on Apache Tika, visit the project home page: http://tika.apache.org/ -- Tyler Palsulich, on behalf of the Apache Tika community
[RESULT] [VOTE] Apache Tika 1.8 Release Candidate #2
Hi Everyone, The VOTE to release Tika 1.8 RC #2 has passed with the following tally: +1: Chris Mattmann, Hong-Thai Nguyen, Konstantin Gribov, Lewis John Mcgibbney, Oleg Tikhonov, Tim Allison, Tyler Palsulich. ±0: None. -1: None. I'll move forward with the release process now. Thank you all for your VOTE and collaboration, Tyler
Re: [VOTE] Apache Tika 1.8 Release Candidate #2
Thank you, Everyone! I'll move forward now. Lewis, KEYS are here: https://people.apache.org/keys/group/tika.asc. Of course, I'm also +1. Tyler On Mon, Apr 20, 2015 at 3:47 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, On Thu, Apr 16, 2015 at 2:42 PM, dev-digest-h...@tika.apache.org wrote: Hi Folks, A candidate for the Tika 1.8 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/ The SHA1 checksum of the archive is 5e22fee9079370398472e59082d171ae2d7fdd31. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1009 Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. Where is the KEYS file? All signatures are fine. Tests are A-OK. The remaining issue is TIKA-1616, which was patched and committed to trunk. IMHO this is not a blocker. We could probably release 1.9 in a shorter release cycle to accommodate the change. [X] +1 Release this package as Apache Tika 1.8 I am +1 for releasing this as 1.8. Lewis
Re: [VOTE] Apache Tika 1.8 Release Candidate #2
Hi Ken, Sorry for the delayed response. No, that patch is not included in this RC (as I think you know, given your resolution of TIKA-1606). Have a good night, Tyler On Sun, Apr 19, 2015 at 10:49 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Tyler, Does this include Lewis's fix for https://issues.apache.org/jira/browse/TIKA-1606? It's a simple change (bumping the Guava version), but as seen this can have unexpected consequences. I'm fine either way. -- Ken From: Tyler Palsulich Sent: April 18, 2015 8:29:22pm PDT To: dev@tika.apache.org Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2 Hi Folks, If there are no blocking complaints (OSGi?) by Monday (a little longer than 3 days, I realize), I'll mark this as passed and finish the release process. Of course, it's no problem for me to cut another RC, if it's needed. Have a great weekend! Tyler I've run into one problem while testing Tika 1.8 with Bixo It involves a dependency issue involving (of course) Guava, since that project loves to break their API :( The bixo-core jar has these transitive dependencies on various versions of Guava: Hadoop - 11.0.2 Cascading - 14.0.1 Tika-parsers - 10.0.1 cdm - 17.0 Everyone winds up using version 10.0.1 (note that Tika has a dependency on cdm, which wants to use 17.0) The problem is that Hadoop (for any recent version) uses an API from Guava's cache implementation that no longer exists: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache; java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache; at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62) at org.apache.hadoop.io.compress.CodecPool.clinit(CodecPool.java:74) at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1272) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:79) So what 
this means is that anyone trying to use Tika with Hadoop will need to play games with the class loader to get the older version of Guava - though that can cause other issues if Hadoop (or Cascading, etc) rely on anything that's only in the newer Guava API. Guava 10.0.1 was released about 3.5 years ago; 11.0.2 was from about 3 years ago. So it seems like we should upgrade to at least 11.0.2. But I don't know if this is enough of an issue to require another RC. -- Ken PS - I've created https://issues.apache.org/jira/browse/TIKA-1606 to track this. From: Tyler Palsulich Sent: April 13, 2015 10:56:29am PDT To: dev@tika.apache.org, u...@tika.apache.org Subject: [VOTE] Apache Tika 1.8 Release Candidate #2 Hi Folks, A candidate for the Tika 1.8 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/ The SHA1 checksum of the archive is 5e22fee9079370398472e59082d171ae2d7fdd31. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1009 Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.8 [ ] ±0 I don't object to this release, but I haven't checked it [ ] -1 Do not release this package because... Thanks, Tyler -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
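When chasing a NoSuchMethodError like the one Ken reports, it can help to see which jar a class was actually loaded from. The following is a small diagnostic sketch using only standard JDK calls; the class name ClassOriginSketch and the fallback label are invented for the example, and passing com.google.common.cache.CacheBuilder as the argument is just a suggestion that assumes Guava is on the classpath.

```java
import java.security.CodeSource;

class ClassOriginSketch {
    // Returns the location (typically a jar URL) the class was loaded from,
    // or a fallback label for classes without a code source (e.g. JDK core classes).
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. com.google.common.cache.CacheBuilder
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + originOf(Class.forName(name)));
    }
}
```

Run from inside the failing application (same classpath or class loader), this would show whether the 10.0.1 or a newer Guava jar won the dependency resolution.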
RE: [VOTE] Apache Tika 1.8 Release Candidate #2
Hi Folks, If there are no blocking complaints (OSGi?) by Monday (a little longer than 3 days, I realize), I'll mark this as passed and finish the release process. Of course, it's no problem for me to cut another RC, if it's needed. Have a great weekend! Tyler I've run into one problem while testing Tika 1.8 with Bixo. It involves a dependency issue involving (of course) Guava, since that project loves to break their API :( The bixo-core jar has these transitive dependencies on various versions of Guava: Hadoop - 11.0.2 Cascading - 14.0.1 Tika-parsers - 10.0.1 cdm - 17.0 Everyone winds up using version 10.0.1 (note that Tika has a dependency on cdm, which wants to use 17.0). The problem is that Hadoop (for any recent version) uses an API from Guava's cache implementation that no longer exists: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache; java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache; at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62) at org.apache.hadoop.io.compress.CodecPool.<clinit>(CodecPool.java:74) at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1272) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:79) So what this means is that anyone trying to use Tika with Hadoop will need to play games with the class loader to get the older version of Guava - though that can cause other issues if Hadoop (or Cascading, etc) rely on anything that's only in the newer Guava API. Guava 10.0.1 was released about 3.5 years ago; 11.0.2 was from about 3 years ago. So it seems like we should upgrade to at least 11.0.2. But I don't know if this is enough of an issue to require another RC. -- Ken PS - I've created https://issues.apache.org/jira/browse/TIKA-1606 to track this. 
From: Tyler Palsulich Sent: April 13, 2015 10:56:29am PDT To: dev@tika.apache.org, u...@tika.apache.org Subject: [VOTE] Apache Tika 1.8 Release Candidate #2 Hi Folks, A candidate for the Tika 1.8 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/ The SHA1 checksum of the archive is 5e22fee9079370398472e59082d171ae2d7fdd31. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1009 Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.8 [ ] ±0 I don't object to this release, but I haven't checked it [ ] -1 Do not release this package because... Thanks, Tyler -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
[jira] [Closed] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
[ https://issues.apache.org/jira/browse/TIKA-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1266. - Resolution: Not A Problem Thanks, [~bobpaulin]! Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox -- Key: TIKA-1266 URL: https://issues.apache.org/jira/browse/TIKA-1266 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.4, 1.5 Reporter: pm The tika-bundle currently has the Embed-Dependency header filled with embedded dependencies. Embed-Dependency is not defined in OSGI spec, Bundle-ClassPath is . Please add Bundle-ClassPath with list of embedded JAR names prefixed with ., . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[VOTE] Apache Tika 1.8 Release Candidate #2
Hi Folks, A candidate for the Tika 1.8 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/ The SHA1 checksum of the archive is 5e22fee9079370398472e59082d171ae2d7fdd31. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1009 Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.8 [ ] ±0 I don't object to this release, but I haven't checked it [ ] -1 Do not release this package because... Thanks, Tyler
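For voters who want to check the SHA1 quoted above without extra tooling, here is a minimal JDK-only sketch that computes the SHA-1 hex digest of a file and compares it to an expected value. The class name and the command-line argument order are assumptions made for the example; this is not part of the Tika release scripts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Sketch {
    // Streams the input through SHA-1 and returns the digest as lowercase hex.
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java Sha1Sketch <archive> <expected-sha1>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK " + actual : "MISMATCH " + actual);
        }
    }
}
```

A matching checksum only shows the download is intact; checking the signatures against the KEYS file is still needed to confirm who produced the archive.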
[jira] [Commented] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492638#comment-14492638 ] Tyler Palsulich commented on TIKA-1593: --- See https://svn.apache.org/repos/asf/tika/site/src/site/apt/download.apt.vm -- you need the vm extension. Then, you can use {code}${project.parent.version}{code} to get the current version of the project. Then, when we update the site for a new release, you just have to change the version number in the site's pom.xml file. I'll fix this right now. Doco: Broken link to Parser Quick Start Guide --- Key: TIKA-1593 URL: https://issues.apache.org/jira/browse/TIKA-1593 Project: Tika Issue Type: Bug Components: documentation Affects Versions: 1.7 Reporter: Dan Rollo Priority: Minor The Tika web page: https://tika.apache.org/contribute.html, under the Section: New Parsers, Detectors and Mime Types, there is a link with the text: Parser Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, and does not work. The .apt extension seems odd. I don't know what the link should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1593. --- Resolution: Fixed Assignee: Tyler Palsulich Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any more. Doco: Broken link to Parser Quick Start Guide --- Key: TIKA-1593 URL: https://issues.apache.org/jira/browse/TIKA-1593 Project: Tika Issue Type: Bug Components: documentation Affects Versions: 1.7 Reporter: Dan Rollo Assignee: Tyler Palsulich Priority: Minor The Tika web page: https://tika.apache.org/contribute.html, under the Section: New Parsers, Detectors and Mime Types, there is a link with the text: Parser Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, and does not work. The .apt extension seems odd. I don't know what the link should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492662#comment-14492662 ] Tyler Palsulich edited comment on TIKA-1593 at 4/13/15 5:02 PM: Fixed in r1673240 and r1673241. Thank you [~bhamail]! Please let us know if you find any more. was (Author: tpalsulich): Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any more. Doco: Broken link to Parser Quick Start Guide --- Key: TIKA-1593 URL: https://issues.apache.org/jira/browse/TIKA-1593 Project: Tika Issue Type: Bug Components: documentation Affects Versions: 1.7 Reporter: Dan Rollo Assignee: Tyler Palsulich Priority: Minor The Tika web page: https://tika.apache.org/contribute.html, under the Section: New Parsers, Detectors and Mime Types, there is a link with the text: Parser Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, and does not work. The .apt extension seems odd. I don't know what the link should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources
[ https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1600. --- Resolution: Fixed Assignee: Hong-Thai Nguyen Thanks, [~thaichat04]! I just updated it -- reformatted the ODF parsing files (they were all a bit odd with whitespace) and moved the test into the existing test file. Marking this as fixed and will cut a new release shortly. Unable to parse ODT files because of failed to close temporary resources Key: TIKA-1600 URL: https://issues.apache.org/jira/browse/TIKA-1600 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.8 Environment: Windows Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Attachments: Manuel_koha.odt Many ODT files are failed to parse causing of this exception. A sample file in attachment {code} Apache Tika was unable to parse the document at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt. The full exception stack trace is included below: org.apache.tika.exception.TikaException: Failed to close temporary resources at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299) at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256) at javax.swing.AbstractButton.fireActionPerformed(Unknown Source) at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.setPressed(Unknown Source) at javax.swing.AbstractButton.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source) at java.awt.Component.processMouseEvent(Unknown Source) at javax.swing.JComponent.processMouseEvent(Unknown Source) at 
java.awt.Component.processEvent(Unknown Source) at java.awt.Container.processEvent(Unknown Source) at java.awt.Component.dispatchEventImpl(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at java.awt.Component.dispatchEvent(Unknown Source) at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at java.awt.Window.dispatchEventImpl(Unknown Source) at java.awt.Component.dispatchEvent(Unknown Source) at java.awt.EventQueue.dispatchEventImpl(Unknown Source) at java.awt.EventQueue.access$400(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue.dispatchEvent(Unknown Source) at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.run(Unknown Source) Caused by: java.io.IOException: Could not delete temporary file C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70) at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121) at 
org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150) ... 42 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
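The trace above bottoms out in "java.io.IOException: Could not delete temporary file". The thread does not state the root cause. One frequent reason on Windows (an assumption here, not something confirmed in this issue) is that a stream on the file is still open when the delete is attempted. A minimal sketch of the close-then-delete ordering follows; TempFileCleanupSketch and writeAndDelete are hypothetical names and are not part of Tika's TemporaryResources:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileCleanupSketch {

    // Hypothetical helper (not Tika code): writes data to a temporary file,
    // closes the stream, and only then deletes the file.
    static boolean writeAndDelete(byte[] data) {
        try {
            Path tmp = Files.createTempFile("apache-tika-", ".tmp");
            // try-with-resources closes the stream before the delete below,
            // so no handle on the file is left open by this method.
            try (OutputStream out = Files.newOutputStream(tmp)) {
                out.write(data);
            }
            Files.delete(tmp); // throws IOException if the file cannot be removed
            return !Files.exists(tmp);
        } catch (IOException e) {
            // Wrap the checked exception so callers see the original cause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted: " + writeAndDelete(new byte[] {1, 2, 3}));
    }
}
```

If the stream were left open until after the delete call, the delete could fail on some platforms, which would be consistent with the "Caused by" line in the report, though that link is a guess.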
[jira] [Updated] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources
[ https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1600: -- Priority: Blocker (was: Major) Unable to parse ODT files because of failed to close temporary resources Key: TIKA-1600 URL: https://issues.apache.org/jira/browse/TIKA-1600 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.8 Environment: Windows Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Blocker Attachments: Manuel_koha.odt Many ODT files are failed to parse causing of this exception. A sample file in attachment {code} Apache Tika was unable to parse the document at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt. The full exception stack trace is included below: org.apache.tika.exception.TikaException: Failed to close temporary resources at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299) at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256) at javax.swing.AbstractButton.fireActionPerformed(Unknown Source) at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.setPressed(Unknown Source) at javax.swing.AbstractButton.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source) at java.awt.Component.processMouseEvent(Unknown Source) at javax.swing.JComponent.processMouseEvent(Unknown Source) at java.awt.Component.processEvent(Unknown Source) at java.awt.Container.processEvent(Unknown Source) at java.awt.Component.dispatchEventImpl(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at 
java.awt.Component.dispatchEvent(Unknown Source) at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at java.awt.Window.dispatchEventImpl(Unknown Source) at java.awt.Component.dispatchEvent(Unknown Source) at java.awt.EventQueue.dispatchEventImpl(Unknown Source) at java.awt.EventQueue.access$400(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue.dispatchEvent(Unknown Source) at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.run(Unknown Source) Caused by: java.io.IOException: Could not delete temporary file C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70) at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121) at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150) ... 42 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [VOTE] Release Apache Tika 1.8 Candidate #1
Hi Folks, Marking this VOTE as failed. Now that the above issues have been addressed, I'll cut a new release. Please let me know if you find any other blockers. Thanks, Tyler On Mon, Apr 13, 2015 at 12:45 AM, Hong-Thai Nguyen hngu...@customermatrix.com wrote: Not yet, I'm investigating more on TIKA-1600 today. Hong-Thai -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, April 13, 2015 01:07 To: dev@tika.apache.org Subject: RE: [VOTE] Release Apache Tika 1.8 Candidate #1 I don't think we've solved TIKA-1600, yet, or have we? -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Sunday, April 12, 2015 12:12 AM To: dev@tika.apache.org Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1 Are we ready for another RC? I'd like to make sure the above issues are (believed to be) settled before the next cut. Thanks, Tyler On Apr 10, 2015 4:55 PM, David Meikle loo...@gmail.com wrote: On 10 Apr 2015, at 11:38, Allison, Timothy B. talli...@mitre.org wrote: I agree that the ODT issue might require a respin. What do others think? +1 for re-spin. Unfortunately, there might be 2 odt docs (mime type: “application/vnd.oasis.opendocument.text”?) in govdocs1…so we wouldn't see that problem. I did do a comparison of 1.7 vs 1.8-rc1, and the results are here: https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip I encourage folks (if you haven't, and if you care :) ) to take a look and see if you see something that I don’t. Thanks for this Tim. About to get on a flight, so will check through on that. Cheers, Dave
Re: [VOTE] Release Apache Tika 1.8 Candidate #1
Are we ready for another RC? I'd like to make sure the above issues are (believed to be) settled before the next cut. Thanks, Tyler On Apr 10, 2015 4:55 PM, David Meikle loo...@gmail.com wrote: On 10 Apr 2015, at 11:38, Allison, Timothy B. talli...@mitre.org wrote: I agree that the ODT issue might require a respin. What do others think? +1 for re-spin. Unfortunately, there might be 2 odt docs (mime type: “application/vnd.oasis.opendocument.text”?) in govdocs1…so we wouldn't see that problem. I did do a comparison of 1.7 vs 1.8-rc1, and the results are here: https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip I encourage folks (if you haven't, and if you care :) ) to take a look and see if you see something that I don’t. Thanks for this Tim. About to get on a flight, so will check through on that. Cheers, Dave
Re: [VOTE] Release Apache Tika 1.8 Candidate #1
CC'ing user@tika for visibility. Tyler On Tue, Apr 7, 2015 at 4:54 PM, Tyler Palsulich tpalsul...@apache.org wrote: Hi Folks, A candidate for the Tika 1.8 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.8-rc1/ The SHA1 checksum of the archive is ddeb3b43ca1c1ef346658a7005434019507e096f. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1008 Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.8 [ ] -1 Do not release this package because... Have a good night! Tyler
[VOTE] Release Apache Tika 1.8 Candidate #1
Hi Folks, A candidate for the Tika 1.8 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/tika/tags/1.8-rc1/ The SHA1 checksum of the archive is ddeb3b43ca1c1ef346658a7005434019507e096f. In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapachetika-1008 Please vote on releasing this package as Apache Tika 1.8. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. [ ] +1 Release this package as Apache Tika 1.8 [ ] -1 Do not release this package because... Have a good night! Tyler
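For anyone checking the candidate before voting, the SHA1 value quoted above can be compared against a locally computed digest. The following is a minimal sketch using only the JDK; the class name Sha1CheckSketch and the idea of passing the downloaded archive's path as the first argument are assumptions for illustration, not part of the release process described in this mail:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1CheckSketch {

    // Reads the stream to the end and returns its SHA-1 digest as lowercase hex.
    static String sha1Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Checksum quoted in the vote mail above.
        String expected = "ddeb3b43ca1c1ef346658a7005434019507e096f";
        // args[0]: path to the downloaded source archive (assumed to be supplied by the caller).
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equals(expected)
                    ? "checksum matches"
                    : "checksum MISMATCH: " + actual);
        }
    }
}
```

A mismatch would normally mean a corrupted or incomplete download, so it is worth re-fetching the archive before casting a -1 on that basis.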
[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1592. - Resolution: Invalid Closing as Invalid. Feel free to create additional issues if you run into other problems with Tika! Thank you for updating with the solution! I'm glad you found it. :) (I'm also glad this wasn't a Tika issue... Ha.) It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393246#comment-14393246 ] Tyler Palsulich commented on TIKA-1592: --- I tried building ikube on a Mac, but I ran into multiple test failures. {code} Tests in error: analyze(ikube.analytics.weka.WekaClassifierIntegration) initializationError(ikube.action.rule.IsRemoteIndexCurrentIntegration) initializationError(ikube.analytics.weka.WekaForecastClassifierIntegration) initializationError(ikube.database.DataBaseIntegration) initializationError(ikube.action.index.handler.database.TableResourceProviderIntegration) initializationError(ikube.web.service.AnalyzerIntegration) initializationError(ikube.analytics.AnalyticsServiceIntegration) initializationError(ikube.scheduling.SnapshotScheduleIntegration) initializationError(ikube.web.service.SearcherJsonIntegration) initializationError(ikube.scheduling.PruneScheduleIntegration) initializationError(ikube.action.index.handler.email.IndexableEmailHandlerIntegration) initializationError(ikube.action.index.handler.strategy.GeospatialEnrichmentStrategyIntegration) initializationError(ikube.action.index.handler.filesystem.IndexableFilesystemHandlerIntegration) initializationError(ikube.web.service.SearcherXmlIntegration) initializationError(ikube.action.ResetIntegration) initializationError(ikube.action.index.handler.internet.SvnHandlerIntegration) initializationError(ikube.toolkit.DatabaseUtilitiesIntegration) initializationError(ikube.action.rule.RulesIntegration) initializationError(ikube.analytics.neuroph.NeurophAnalyzerIntegration) initializationError(ikube.database.EntityIntegration) initializationError(ikube.cluster.hzc.ClusterManagerCacheSearchIntegration) initializationError(ikube.action.index.handler.database.IndexableTableHandlerIntegration) {code} Is Linux required? Can you give some context of how you're using Tika in the failing unit test? 
Tika should not have any (or, really, there is very little) OS specific code. So, it doesn't make sense why something would try to start x11. But, a dependency could definitely be up to something fishy. It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393184#comment-14393184 ] Tyler Palsulich commented on TIKA-1592: --- Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building Tika 1.7 from source? Which test case causes this? After a quick {{grep}}, I don't see any gconf or dbus references (don't know why there would be any, off the top of my head...). When you say the logging is a gig, is that what is sent to stdout when doing {{mvn install}}? Or something else? It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393184#comment-14393184 ] Tyler Palsulich edited comment on TIKA-1592 at 4/2/15 7:09 PM: --- Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building Tika 1.7 from source? Which test case causes this? -After a quick {{grep}}, I don't see any gconf or dbus references (don't know why there would be any, off the top of my head...)- See {{grep}} output below. When you say the logging is a gig, is that what is sent to stdout when doing {{mvn install}}? Or something else? {code} ➜ trunk grep -Ri dbus . Binary file ./tika-parsers/src/test/resources/test-documents/testTIFF.tif matches Binary file ./tika-parsers/target/test-classes/test-documents/testTIFF.tif matches Binary file ./tika-parsers/target/tika-parsers-1.8-SNAPSHOT-tests.jar matches Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches ➜ trunk grep -Ri gconf . Binary file ./tika-app/target/tika-app-1.8-SNAPSHOT.jar matches Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches {code} was (Author: tpalsulich): Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building Tika 1.7 from source? Which test case causes this? After a quick {{grep}}, I don't see any gconf or dbus references (don't know why there would be any, off the top of my head...). When you say the logging is a gig, is that what is sent to stdout when doing {{mvn install}}? Or something else?
It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Access Control Allow Origin
Thank you for the feedback! I think there's an issue (don't remember the number) to be able to specify a TikaConfig file for tika-server. So, I think that would be the ideal place to put more complex CORS configuration. Tyler On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com wrote: Hi Tyler Sorry for the delay, I was off for the last few days. The change you did looks fine, the filter can check the annotations or can be configured directly (which is what you did). It might make sense to consider checking a (Java) properties resource as a possible future enhancement, as a CORS filter may have many properties. Maybe if '-cors' is provided, then check a well-known class resource where all of the CORS properties are set; if it is absent, default to '*', otherwise work with Properties... The current approach works too. It might be tricky to extend it to support more properties, but it is great for a start. Thanks, Sergey On 27/03/15 18:56, Tyler Palsulich wrote: Thank you, Sergey! I didn't know about that feature. I am going to try to work up a patch this weekend which enables CORS. I'll let you know if I run into any issues. Thanks again, Tyler On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Tuesday, March 24, 2015 at 3:41 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Access Control Allow Origin Hi Folks, I took a stab at creating an example website to submit a file to the form resource of our VM.
See http://tpalsulich.github.io/TikaExamples/. If I try to use AJAX to submit the request to make the page prettier (see the script in the head of the page, with ev.preventDefault() commented out), I get the following error: XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://tpalsulich.github.io' is therefore not allowed access. The response had HTTP status code 400. We can't allow the tika-server response header to accept * in general, since that isn't secure. So, would there be interest in including this sort of site on the VM? Then, the AJAX request won't be external and we won't have this error. The version button just takes you to the version resource on the VM (doesn't do anything with the file). Tyler
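The properties-based CORS configuration suggested earlier in this thread (read the allowed origins from a properties resource and fall back to '*' when nothing is configured) could look roughly like the sketch below. It uses only the JDK and is not tika-server or CXF code; the class name CorsConfigSketch and the property key cors.allowed.origins are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class CorsConfigSketch {

    // Hypothetical property key; tika-server does not define it.
    static final String ORIGINS_KEY = "cors.allowed.origins";

    // Returns the configured origins, or ["*"] when no configuration is present.
    static List<String> allowedOrigins(Properties props) {
        String raw = (props == null) ? null : props.getProperty(ORIGINS_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.singletonList("*");
        }
        List<String> origins = new ArrayList<>();
        for (String part : raw.split(",")) {
            String origin = part.trim();
            if (!origin.isEmpty()) {
                origins.add(origin);
            }
        }
        return origins.isEmpty() ? Collections.singletonList("*") : origins;
    }

    // Value for the Access-Control-Allow-Origin response header,
    // or null when the header should be omitted for this request.
    static String allowOriginHeader(String requestOrigin, List<String> allowed) {
        if (allowed.contains("*")) {
            return "*";
        }
        return (requestOrigin != null && allowed.contains(requestOrigin)) ? requestOrigin : null;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ORIGINS_KEY, "http://tpalsulich.github.io");
        List<String> allowed = allowedOrigins(props);
        System.out.println(allowOriginHeader("http://tpalsulich.github.io", allowed));
        System.out.println(allowOriginHeader("http://other.example", allowed));
    }
}
```

Echoing back only origins from an explicit list avoids the blanket '*' that the original mail flags as insecure, while still letting the example page on github.io call the server.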
Re: Access Control Allow Origin
I'll change the option to -C right now. Just looked closer -- TIKA-1426 is to provide a config for the server and app on the command line. Tyler On Wed, Apr 1, 2015 at 11:22 AM, Allison, Timothy B. talli...@mitre.org wrote: Might be thinking of TIKA-944? Mind if we switch the CORS short option to -C and use -c for the tika config file? -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Wednesday, April 01, 2015 11:13 AM To: dev@tika.apache.org Subject: Re: Access Control Allow Origin Thank you for the feedback! I think there's an issue (don't remember the number) to be able to specify a TikaConfig file for tika-server. So, I think that would be the ideal place to put more complex CORS configuration. Tyler On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com wrote: Hi Tyler Sorry for the delay, I was off for the last few days. The change you did looks fine, the filter can check the annotations or can be configured directly (which is what you did). It might make sense to consider checking a (Java) properties resource as a possible future enhancement, as a CORS filter may have many properties. Maybe if '-cors' is provided, then check a well-known class resource where all of the CORS properties are set; if it is absent, default to '*', otherwise work with Properties... The current approach works too. It might be tricky to extend it to support more properties, but it is great for a start. Thanks, Sergey On 27/03/15 18:56, Tyler Palsulich wrote: Thank you, Sergey! I didn't know about that feature. I am going to try to work up a patch this weekend which enables CORS. I'll let you know if I run into any issues. Thanks again, Tyler On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: ++ Chris Mattmann, Ph.D.
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Tuesday, March 24, 2015 at 3:41 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Access Control Allow Origin Hi Folks, I took a stab at creating an example website to submit a file to the form resource of our VM. See http://tpalsulich.github.io/TikaExamples/. If I try to use AJAX to submit the request to make the page prettier (see the script in the head of the page, with ev.preventDefault() commented out), I get the following error: XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://tpalsulich.github.io' is therefore not allowed access. The response had HTTP status code 400. We can't allow the tika-server response header to accept * in general, since that isn't secure. So, would there be interest in including this sort of site on the VM? Then, the AJAX request won't be external and we won't have this error. The version button just takes you to the version resource on the VM (doesn't do anything with the file). Tyler
[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390841#comment-14390841 ] Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM: --- Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now closed. was (Author: tpalsulich): Done. It works. I'll see if I can shut 9997 down right now. Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code
All tests are passing. Only issue I see is excessive logging. The Hudson failure does just look like a hiccup. Tyler On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org wrote: This looks like a Hudson hiccup. Tyler is seeing excessive logging: Running org.apache.tika.cli.TikaCLIBatchIntegrationTest INFO - about to start driver INFO - about to start driver Anyone else having problems building from a fresh trunk? -Original Message- From: Hudson (JIRA) [mailto:j...@apache.org] Sent: Wednesday, April 01, 2015 5:36 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code [ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391512#comment-14391512 ] Hudson commented on TIKA-1330: -- ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [ https://builds.apache.org/job/tika-trunk-jdk1.7/596/]) TIKA-1330 flush stacktrace writers (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1670756) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java * /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java TIKA-1330 clean up logging in tika-batch ant tika-app integration of tika-batch, take 2 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=1670751) * /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1558: -- Description: As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- was: As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. 
We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- -- This message was sent by Atlassian JIRA (v6.3.4#6332)
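For illustration only, the "remove everything assignable to a blacklisted class" step from the struck-through initial design could be sketched roughly as below. This is not Tika's ServiceLoader code: the class name BlacklistFilterSketch and the method removeBlacklisted are hypothetical, and plain JDK types stand in for Parser implementations so the sketch runs without Tika on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; an assumption for illustration, not part of Tika's ServiceLoader. */
final class BlacklistFilterSketch {

    /**
     * Returns the providers whose class is neither a blacklisted class
     * nor a subtype of one.
     */
    static <T> List<T> removeBlacklisted(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                // banned.isAssignableFrom(X) is true when X is banned itself or a subtype of it
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-ins for loaded providers: one String and two Number subclasses.
        List<Object> providers = Arrays.asList("text", Integer.valueOf(1), Double.valueOf(2.5));
        // Blacklisting Number also drops Integer and Double, its subclasses.
        List<Class<?>> blacklist = Arrays.<Class<?>>asList(Number.class);
        System.out.println(removeBlacklisted(providers, blacklist));
    }
}
```

Using isAssignableFrom rather than an exact class-name match is what makes subclasses of a blacklisted class drop out as well, which matches the behaviour described for the blacklist in this thread.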
[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1432#comment-1432 ] Tyler Palsulich edited comment on TIKA-1558 at 3/31/15 9:41 PM: -Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted.- Edit: Service loading blacklisting disabled in r1670487. Use a custom TikaConfig like [this one|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml] to disable a Parser. Any subclasses of that Parser will also be excluded. was (Author: tpalsulich): Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: including refactored docs from govdocs1 in test suite
Can you copy the hyperlink into a new doc and change the URL? I have no idea about including the modified version. Tyler On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote: All, As part of TIKA-1512, I found that I can delete all of the contents, including the metadata, except for one hyperlink in two documents from govdocs1 and still get the proper behavior -- fail before fix, work after fix. These documents are in the public domain. Is it ok to include these modified documents in our test suite or should I avoid inclusion? Happy to avoid inclusion for the sake of a quick release of 1.8 and then we have time to discuss/determine way ahead... unless the answer is obvious. Best, Tim -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, March 30, 2015 7:03 AM To: dev@tika.apache.org Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1 Unless there are objections, I'd like these to be resolved before 1.8: TIKA-1584 -- I'll fix TIKA-1575 -- Resolved by Konstantin Gribov (thank you!) TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but I'll leave this open and do some more digging to see if we need to open a ticket at the POI level TIKA-1511 -- I'll remove provided for xerial TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no? I'll have these fixes completed by noon EDT. Should I run against govdocs1 before or after the RC? My last build of Tika app (a few days ago) ballooned to ~43MB, and that's before I add ~3MB for xerial. Tika server is now ~48MB. As of my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server jars. Best, Tim -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Sunday, March 29, 2015 9:13 AM To: dev@tika.apache.org Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1 Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless something else pops up). 
Thank you everyone. Tyler On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote: +1 for 1.8 Hong-Thai On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote: Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[jira] [Commented] (TIKA-1587) ForkParser::setJavaCommand should take List<String>
[ https://issues.apache.org/jira/browse/TIKA-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386685#comment-14386685 ] Tyler Palsulich commented on TIKA-1587: --- Thank you for reporting this! It seems like a definite problem. Is there any way you can provide a patch? ForkParser::setJavaCommand should take List<String> --- Key: TIKA-1587 URL: https://issues.apache.org/jira/browse/TIKA-1587 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Oleg Oshmyan ForkParser::setJavaCommand currently takes a string and splits it on whitespace. This makes it impossible to use commands with paths that contain spaces. In particular, it makes it impossible to reliably use System.getProperty("java.home") in order to launch the same Java that the current process is running in, because it might contain spaces. If it would just take a List<String> and pass (a clone of) it directly to ProcessBuilder, this wouldn't be a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
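A minimal sketch of the problem described in the issue, using only the JDK. The class JavaCommandSketch and both of its methods are hypothetical names for illustration; they are not ForkParser's API, and the whitespace split here only mimics the behaviour the reporter describes.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not ForkParser code. */
final class JavaCommandSketch {

    /** Naive whitespace split, the behaviour the issue describes as a problem. */
    static List<String> splitOnWhitespace(String command) {
        return Arrays.asList(command.trim().split("\\s+"));
    }

    /** Builds the command as a list so a java.home containing spaces stays one argument. */
    static List<String> javaCommand(String javaHome, String... jvmArgs) {
        List<String> command = new ArrayList<>();
        command.add(Paths.get(javaHome, "bin", "java").toString());
        command.addAll(Arrays.asList(jvmArgs));
        return command;
    }

    public static void main(String[] args) {
        List<String> command = javaCommand(System.getProperty("java.home"), "-Xmx32m");
        // ProcessBuilder accepts the list directly and does not re-split its elements.
        // A copy is passed so later changes to 'command' do not affect the builder.
        ProcessBuilder builder = new ProcessBuilder(new ArrayList<>(command));
        System.out.println(builder.command());
    }
}
```

With a java.home such as "/opt/my java", the whitespace split breaks the executable path into two tokens, while the list form keeps it as a single element, which is the case the reporter is asking to support.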
RE: including refactored docs from govdocs1 in test suite
Ah. I see. In general, what is the goal with handling corrupted files? Extract as much as possible and fail gracefully? Tyler On Mar 30, 2015 9:32 AM, Allison, Timothy B. talli...@mitre.org wrote: Unfortunately, no. MSOffice fixes the document when I do that. -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Monday, March 30, 2015 9:24 AM To: dev@tika.apache.org Subject: Re: including refactored docs from govdocs1 in test suite Can you copy the hyperlink into a new doc and change the URL? I have no idea about including the modified version. Tyler On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote: All, As part of TIKA-1512, I found that I can delete all of the contents, including the metadata, except for one hyperlink in two documents from govdocs1 and still get the proper behavior -- fail before fix, work after fix. These documents are in the public domain. Is it ok to include these modified documents in our test suite or should I avoid inclusion? Happy to avoid inclusion for the sake of a quick release of 1.8 and then we have time to discuss/determine way ahead... unless the answer is obvious. Best, Tim -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, March 30, 2015 7:03 AM To: dev@tika.apache.org Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1 Unless there are objections, I'd like these to be resolved before 1.8: TIKA-1584 -- I'll fix TIKA-1575 -- Resolved by Konstantin Gribov (thank you!) TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but I'll leave this open and do some more digging to see if we need to open a ticket at the POI level TIKA-1511 -- I'll remove provided for xerial TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no? I'll have these fixes completed by noon EDT. Should I run against govdocs1 before or after the RC? My last build of Tika app (a few days ago) ballooned to ~43MB, and that's before I add ~3MB for xerial. 
Tika server is now ~48MB. As of my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server jars. Best, Tim -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Sunday, March 29, 2015 9:13 AM To: dev@tika.apache.org Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1 Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless something else pops up). Thank you everyone. Tyler On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote: +1 for 1.8 Hong-Thai On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote: Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
Re: [DISCUSS] Tika 1.8 or 1.7.1
I just remembered TIKA-1509 and TIKA-1558 -- testing now for blacklist functionality through TIKA-1509. If that works, I'll back out TIKA-1558. Tim, I think you should run govdocs from the RC, in case something changes between your run and the cut. Tyler On Mon, Mar 30, 2015 at 10:17 AM, Allison, Timothy B. talli...@mitre.org wrote: All, I've made the changes that I had hoped to. Grib pdf exclusion remains for any takers. Let me know when I should initiate the run against govdocs1 to see if there are any surprises on that corpus with Tika 1.8. Best, Tim -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, March 30, 2015 7:03 AM To: dev@tika.apache.org Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1 Unless there are objections, I'd like these to be resolved before 1.8: TIKA-1584 -- I'll fix TIKA-1575 -- Resolved by Konstantin Gribov (thank you!) TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but I'll leave this open and do some more digging to see if we need to open a ticket at the POI level TIKA-1511 -- I'll remove provided for xerial TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no? I'll have these fixes completed by noon EDT. Should I run against govdocs1 before or after the RC? My last build of Tika app (a few days ago) ballooned to ~43MB, and that's before I add ~3MB for xerial. Tika server is now ~48MB. As of my last build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server jars. Best, Tim -Original Message- From: Tyler Palsulich [mailto:tpalsul...@gmail.com] Sent: Sunday, March 29, 2015 9:13 AM To: dev@tika.apache.org Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1 Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless something else pops up). Thank you everyone. 
Tyler On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote: +1 for 1.8 Hong-Thai On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote: Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386906#comment-14386906 ] Tyler Palsulich edited comment on TIKA-1584 at 3/30/15 4:05 PM: Yup! The 1.8 release process should start this week. Ideally, it will hit the mirrors some time next week. [edit: 1.8, not 1.7!] was (Author: tpalsulich): Yup! The 1.7 release process should start this week. Ideally, it will hit the mirrors some time next week. Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh Assignee: Tim Allison Priority: Blocker Fix For: 1.8 I tried to send this to the tika user list, but got a qmail failure so I am opening a jira to see if I can get help with this. There appears to be a change in the behavior of tika since 1.5 (the last version we have used). In 1.5, if we pass a file with content type of rfc822 which contains a zip that contains a docx file, the entire content would get recursed and the text returned. In 1.7, tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for. We are testing with tika-server if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected. 
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
Content-Type,message/rfc822
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
sign.docx <--- this is not expected, need contents of this extracted
We can easily reproduce this problem with a simple eml file with an attachment. Can someone please comment if this seems like a problem or perhaps we need to change something in our call to get the old behavior? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1575: -- Fix Version/s: 1.8 Upgrade to PDFBox 1.8.9 when available -- Key: TIKA-1575 URL: https://issues.apache.org/jira/browse/TIKA-1575 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, reports_1_8_9_multithread_vs_single.zip The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [DISCUSS] Tika 1.8 or 1.7.1
Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless something else pops up). Thank you everyone. Tyler On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote: +1 for 1.8 Hong-Thai On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote: Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[jira] [Resolved] (TIKA-1579) Add file type to NetCDFParser
[ https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1579. --- Resolution: Fixed Add file type to NetCDFParser - Key: TIKA-1579 URL: https://issues.apache.org/jira/browse/TIKA-1579 Project: Tika Issue Type: Improvement Components: parser Reporter: Ann Burgess Assignee: Ann Burgess Attachments: TIKA-1579.abburgess.190315.patch.txt [~gostep] explains that, there are three versions of NetCDF (classic format, 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF file, the netCDF library will transparently detect its format so we do not need to adjust according to the detected format. That said, it would be good to know the file type as each can have the .nc extension. This will add patch with add file type to the metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [DISCUSS] Tika 1.8 or 1.7.1
I'm also leaning toward 1.8. Especially given the newly identified regression in TIKA-1584. Tyler On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Tyler - I would VOTE for 1.8. Given the stuff associated with releasing (updating the website; sending emails; waiting periods, etc.) let’s ship all the updates we have too along with the jhighlight fix. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@apache.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, March 28, 2015 at 8:01 AM To: dev@tika.apache.org dev@tika.apache.org Subject: [DISCUSS] Tika 1.8 or 1.7.1 Hi Folks, Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1? Does anyone have any last minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others? Have a good weekend, Tyler
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483 ] Tyler Palsulich commented on TIKA-1584: --- We now have two major issues which need a quick release. So, I would say go for 1.8. Tim, can you chime in on the current discuss thread? Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh Assignee: Tim Allison Priority: Blocker I tried to send this to the tika user list, but got a qmail failure so I am opening a jira to see if I can get help with this. There appears to be a change in the behavior of tika since 1.5 (the last version we have used). In 1.5, if we pass a file with content type of rfc822 which contains a zip that contains a docx file, the entire content would get recursed and the text returned. In 1.7, tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for. We are testing with tika-server if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected.
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
Content-Type,message/rfc822
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
sign.docx <--- this is not expected, need contents of this extracted
We can easily reproduce this problem with a simple eml file with an attachment.
Can someone please comment if this seems like a problem or perhaps we need to change something in our call to get the old behavior? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1585) Create Example Website with Form Submission
Tyler Palsulich created TIKA-1585: - Summary: Create Example Website with Form Submission Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1526. --- Resolution: Fixed Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or anyone else, please reopen this if you find any other cases. Thank you everyone for the help! ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man
the JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301
As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled & configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...
{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4] at java.security.AccessController.doPrivileged(Native Method)
[junit4] at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4] at java.lang.Runtime.exec(Runtime.java:620)
[junit4] at java.lang.Runtime.exec(Runtime.java:485)
[junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4] at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
...unless they go out of their way to white list only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
{code}
} catch (Error err) {
  if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}
...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... i'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
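One way the "opt out when the check hits this error" idea could look, as a rough sketch only. SpawnCheckSketch, isLocaleSpawnBug and check are hypothetical names and this is not the fix that went into Tika; it simply treats the locale-related Error as "external tool not available" and lets every other Error propagate.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration; not Tika's ExternalParser.check. */
final class SpawnCheckSketch {

    /** True when the Error message matches the JDK locale bug described above. */
    static boolean isLocaleSpawnBug(Error err) {
        String msg = err.getMessage();
        return msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    /** Returns true only if the command could be launched and exited with status 0. */
    static boolean check(String... command) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            try (InputStream in = process.getInputStream()) {
                while (in.read() != -1) {
                    // drain output so the child cannot block on a full pipe
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                return false; // behave as if the external tool were not installed
            }
            throw err; // unrelated Errors (e.g. OutOfMemoryError) still propagate
        }
    }
}
```

Matching on the message text is fragile, but it mirrors the Solr snippet quoted above and avoids swallowing Errors that have nothing to do with process launching.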
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385337#comment-14385337 ] Tyler Palsulich commented on TIKA-1581: --- Hi [~kkrugler]. Thanks. The comment is now bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight (https://github.com/codelibs/jhighlight) If this looks good, I'll start a \[DISCUSS\] thread on the list about a new version. jhighlight license concerns --- Key: TIKA-1581 URL: https://issues.apache.org/jira/browse/TIKA-1581 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Karl Wright Fix For: 1.8 jhighlight jar is a Tika dependency. The Lucene team discovered that, while it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL only: {code} Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself as dual CDDL or LGPL license. However, some of its classes are distributed only under LGPL, e.g. com.uwyn.jhighlight.highlighter. CppHighlighter.java GroovyHighlighter.java JavaHighlighter.java XmlHighlighter.java I downloaded the sources from Maven (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar) to confirm that, and also found this SVN repo: http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's website seems to not exist anymore (https://jhighlight.dev.java.net/). I didn't find any direct usage of it in our code, so I guess it's probably needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, things will compile, but may fail at runtime. {code} Is it possible to remove this dependency for future releases, or allow only optional inclusion of this package? It is of concern to the ManifoldCF project because we distribute a binary package that includes Tika and its required dependencies, which currently includes jHighlight. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1586) Enable CORS on Tika Server
Tyler Palsulich created TIKA-1586: - Summary: Enable CORS on Tika Server Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
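As a rough illustration of the "which origins are allowed" part of the configuration question, the decision could be reduced to an allow-list lookup like the one below. This is plain Java and deliberately does not use CXF's CrossOriginResourceSharingFilter; the class CorsOriginSketch and its method are hypothetical, and a real filter also has to deal with preflight requests, allowed methods and allowed headers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of an origin allow-list; not the CXF filter. */
final class CorsOriginSketch {

    private final Set<String> allowedOrigins;

    CorsOriginSketch(String... allowedOrigins) {
        this.allowedOrigins = new HashSet<>(Arrays.asList(allowedOrigins));
    }

    /**
     * Returns the value to send as Access-Control-Allow-Origin,
     * or null when no CORS header should be added.
     */
    String allowOriginHeader(String requestOrigin) {
        if (requestOrigin == null) {
            return null; // no Origin header, so not a cross-origin request
        }
        if (allowedOrigins.contains("*")) {
            return "*"; // configured to allow any origin
        }
        return allowedOrigins.contains(requestOrigin) ? requestOrigin : null;
    }

    public static void main(String[] args) {
        CorsOriginSketch cors = new CorsOriginSketch("http://tpalsulich.github.io");
        System.out.println(cors.allowOriginHeader("http://tpalsulich.github.io"));
        System.out.println(cors.allowOriginHeader("http://example.org"));
    }
}
```

Limiting which resources get CORS headers would be a separate decision on top of this, for example by only attaching the check to selected endpoints.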
[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1586. --- Resolution: Fixed Fixed in r1669799. Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385411#comment-14385411 ] Tyler Palsulich commented on TIKA-1585: --- CORS work is now integrated. [~talli...@mitre.org], can you restart the server on 162.242.228.174:9998 with the --cors http://tpalsulich.github.io option? Then, we can close off the 9997 port (my github.io site is querying 9997, though, so I'll need to update that). Is there an official place we'd like to host the above site? Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[DISCUSS] Tika 1.8 or 1.7.1
Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to release a new version of Tika. I'll volunteer to be the release manager again. Should we release this as 1.8 or 1.7.1?

Does anyone have any last-minute issues they'd like to finish and see in Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and TIKA-1586). Any others?

Have a good weekend,
Tyler
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385372#comment-14385372 ]

Tyler Palsulich commented on TIKA-1586:

Can someone take a look at the above PR and make sure I'm not doing anything bone-headed? Thanks!

Enable CORS on Tika Server
    Key: TIKA-1586
    URL: https://issues.apache.org/jira/browse/TIKA-1586
    Project: Tika
    Issue Type: New Feature
    Components: server
    Reporter: Tyler Palsulich
    Assignee: Tyler Palsulich

Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that...
[jira] [Closed] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-1354.
    Resolution: Fixed
    Fix Version/s: 1.7

Marking as Fixed.

ForkParser doesn't work in OSGI container
    Key: TIKA-1354
    URL: https://issues.apache.org/jira/browse/TIKA-1354
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.6
    Reporter: Michal Hlavac
    Fix For: 1.7

I can't find a way to run ForkParser in an OSGI container.
Enabling CORS
Hi Folks,

I'm trying to enable CORS on a few of Tika's Server resources. But, after adding the pom.xml dependency and a @CrossOriginResourceSharing(allowOrigins = {url}) annotation to the resources, the Access-Control-Allow-Origin header is still not sent. Is there another configuration I need to add? Tika's server doesn't currently have a bean configuration like the one at the bottom of the examples page: http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples

Thanks for any help,
Tyler