Re: Searching for Tika Jira issues using Lucene

2014-03-06 Thread Michael McCandless
Woops, thank you for moving this to the right mailing list Oleg! Mike McCandless http://blog.mikemccandless.com On Thu, Mar 6, 2014 at 12:56 AM, Oleg Tikhonov o...@apache.org wrote: Hi Mike! Sounds great! Thanks. Oleg On Wed, Mar 5, 2014 at 6:47 PM, Michael McCandless

[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Alexandre Madurell (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321 ] Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:

Re: Using guava on tika ?

2014-03-06 Thread Nick Burch
On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote: Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika? Can you give an example of where using Guava would either simplify some existing code, or improve its

[jira] [Created] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1257: -- Summary: MS Word Filter out control characters on ouput Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type:

[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1257: --- Attachment: tika-doc-control-char.png 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc

[jira] [Resolved] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1257. Resolution: Fixed Fixed on r1574874 MS Word Filter out control characters on ouput
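For readers unfamiliar with the kind of fix TIKA-1257 describes, here is a minimal Java sketch of stripping control characters from extracted text. It is not the change committed in r1574874. The class name ControlCharFilter, the method stripControlChars, and the decision to keep tab, line feed and carriage return are all assumptions made for illustration.

```java
// Hypothetical helper, not part of Tika: drops ISO control characters
// from a string while keeping common whitespace (tab, LF, CR).
final class ControlCharFilter {
    private ControlCharFilter() {}

    static String stripControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Walk by code point so supplementary characters stay intact.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean keptWhitespace = cp == '\t' || cp == '\n' || cp == '\r';
            if (keptWhitespace || !Character.isISOControl(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // BEL (U+0007) and DC3 (U+0013) stand in for the field markers
        // that can leak out of a .doc file; the choice is illustrative.
        String raw = "Hello\u0007 World\u0013!";
        System.out.println(stripControlChars(raw));
    }
}
```

Filtering at the point where text is emitted, rather than inside each parser, would be one way to apply such a helper, though where the real fix sits is not stated in this thread.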

buildbot failure in ASF Buildbot on tika-trunk

2014-03-06 Thread buildbot
The Buildbot has detected a new failure on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/1169 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: portunus_ubuntu Build Reason: scheduler Build Source

RE: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Hong-Thai Nguyen
Hi, Anyone can create branch remotes/origin/1.5 on git ? Thanks Hong-Thai -Original Message- From: David Meikle [mailto:loo...@gmail.com] On behalf of David Meikle Sent: Wednesday, 19 February 2014 23:19 To: annou...@apache.org Cc: dev@tika.apache.org; u...@tika.apache.org Subject:

buildbot success in ASF Buildbot on tika-trunk

2014-03-06 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/1170 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: portunus_ubuntu Build Reason: scheduler Build Source

[jira] [Comment Edited] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922490#comment-13922490 ] Hong-Thai Nguyen edited comment on TIKA-1257 at 3/6/14 1:50 PM:

Re: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Jukka Zitting
Hi, On Thu, Mar 6, 2014 at 8:27 AM, Hong-Thai Nguyen hong-thai.ngu...@polyspot.com wrote: Anyone can create branch remotes/origin/1.5 on git ? Do we need a 1.5 branch? BR, Jukka Zitting

RE: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Hong-Thai Nguyen
I guess that users could maintain hotfixes based on a release branch while waiting for the next release. We already have branches for old releases: hong-thai.nguyen@HTN-PC /c/git/tika (trunk) $ git branch -a * trunk remotes/origin/0.1-incubating remotes/origin/0.10 remotes/origin/0.2

Re: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Jukka Zitting
Hi, On Thu, Mar 6, 2014 at 10:14 AM, Hong-Thai Nguyen hong-thai.ngu...@polyspot.com wrote: I guess that users could maintain hotfixes based on a release branch while waiting for the next release. Right, at least there's no harm in having the branch, so I just created it in revision 1574919. BR,

Re: Using guava on tika ?

2014-03-06 Thread Konstantin Gribov
If you do bring it in as a dependency, don't use Guava 15; use Guava 16. Guava 15 breaks CDI in major app servers (JBoss AS 7, GlassFish 3, WebSphere) because of an incorrect beans.xml. See https://issues.jboss.org/browse/WELD-1007 and https://code.google.com/p/guava-libraries/issues/detail?id=1527. -- Best

Re: Using guava on tika ?

2014-03-06 Thread Jukka Zitting
Hi, On Thu, Mar 6, 2014 at 6:54 AM, Nick Burch apa...@gagravarr.org wrote: On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote: Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika? Can you give an example of

RE: Using guava on tika ?

2014-03-06 Thread Hong-Thai Nguyen
Thanks for the feedback. There is nothing we can't do with our own code :) Guava just provides 'facilities' that make code clearer, shorter and sometimes faster. I agree that this integration brings more dependencies and may create conflicts in end-users' applications. Let's leave it as is for now. Cheers, Hong-Thai -Message

[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1257: --- Attachment: (was: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc) MS Word Filter out control

[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1257: --- Attachment: testControlCharacters.doc MS Word Filter out control characters on ouput

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922736#comment-13922736 ] Tim Allison commented on TIKA-1232: --- Fixed r1574959. Reopen if any tweaks remain to me

[jira] [Resolved] (TIKA-1232) Add PDF version to PDFParser output

2014-03-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1232. --- Resolution: Fixed r1574959 Add PDF version to PDFParser output ---

RE: Tika 1.5 vs 1.4 testing

2014-03-06 Thread Allison, Timothy B.
Hong-Thai, Thank you for running these tests. I suspect (mea culpa) that the increase in PDF runtime exception failures was caused by PDFBOX-1803/TIKA-1233, which was not fixed before 1.5 was cut. I recently made major modifications to the metadata extraction components of the PDFParser

[jira] [Resolved] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1252. --- Resolution: Fixed Fixed as of r1574964. Thank you, Alexandre, for raising this issue and for

[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-06 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922890#comment-13922890 ] Luis Filipe Nassif commented on TIKA-623: - Good job. I think a possible improvement

[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922921#comment-13922921 ] Tim Allison commented on TIKA-623: -- Agreed. Is there any way to reuse OutlookParser or to

Unconsistent logging in current tika (1.5)

2014-03-06 Thread Konstantin Gribov
Hi, folks. Tika-core is quite pure (uses only java.util.logging) but tika-parsers uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through netcdf) and log4j 1.2.14 (through slf4j-log4j as test scope dependency). Also some parsers (like pdfbox) log just to stdout/stderr. It's

[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923019#comment-13923019 ] Tim Allison commented on TIKA-1252: --- [~alexandre.madur...@gmail.com], before opening an

[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923399#comment-13923399 ] Tim Allison commented on TIKA-1252: --- Not immediately obvious to me how to use xmpbox with

Re: Unconsistent logging in current tika (1.5)

2014-03-06 Thread Nick Burch
On Fri, 7 Mar 2014, Konstantin Gribov wrote: Tika-core is quite pure (uses only java.util.logging) but tika-parsers uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through netcdf) and log4j 1.2.14 (through slf4j-log4j as test scope dependency). Also some parsers (like pdfbox)