Re: Searching for Tika Jira issues using Lucene
Woops, thank you for moving this to the right mailing list, Oleg!

Mike McCandless
http://blog.mikemccandless.com

On Thu, Mar 6, 2014 at 12:56 AM, Oleg Tikhonov o...@apache.org wrote:
> Hi Mike!
> Sounds great! Thanks.
> Oleg
>
> On Wed, Mar 5, 2014 at 6:47 PM, Michael McCandless luc...@mikemccandless.com wrote:
>> Team,
>>
>> If you want to search for Tika Jira issues, I just added Tika coverage into the Lucene dog food server we use for finding Lucene/Solr issues at http://jirasearch.mikemccandless.com.
>>
>> I just posted a blog post describing recent changes: http://blog.mikemccandless.com/2014/03/using-lucenes-search-server-to-search.html
>>
>> Basically I started this as an effort to test Lucene's functionality in a real application/server (searching for issues), and to eat our own dog food, but over time I think it's proven quite useful and I now use it almost exclusively when I need to find a Lucene issue.
>>
>> Compared to Jira's built-in search, it behaves more like a full-text search engine: it makes suggestions as you type, produces snippets and highlights, ranks by blended relevance and recency, and so on. It has facets so you can quickly drill down/sideways by various metadata. In the results, you can click on a snippet to go straight to the specific comment and issue that it came from. It uses Lucene's near-real-time indexing + searching, so issue updates should be visible within about 30 seconds.
>>
>> I hope you find it useful too!
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321 ]

Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:

Hi, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference between Bag and Seq. Beats me why Adobe would choose an unordered array over an ordered array for the Author field in Acrobat's document properties form. In any case, as you mentioned, it makes it necessary to check on both before falling back to PDDocumentInformation's getAuthor().

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata insertion so it uses the rdf:Seq, and will re-process the entire collection (I will probably add PDFBox to the next implementation of our automated metadata insertion workflow, thanks again for the tip!).

Have a great one!

was (Author: alexandre.madur...@gmail.com):

Hi again, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference between Bag and Seq. Beats me why Adobe would choose an unordered array over an ordered array for the Author field in Acrobat's document properties form. In any case, as you mentioned, it makes it necessary to check on both before falling back to PDDocumentInformation's getAuthor().

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata insertion so it uses the rdf:Seq, and will re-process the entire collection (I will probably add PDFBox to the next implementation of our automated metadata insertion workflow, thanks again for the tip!).

Have a great one!
Tika is not indexing all authors of a PDF

Key: TIKA-1252
URL: https://issues.apache.org/jira/browse/TIKA-1252
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.4
Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, XMP-Import-with-Seq.jpg

When submitting a PDF with this information in its XMP metadata:

    ...
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
    ...

Only the first one appears in the collection:

    ...
    author:[Author 1],
    author_s:Author 1,
    ...

In spite of having set the field to multiValued in the Solr schema:

    <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>

Let me know if there's any further specific information I could provide.

Thanks in advance!

--
This message was sent by Atlassian JIRA (v6.2#6252)
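As an illustration of the point made in this issue (authors may sit in either an rdf:Bag or an rdf:Seq wrapper), here is a minimal sketch that collects every rdf:li under dc:creator regardless of the wrapper. It uses only the JDK's DOM parser, not Tika, PDFBox or jempbox. The class name `XmpCreators` and the method `creators` are hypothetical, and this is not the change that was committed for TIKA-1252.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class XmpCreators {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the text of every rdf:li below dc:creator, for rdf:Bag and rdf:Seq alike. */
    static List<String> creators(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList creatorNodes = doc.getElementsByTagNameNS(DC, "creator");
            for (int i = 0; i < creatorNodes.getLength(); i++) {
                Element creator = (Element) creatorNodes.item(i);
                // The wrapper element (Bag or Seq) is deliberately ignored.
                NodeList items = creator.getElementsByTagNameNS(RDF, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    result.add(items.item(j).getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns:dc=\"" + DC + "\">"
                + "<dc:creator><rdf:Bag>"
                + "<rdf:li>Author 1</rdf:li><rdf:li>Author 2</rdf:li>"
                + "</rdf:Bag></dc:creator></rdf:RDF>";
        System.out.println(creators(xmp));
    }
}
```

With multiple values collected this way, each author could then be added as a separate value of a multi-valued metadata field, which is what the multiValued Solr field in the report expects.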
Re: Using guava on tika ?
On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
> Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika?

Can you give an example of where using Guava would either simplify some existing code, or improve its effectiveness, or permit something we couldn't otherwise do?

Nick
[jira] [Created] (TIKA-1257) MS Word Filter out control characters on ouput
Hong-Thai Nguyen created TIKA-1257:

Summary: MS Word Filter out control characters on ouput
Key: TIKA-1257
URL: https://issues.apache.org/jira/browse/TIKA-1257
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
Fix For: 1.6

Control characters are present mostly in the table of index and cannot be displayed. We should filter them out.
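The kind of filtering requested here can be sketched in a few lines. This is only an illustration of the idea, not the change committed for TIKA-1257: the class name `ControlCharFilter`, the method `strip`, and the choice to keep tab, line feed and carriage return are all assumptions.

```java
// Hypothetical helper, for illustration only.
class ControlCharFilter {

    /** Removes ISO control characters from text, keeping tab, LF and CR. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean whitespaceControl = cp == '\t' || cp == '\n' || cp == '\r';
            // Character.isISOControl covers U+0000..U+001F and U+007F..U+009F.
            if (whitespaceControl || !Character.isISOControl(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0013, U+0014 and U+0015 stand in for the non-printable markers in the sample.
        System.out.println(strip("Chapter 1\u0013 \u0014\t3\u0015"));
    }
}
```

Whether whitespace controls should survive depends on what the downstream consumer expects, so that part in particular is a design choice rather than a given.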
[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1257:

Attachment: tika-doc-control-char.png
            5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc
[jira] [Resolved] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1257.

Resolution: Fixed

Fixed on r1574874
buildbot failure in ASF Buildbot on tika-trunk
The Buildbot has detected a new failure on builder tika-trunk while building ASF Buildbot.
Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/1169

Buildbot URL: http://ci.apache.org/
Buildslave for this Build: portunus_ubuntu
Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1574874
Blamelist: thaichat04

BUILD FAILED: failed compile

sincerely,
 -The Buildbot
RE: [ANNOUNCE] Apache Tika 1.5 Released
Hi,

Can anyone create the branch remotes/origin/1.5 in git?

Thanks
Hong-Thai

-----Original Message-----
From: David Meikle [mailto:loo...@gmail.com] On behalf of David Meikle
Sent: Wednesday, February 19, 2014 23:19
To: annou...@apache.org
Cc: dev@tika.apache.org; u...@tika.apache.org
Subject: [ANNOUNCE] Apache Tika 1.5 Released

The Apache Tika project is pleased to announce the release of Apache Tika 1.5. The release contents have been pushed out to the main Apache release site and to the Maven Central sync, so the releases should be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Apache Tika 1.5 contains a number of improvements and bug fixes. Details can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.5.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.5-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

--
Dave Meikle, on behalf of the Apache Tika community
buildbot success in ASF Buildbot on tika-trunk
The Buildbot has detected a restored build on builder tika-trunk while building ASF Buildbot.
Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/1170

Buildbot URL: http://ci.apache.org/
Buildslave for this Build: portunus_ubuntu
Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1574877
Blamelist: thaichat04

Build succeeded!

sincerely,
 -The Buildbot
[jira] [Comment Edited] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922490#comment-13922490 ]

Hong-Thai Nguyen edited comment on TIKA-1257 at 3/6/14 1:50 PM:

Fixed on r1574874 r1574877

was (Author: thaichat04):

Fixed on r1574874
Re: [ANNOUNCE] Apache Tika 1.5 Released
Hi,

On Thu, Mar 6, 2014 at 8:27 AM, Hong-Thai Nguyen hong-thai.ngu...@polyspot.com wrote:
> Can anyone create the branch remotes/origin/1.5 in git?

Do we need a 1.5 branch?

BR,

Jukka Zitting
RE: [ANNOUNCE] Apache Tika 1.5 Released
I guess users could maintain hotfixes based on a release branch while waiting for the next release. We already have branches for old releases:

hong-thai.nguyen@HTN-PC /c/git/tika (trunk)
$ git branch -a
* trunk
  remotes/origin/0.1-incubating
  remotes/origin/0.10
  remotes/origin/0.2
  remotes/origin/0.3
  remotes/origin/0.4-rc1
  remotes/origin/0.4-rc2
  remotes/origin/0.5
  remotes/origin/0.6
  remotes/origin/0.7
  remotes/origin/0.8
  remotes/origin/0.9
  remotes/origin/0.x
  remotes/origin/1.2
  remotes/origin/1.3
  remotes/origin/1.4
  remotes/origin/HEAD -> origin/trunk
  remotes/origin/TIKA-204
  remotes/origin/trunk

Hong-Thai

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Thursday, March 6, 2014 15:48
To: Tika Development
Subject: Re: [ANNOUNCE] Apache Tika 1.5 Released

Hi,

On Thu, Mar 6, 2014 at 8:27 AM, Hong-Thai Nguyen hong-thai.ngu...@polyspot.com wrote:
> Can anyone create the branch remotes/origin/1.5 in git?

Do we need a 1.5 branch?

BR,

Jukka Zitting
Re: [ANNOUNCE] Apache Tika 1.5 Released
Hi,

On Thu, Mar 6, 2014 at 10:14 AM, Hong-Thai Nguyen hong-thai.ngu...@polyspot.com wrote:
> I guess users could maintain hotfixes based on a release branch while waiting for the next release.

Right, at least there's no harm in having the branch, so I just created it in revision 1574919.

BR,

Jukka Zitting
Re: Using guava on tika ?
If you do bring it in as a dependency, don't use Guava 15; use Guava 16. Guava 15 breaks CDI in major app servers (JBoss AS 7, GlassFish 3, WebSphere) because of an incorrect beans.xml. See https://issues.jboss.org/browse/WELD-1007 and https://code.google.com/p/guava-libraries/issues/detail?id=1527.

--
Best regards,
Konstantin Gribov.

2014-03-06 15:54 GMT+04:00 Nick Burch apa...@gagravarr.org:
> On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
>> Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika?
>
> Can you give an example of where using Guava would either simplify some existing code, or improve its effectiveness, or permit something we couldn't otherwise do?
>
> Nick
Re: Using guava on tika ?
Hi,

On Thu, Mar 6, 2014 at 6:54 AM, Nick Burch apa...@gagravarr.org wrote:
> On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
>> Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika?
>
> Can you give an example of where using Guava would either simplify some existing code, or improve its effectiveness, or permit something we couldn't otherwise do?

Also, especially in tika-core we've explicitly avoided any external dependencies to keep it as simple and easy as possible to include as a dependency in client applications. We've even gone as far as including copies of some Commons IO classes in org.apache.tika.io instead of referring to commons-io as a dependency.

BR,

Jukka Zitting
RE: Using guava on tika ?
Thanks for the feedback. There's nothing we can't do with our own code :) Guava just provides facilities that make code clearer, shorter and sometimes faster. I agree that integrating it brings in more dependencies and may create conflicts in end users' applications. Let's leave things as they are for now.

Cheers,
Hong-Thai

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Thursday, March 6, 2014 16:47
To: Tika Development
Subject: Re: Using guava on tika ?

Hi,

On Thu, Mar 6, 2014 at 6:54 AM, Nick Burch apa...@gagravarr.org wrote:
> On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
>> Guava (https://code.google.com/p/guava-libraries/) provides many facilities for text, file, collection ... manipulation. Should we use it in Tika?
>
> Can you give an example of where using Guava would either simplify some existing code, or improve its effectiveness, or permit something we couldn't otherwise do?

Also, especially in tika-core we've explicitly avoided any external dependencies to keep it as simple and easy as possible to include as a dependency in client applications. We've even gone as far as including copies of some Commons IO classes in org.apache.tika.io instead of referring to commons-io as a dependency.

BR,

Jukka Zitting
[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1257:

Attachment: (was: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc)
[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1257:

Attachment: testControlCharacters.doc
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922736#comment-13922736 ]

Tim Allison commented on TIKA-1232:

Fixed r1574959. Reopen if any tweaks remain to be made. Thank you, all, for your contributions!

Add PDF version to PDFParser output

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know whether an alternative metadata key should be used or this new one added. Comments welcome.
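For readers unfamiliar with where a PDF's version comes from, here is a rough sketch that reads it from the `%PDF-x.y` file header using only the JDK. This is a swapped-in technique for illustration: the issue obtains the version via PDFBox, and the attached patches are not reproduced here. The class name `PdfHeaderVersion`, the method `fromHeader`, and the 1024-byte search window are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Looks for a "%PDF-x.y" marker in the leading bytes of a file. */
    static Optional<String> fromHeader(byte[] leadingBytes) {
        // Assumption: only the first 1024 bytes are inspected.
        int length = Math.min(leadingBytes.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(leadingBytes, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = HEADER.matcher(head);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(fromHeader(sample).orElse("unknown"));
    }
}
```

Note that a document catalog can declare a later version than the header, so a header-only check like this one may under-report; that is one reason to rely on a full PDF library instead.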
[jira] [Resolved] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1232.

Resolution: Fixed

r1574959
RE: Tika 1.5 vs 1.4 testing
Hong-Thai,

Thank you for running these tests. I suspect (mea culpa) that the increase in PDF runtime exception failures was caused by PDFBOX-1803/TIKA-1233, which was not fixed before 1.5 was cut.

I recently made major modifications to the metadata extraction components of the PDFParser (TIKA-1232 and TIKA-1252). If you have time, would you mind rerunning these tests with trunk on your test corpus? I'd be interested to see if the temporary fix to TIKA-1233 lowers the number of PDF runtime exception failures, and I'd be very interested to see if there are any surprises caused by 1232 and 1252.

Thank you!

Best,
Tim

-----Original Message-----
From: Hong-Thai Nguyen [mailto:hong-thai.ngu...@polyspot.com]
Sent: Monday, March 03, 2014 8:19 AM
To: dev@tika.apache.org
Subject: Tika 1.5 vs 1.4 testing

Hi all,

I've checked on the same corpus. Here's the comparison:

||Tika||POI||PDFbox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|

== TIKA 1.4
- pdf (7)
  * (1) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@4d39a96c
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@4d39a96c
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (8)
  * (7) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
  * (1) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@4db190a5
- doc (2)
  * (2) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- ppt (40)
  * (39) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
  * (1) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- xls (9)
  * (7) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
  * (2) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- dwg (4)
  * (4) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: AC1014
- odp (2)
  * (2) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@7286f080
- rtf (13)
  * (13) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@455a7af4
- pps (5)
  * (5) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2

== TIKA 1.5
- pdf (16)
  * (10) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@1e59efa5
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@1e59efa5
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (19)
  * (7)
[jira] [Resolved] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1252.

Resolution: Fixed

Fixed as of r1574964. Thank you, Alexandre, for raising this issue and for supplying test files for this and TIKA-1232!
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922890#comment-13922890 ]

Luis Filipe Nassif commented on TIKA-623:

Good job. I think a possible improvement would be to generate an HTML document for each email, containing its metadata and content, and call the embeddedExtractor to process the generated HTML, instead of printing all emails directly to the xhtmlContentHandler. That way, in addition to attachments, emails could also be extracted from PST files if that is the goal of the application. What do you think?

Add support for Outlook PST

Key: TIKA-623
URL: https://issues.apache.org/jira/browse/TIKA-623
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Tran Nam Quang
Assignee: Hong-Thai Nguyen
Fix For: 1.6
Attachments: OutlookPSTParser.java

Hello everyone,

As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/

I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.

Best regards
Tran Nam Quang
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922921#comment-13922921 ]

Tim Allison commented on TIKA-623:

Agreed. Is there any way to reuse OutlookParser, or to refactor so that we're using the same lib for an email, whether .pst or .msg? There are lots of lessons learned embedded in the OutlookParser. I'll be happy to chip in as I can.

[~thaichat04], thank you for getting this rolling!
Unconsistent logging in current tika (1.5)
Hi, folks.

Tika-core is quite pure (it uses only java.util.logging), but tika-parsers uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through netcdf) and log4j 1.2.14 (through slf4j-log4j as a test-scope dependency). Also, some parsers (like pdfbox) log straight to stdout/stderr.

It's confusing:

- Tika-core uses only JUL.
- Tika-parsers uses JCL and log4j (in tests) and depends on slf4j-api.
- Tika-app uses JCL, configures log4j at runtime (to change the verbosity level) and depends on slf4j-log4j12.
- Tika-server uses only JCL but depends on slf4j-api 1.7.5 (through cxf).

What do you think about switching all logging to a current slf4j and excluding JCL from the dependencies altogether?

The first group of options is about whether to add slf4j-api to the tika-core dependencies. If it is added, we won't use JUL. If it isn't, jul-to-slf4j can be added to the tika-parsers dependencies.

The second group of options relates to commons-logging. We can:

- exclude it and force the developer to add either jcl-over-slf4j or commons-logging as a dependency,
- exclude it and add jcl-over-slf4j as a dependency, so anyone who uses JCL will be forced to exclude jcl-over-slf4j,
- leave it and force one to either use slf4j-jcl + commons-logging or exclude commons-logging and include jcl-over-slf4j.

I think the second way is preferable, because developers can use any slf4j backend and will be forced to do something only if they are using JCL.

The third group of options is about the backend for slf4j. We can use log4j or logback. I prefer logback-classic, but we can use either. Both support changing the log level at runtime.

If my proposal is accepted, I can refactor the Tika codebase to log in a consistent manner and then create a pull request on GitHub or a Jira ticket with a patch.

By the way, I think we should also update edu.ucar:netcdf to 4.2.20, which depends on the newer slf4j-api 1.6.1.

--
Best regards,
Konstantin Gribov.
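To make the "keep JUL in tika-core and bridge it elsewhere" option more concrete, here is a rough sketch of what a JUL bridge does in principle: a `java.util.logging.Handler` that forwards each record to another sink. It uses only the JDK; the class `ForwardingHandler`, its `attach` helper and the output format are hypothetical, and this is not the jul-to-slf4j implementation, only an illustration of the mechanism.

```java
import java.util.function.Consumer;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical bridge, for illustration only.
class ForwardingHandler extends Handler {
    private final Consumer<String> sink;

    ForwardingHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void publish(LogRecord record) {
        if (!isLoggable(record)) {
            return; // respects this handler's level and filter
        }
        // The raw message is forwarded; parameter substitution is left out for brevity.
        sink.accept(record.getLevel() + " " + record.getLoggerName() + " - " + record.getMessage());
    }

    @Override
    public void flush() {
        // nothing is buffered
    }

    @Override
    public void close() {
        // no resources to release
    }

    /** Routes a JUL logger to the sink and stops it from also writing to the console. */
    static Logger attach(String loggerName, Consumer<String> sink) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setUseParentHandlers(false);
        logger.addHandler(new ForwardingHandler(sink));
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = attach("org.example.demo", System.out::println);
        logger.warning("parser failed");
    }
}
```

The design point is that code logging through JUL needs no extra dependency, while the application that embeds it decides where records end up by installing a handler of this kind.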
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923019#comment-13923019 ]

Tim Allison commented on TIKA-1252:

[~alexandre.madur...@gmail.com], before opening an issue in PDFBOX on the seq vs bag, let me see if that issue disappears if we move to xmpbox from jempbox. I've only had a chance to look at the source, but I think that will prevent us from having to reinvent the fix.
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923399#comment-13923399 ]

Tim Allison commented on TIKA-1252:

Not immediately obvious to me how to use xmpbox with a regular PDDocument not generated by preflight's parser. Will stick with our jempbox workaround for now.
Re: Unconsistent logging in current tika (1.5)
On Fri, 7 Mar 2014, Konstantin Gribov wrote:
> Tika-core is quite pure (it uses only java.util.logging), but tika-parsers uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through netcdf) and log4j 1.2.14 (through slf4j-log4j as a test-scope dependency). Also, some parsers (like pdfbox) log straight to stdout/stderr.

I think part of the issue is that many of the libraries that Tika depends on have their own chosen logging library / setup. IIRC, the Tika parsers often log in a similar manner to the underlying library they use. That's not to say that we can't tidy things up a bit, but it does restrict how much we can do where log messages come from underlying libraries.

> It's confusing. Tika-core uses only JUL.

Tika-core ideally shouldn't have any external dependencies, so I'm not sure what else it can use while maintaining that?

> Tika-parsers uses JCL and log4j (in tests) and depends on slf4j-api. Tika-app uses JCL, configures log4j at runtime (to change the verbosity level) and depends on slf4j-log4j12. Tika-server uses only JCL but depends on slf4j-api 1.7.5 (through cxf).

Potentially some of these could be rationalised, though maybe the best we can hope for is to ensure they only use whatever their underlying dependencies use.

> By the way, I think we should also update edu.ucar:netcdf to 4.2.20, which depends on the newer slf4j-api 1.6.1.

Can you open a jira for that upgrade? If you can also try it locally, and report on the jira if all the unit tests still pass, that'd be a help!

Thanks
Nick