[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206855#comment-15206855 ]
Tim Allison commented on TIKA-1285:
-----------------------------------

We'll see what Hudson says, but I just pushed the mods to Tika's 2.x branch as well.

A few notes:

1) XMPBox is currently designed to handle PDF/A. There were exceptions on roughly 40% of XMPs extracted from our test corpus. We'll stick with jempbox 1.8.x for now for XMP parsing. We may consider migrating to Adobe's xmpcore. If anyone wants to help make XMPBox more robust, that'd be a huge service. Ref: [this email|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56DF3F6F.8000201%40lehmi.de%3E]

2) PDFBox 2.0 has gotten rid of the classic parser, and now all parsing is done by the non-sequential parser. In my opinion, the PDFBox devs put a tremendous amount of work into making this new parser quite robust. However, for truncated or other truly damaged files, users may have some luck with the classic parser in 1.8.x.

3) PDFBox 2.0 no longer extracts TIFF files. See [this exchange|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201507.mbox/%3c559cca2c.7050...@t-online.de%3e], and consider adding the optional dependencies to handle TIFF, JPEG 2000 and ...

Other than those major points, in my opinion, PDFBox 2.0.0 should fix quite a few issues and is far more robust for bidi documents.

Many thanks to the PDFBox devs, especially [~lehmi], [~msahyoun] and [~tilman], for their work on PDFBox and for their collaboration on the eval process....more work remains on the latter.
:)

> Upgrade to PDFBox 2.0.0 when available
> --------------------------------------
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Jeremy Anderson
> Priority: Minor
> Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch,
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip,
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)