[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206855#comment-15206855
 ] 

Tim Allison commented on TIKA-1285:
-----------------------------------

We'll see what Hudson says, but I just pushed the mods to Tika's 2.x branch as 
well.

A few notes:
1) XMPBox is currently designed to handle PDF/A.  It hit exceptions on 
roughly 40% of the XMPs extracted from our test corpus, so we'll stick with 
jempbox 1.8.x for now for XMP parsing.  We may consider migrating to Adobe's 
xmpcore.  
If anyone wants to help make XMPBox more robust, that'd be a huge service.  
Ref: [this 
email|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56DF3F6F.8000201%40lehmi.de%3E]

2) PDFBox 2.0 has gotten rid of the classic parser, and now all parsing is done 
by the non-sequential parser.  In my opinion, the PDFBox devs put a tremendous 
amount of work into making this new parser quite robust.  However, for 
truncated or other truly damaged files, users may have some luck with the 
classic parser in 1.8.x.

3) PDFBox 2.0 no longer extracts TIFF files.  See [this 
exchange|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201507.mbox/%3c559cca2c.7050...@t-online.de%3e],
 and consider adding the optional dependencies to handle TIFF, JPEG 2000 and ...
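
For point 3, a rough sketch (not part of the patch) of how one might check 
whether the optional image readers are actually on the classpath.  It uses 
only the JDK's ImageIO registry.  The class name ImageReaderCheck and the 
format names "tiff" and "jpeg2000" are my assumptions; a given plugin may 
register its reader under a different name.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
class ImageReaderCheck {

    // Returns true if at least one ImageIO reader is registered
    // under the given informal format name.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers =
                ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Assumed format names; adjust to whatever your plugins register.
        String[] formats = {"tiff", "jpeg2000"};
        for (String format : formats) {
            System.out.println(format + " reader available: "
                    + hasReader(format));
        }
    }
}
```

If a format reports false, the matching optional dependency is probably 
missing from the classpath.  The result depends on the JDK version and on 
which plugins are installed, so treat it as a quick diagnostic only.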

Other than those major points, in my opinion, PDFBox 2.0.0 should fix quite a 
few issues and is far more robust for bidi documents.

Many thanks to the PDFBox devs, especially [~lehmi], [~msahyoun] and [~tilman], 
for their work on PDFBox and for their collaboration on the eval process.  More 
work remains on the latter. :)

> Upgrade to PDFBox 2.0.0 when available
> --------------------------------------
>
>                 Key: TIKA-1285
>                 URL: https://issues.apache.org/jira/browse/TIKA-1285
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Jeremy Anderson
>            Priority: Minor
>         Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, 
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, 
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to 
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
