[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214111#comment-15214111
 ] 

Tim Allison commented on TIKA-1285:
-----------------------------------

As I mentioned on the pdfbox dev list, I'm hesitant to waste your time by 
submitting issues for truncated files.  If Adobe Reader (AR) can't parse a 
file, I wouldn't expect PDFBox to have much luck.

However, the classic parser in 1.8 was able to get some text+metadata out of 
some truncated files.

If you go to my last pre-release-2.0.0 reports zip here: 
https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip?raw=true

there's a file called textLostFromACausedByNewExceptionsInB.xlsx.  That 
documents what text 1.8.11 (with the classic parser) was able to extract from 
files that 2.0.0 (with the nonsequential parser) could not.  Nearly all of the 
"new" exceptions in 2.0.0 were caused by truncated files.

> Upgrade to PDFBox 2.0.0 when available
> --------------------------------------
>
>                 Key: TIKA-1285
>                 URL: https://issues.apache.org/jira/browse/TIKA-1285
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Jeremy Anderson
>             Fix For: 1.13
>
>         Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, 
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, 
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to 
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
