[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214111#comment-15214111 ]
Tim Allison edited comment on TIKA-1285 at 3/28/16 11:31 AM:
-------------------------------------------------------------

As I mentioned on the pdfbox dev list, I'm hesitant to waste your time by submitting issues for truncated files. If AR can't parse it, I wouldn't expect PDFBox to have much luck. However, the classic parser in 1.8 was able to get some text+metadata out of some truncated files.

If you go to my last pre-release-2.0.0 reports zip here: https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip?raw=true there's a file called textLostFromACausedByNewExceptionsInB.xlsx. That documents what text 1.8.11 (with the classic parser) was able to extract from files that 2.0.0 (with nonsequential parser) was not.

Nearly all of the "new" exceptions in 2.0.0 were caused by truncated files.

was (Author: talli...@mitre.org):
As I mentioned on the pdfbox dev list, I'm hesitant to waste your time by submitting issues for truncated files. If AR can't parse it, I wouldn't expect PDFBox to have much luck. However, the classic parser in 1.8 was able to get some text+metadata out of some truncated files.

If you go to my last pre-release-2.0.0 reports zip here: https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip?raw=true there's a file called textLostFromACausedByNewExceptionsInB.xlsx. That documents what text 1.8.11 (with the classic parser) was able to extract from files that 2.0.0 (with nonsequential parser) was not.

By Nearly all of the "new" exceptions in 2.0.0 were caused by truncated files.
> Upgrade to PDFBox 2.0.0 when available
> --------------------------------------
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Jeremy Anderson
> Fix For: 1.13
>
> Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch,
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip,
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)