[ 
https://issues.apache.org/jira/browse/PDFBOX-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114486#comment-16114486
 ] 

Tilman Hausherr commented on PDFBOX-3887:
-----------------------------------------

The last commit makes PDFBox even more lenient: it now accepts the total loss 
of an object stream. I'm now able to extract some of the text, but extraction 
still fails after some time. If the purpose of your application is to extract 
as much as possible, e.g. for indexing, then do the extraction page by page and 
don't stop on exceptions. Apache Tika has such an option.
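
The page-by-page idea could be sketched roughly as below. This is only an 
illustration: LenientExtraction, PageTextSource, Result and extractAll are 
hypothetical names made up for this example and are not part of PDFBox or Tika, 
and the stand-in source in main() merely simulates one corrupt page.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LenientExtraction {

    /** Hypothetical callback: returns the text of one page (1-based); may throw. */
    public interface PageTextSource {
        String textOfPage(int page) throws Exception;
    }

    /** Hypothetical holder for the extracted text and the pages that failed. */
    public static final class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<Integer> failedPages = new ArrayList<>();
    }

    /** Extracts every page on its own, so one bad page does not abort the run. */
    public static Result extractAll(int pageCount, PageTextSource source) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.text.append(source.textOfPage(page));
            } catch (Exception e) {
                // Don't stop on exceptions: remember the page and carry on.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in source (assumption): page 2 is simulated as corrupt.
        Result r = extractAll(3, page -> {
            if (page == 2) {
                throw new IOException("simulated corrupt stream");
            }
            return "page " + page + "\n";
        });
        System.out.print(r.text);
        System.out.println("failed pages: " + r.failedPages);
    }
}
```

With PDFBox 2.0.x the callback could presumably be backed by a 
PDFTextStripper: call setStartPage(page) and setEndPage(page), then 
getText(document), taking the page count from PDDocument.getNumberOfPages(). 
This only helps once PDDocument.load() itself succeeds.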

This commit may be reverted if we find regressions in the pre-release mass 
tests (which include many corrupt files).

You'll find a snapshot here in a few minutes:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.8-SNAPSHOT/
(at the bottom, look for the date)

> Getting a "DataFormatException: invalid distance too far back" exception for 
> the attached file
> ----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3887
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3887
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Windows 10 64-bit, Ubuntu 14.04 64-bit. 
> java version "1.8.0_141" 
> Java(TM) SE Runtime Environment (build 1.8.0_141-b15) 
> Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
>            Reporter: Harun Reşit Zafer
>              Labels: extraction, parsing
>         Attachments: non-contract_00025.pdf
>
>
> PdfBox throws the following exception:
> {code:java}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
>       at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
>       at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:55)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectStream(COSParser.java:847)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:753)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:678)
>       at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:638)
>       at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:236)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:940)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:888)
>       at com.diligen.parser.pdf.PdfBoxHelper.getDocumentWithLineSegments(PdfBoxHelper.java:131)
>       ... 7 more
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
>       at java.util.zip.Inflater.inflateBytes(Native Method)
>       at java.util.zip.Inflater.inflate(Inflater.java:259)
>       at java.util.zip.Inflater.inflate(Inflater.java:280)
>       at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
>       ... 20 more
> {code}
> If there is no quick solution for this bug, is there a workaround? Can I 
> somehow catch the exception and take some action?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
