[
https://issues.apache.org/jira/browse/PDFBOX-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113409#comment-16113409
]
Tilman Hausherr edited comment on PDFBOX-3887 at 8/3/17 8:19 PM:
-----------------------------------------------------------------
Weird thing: the SEC has the same document as HTML, but the number of shares
(bottom of page 1) differs from the one in the PDF.
https://www.sec.gov/Archives/edgar/data/1139614/000151712615000204/form10q.htm
> Getting a "DataFormatException: invalid distance too far back" exception for
> the attached file
> ----------------------------------------------------------------------------------------------
>
> Key: PDFBOX-3887
> URL: https://issues.apache.org/jira/browse/PDFBOX-3887
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.7
> Environment: Windows 10 64-bit, Ubuntu 14.04 64-bit.
> java version "1.8.0_141"
> Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
> Reporter: Harun Reşit Zafer
> Labels: extraction, parsing
> Attachments: non-contract_00025.pdf
>
>
> PDFBox throws the following exception:
> {code:java}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
> at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
> at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
> at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
> at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:55)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectStream(COSParser.java:847)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:753)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:678)
> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:638)
> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:236)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:940)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:888)
> at com.diligen.parser.pdf.PdfBoxHelper.getDocumentWithLineSegments(PdfBoxHelper.java:131)
> ... 7 more
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
> at java.util.zip.Inflater.inflateBytes(Native Method)
> at java.util.zip.Inflater.inflate(Inflater.java:259)
> at java.util.zip.Inflater.inflate(Inflater.java:280)
> at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
> at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
> ... 20 more
> {code}
> If there is no quick solution for this bug, is there a workaround? Can I
> somehow catch the exception and take some action?
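
On the question of catching the exception: the trace shows the
DataFormatException arriving wrapped in an IOException, so one possible
workaround is to catch the IOException at the call site and inspect its cause
chain. The following is only a sketch under that assumption, not a fix for the
underlying bug. The names CorruptStreamGuard, IoLoader, isCorruptFlateData and
loadOrSkip are hypothetical and are not part of PDFBox. To keep the example
runnable without PDFBox on the classpath, main() provokes a
DataFormatException with java.util.zip.Inflater directly. In real code the
loader would presumably be something like () -> PDDocument.load(file).

```java
import java.io.IOException;
import java.util.Optional;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

class CorruptStreamGuard {

    /** Hypothetical loader type: anything that produces a value or fails with IOException. */
    interface IoLoader<T> {
        T load() throws IOException;
    }

    /** Walks the cause chain and reports whether any link is a DataFormatException. */
    static boolean isCorruptFlateData(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof DataFormatException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the loader. If it fails with an IOException caused by corrupt
     * deflate data, returns an empty Optional so the caller can skip or log
     * the file. Any other IOException is rethrown unchanged.
     */
    static <T> Optional<T> loadOrSkip(IoLoader<T> loader) throws IOException {
        try {
            return Optional.ofNullable(loader.load());
        } catch (IOException e) {
            if (isCorruptFlateData(e)) {
                return Optional.empty();
            }
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a PDF load: inflate bytes that are not valid zlib data.
        Optional<String> result = loadOrSkip(() -> {
            Inflater inflater = new Inflater();
            try {
                inflater.setInput(new byte[] {1, 2, 3, 4});
                inflater.inflate(new byte[16]);
            } catch (DataFormatException e) {
                // Same wrapping pattern as seen in the stack trace above.
                throw new IOException(e);
            } finally {
                inflater.end();
            }
            return "decoded";
        });
        System.out.println(result.isPresent() ? "loaded" : "skipped corrupt input");
    }
}
```

A caller could use the empty result to log the file name and move on to the
next document. Whether skipping is acceptable depends on the application,
since the affected file is not parsed at all in that case.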