Hello,

I have recently converted about 2,500 PDF files to text using PDFBox
1.7.0. While doing so, I ran into two problems on a minority of the PDF
files (each problem affects roughly 5% of them). Normally, I would now
file a bug and attach a sample PDF so that the problem can be reproduced.

However, the PDFs in question are not public, and I am not permitted to
publish them. Is there anyone to whom I could mail two affected PDF
files, so that this person could pin down the actual bug for a good bug
description while keeping the files themselves confidential?

In either case, here is what I see. In both cases, the affected document
displays without problems in Adobe Reader.

Problem 1: The document is parsed as empty (no pages), although it
actually contains more than 50 pages full of text. Running PDFDebugger
on this document produces the following output (WARNUNG = WARNING):
17.07.2012 14:01:50 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
WARNUNG: Did not found XRef object at specified startxref position 116
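
To see what actually sits at the reported startxref position, something
like the following could be used. This is only a rough sketch that does
not use PDFBox at all; the class and method names (StartxrefProbe,
findStartxref, classify) are made up for illustration. It assumes the
whole file fits in memory, and it only looks at the last "startxref"
keyword in the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: inspects the startxref offset of a PDF.
final class StartxrefProbe {

    /** Returns the number following the last "startxref" keyword, or -1 if none is found. */
    static long findStartxref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int pos = text.lastIndexOf("startxref");
        if (pos < 0) {
            return -1;
        }
        int i = pos + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    /** Roughly classifies the bytes found at the given offset. */
    static String classify(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return "out of range";
        }
        int from = (int) offset;
        int to = Math.min(pdf.length, from + 32);
        String head = new String(pdf, from, to - from, StandardCharsets.ISO_8859_1);
        if (head.startsWith("xref")) {
            return "xref table";
        }
        // "<number> <generation> obj" at this position may be a cross-reference stream.
        if (head.matches("(?s)\\d+\\s+\\d+\\s+obj.*")) {
            return "object (possibly an xref stream)";
        }
        return "other";
    }
}
```

Reading the file with java.nio.file.Files.readAllBytes and then calling
classify(bytes, findStartxref(bytes)) would show whether the offset
points at an xref table, at an object, or at something else entirely.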

Problem 2: On attempting to parse the document, I get an IOException.
PDFDebugger outputs the following on this document (SCHWERWIEGEND = SEVERE):
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
PDFDebugger failed with the following exception:
java.io.IOException
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
        at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1009)
        at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:408)
        at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:388)
        at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:376)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:48)
Caused by: java.util.zip.DataFormatException: unknown compression method
        at java.util.zip.Inflater.inflateBytes(Native Method)
        at java.util.zip.Inflater.inflate(Unknown Source)
        at java.util.zip.Inflater.inflate(Unknown Source)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
        ... 14 more
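
As far as I understand the zlib format (RFC 1950), the "unknown
compression method" message means that the first byte of the data handed
to the Inflater does not announce deflate (method 8) in its lower four
bits. A sketch of how the raw bytes of the failing stream could be
checked outside PDFBox follows; the class and method names
(ZlibHeaderCheck, describeHeader, tryInflate) are made up for
illustration, and extracting the raw stream bytes from the PDF is not
covered here.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: checks whether bytes look like zlib data.
final class ZlibHeaderCheck {

    /** Inspects the two-byte zlib header (CMF, FLG) as described in RFC 1950. */
    static String describeHeader(byte[] data) {
        if (data.length < 2) {
            return "too short";
        }
        int cmf = data[0] & 0xFF;
        int flg = data[1] & 0xFF;
        // CMF and FLG, read as a 16-bit big-endian number, must be a multiple of 31.
        if (((cmf << 8) | flg) % 31 != 0) {
            return "header checksum mismatch";
        }
        // The lower four bits of CMF hold the compression method; 8 means deflate.
        int method = cmf & 0x0F;
        if (method != 8) {
            return "compression method " + method + " is not deflate";
        }
        return "plausible zlib header";
    }

    /** Tries to inflate the data and reports the byte count or the failure message. */
    static String tryInflate(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buffer = new byte[4096];
            long total = 0;
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // Stop on truncated input or a required preset dictionary.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                total += n;
            }
            return "inflated " + total + " bytes";
        } catch (DataFormatException e) {
            return "DataFormatException: " + e.getMessage();
        } finally {
            inflater.end();
        }
    }
}
```

If describeHeader already complains about the first two bytes, the
stream data presumably does not start where the parser thinks it does,
or it is not Flate-encoded at all; I cannot tell which from the log
alone.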

Best Regards,
Wolfgang

--
Dipl.-Math.
Wolfgang Kronberg
Senior Software Architect

financial.com AG

(t) +49 89 318528-75
(f) +49 89 318528-28
e-mail: [email protected]
http://www.financial.com


