Re: submitting non-public PDFs for bugfixing

Maruan Sahyoun Tue, 17 Jul 2012 07:54:20 -0700

Hello Wolfgang,

Did you try the NonSequentialParser? It was added in 1.7 and improves
the parsing of PDF documents. See
https://issues.apache.org/jira/browse/PDFBOX-1199 for details.
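
For Problem 1, the warning seems to say that the parser found no cross-reference data at byte offset 116, the value recorded after the startxref keyword. If that is right, it can be checked without sharing the file. Below is a minimal sketch in plain Java (no PDFBox calls). The class and method names are hypothetical and only for illustration, and it assumes the file is small enough to read into memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of PDFBox.
class StartxrefCheck {

    /** Returns the number following the last "startxref" keyword, or -1 if there is none. */
    static long readStartxref(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i == start ? -1 : Long.parseLong(text.substring(start, i));
    }

    /** Describes what sits at the given byte offset. */
    static String describeTarget(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return "out of range";
        }
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int end = (int) Math.min(pdf.length, offset + 32);
        String head = text.substring((int) offset, end);
        if (head.startsWith("xref")) {
            return "xref table";
        }
        // An "N G obj" header here may be a cross-reference stream.
        if (head.matches("(?s)\\d+\\s+\\d+\\s+obj.*")) {
            return "object (possibly an xref stream)";
        }
        return "neither";
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = readStartxref(pdf);
        System.out.println("startxref = " + offset + ", target: " + describeTarget(pdf, offset));
    }
}
```

Running it as `java StartxrefCheck some.pdf` prints the recorded offset and what is found there. A result of "neither" or "out of range" would suggest that the recorded offset is wrong, and that detail could go into the bug description without attaching the PDF.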


With kind regards

Maruan


On 17.07.2012 at 16:09, Wolfgang Kronberg wrote:

> 
> Hello,
> 
> I have recently converted some 2500 PDF files to text using PDFBox
> 1.7.0. While doing so, I ran into two problems on a minority of the PDF
> files (some 5% are affected for each problem). Usually, I would now file
> a bug and attach a sample PDF so that the problem can be reproduced.
> 
> However, the PDFs in question are not public, and I am not entitled to
> publish them. Is there anyone to whom I could mail two affected PDF
> files, so that that person could pin down the actual bug for a good
> bug description while keeping the files themselves confidential?
> 
> In either case, here is what I see. In all cases, the affected document
> can be displayed with no problems in Adobe Reader.
> 
> Problem 1: The document is parsed as empty (no pages), although it in
> fact contains more than 50 pages full of text. Running PDFDebugger on
> this document produces this output (WARNUNG = WARNING):
> 17.07.2012 14:01:50 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
> WARNUNG: Did not found XRef object at specified startxref position 116
> 
> Problem 2: On attempting to parse the document, I get an IOException.
> PDFDebugger outputs the following on this document (SCHWERWIEGEND = SEVERE):
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> PDFDebugger failed with the following exception:
> java.io.IOException
>        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
>        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
>        at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
>        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1009)
>        at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:408)
>        at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:388)
>        at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:376)
>        at org.apache.pdfbox.PDFBox.main(PDFBox.java:48)
> Caused by: java.util.zip.DataFormatException: unknown compression method
>        at java.util.zip.Inflater.inflateBytes(Native Method)
>        at java.util.zip.Inflater.inflate(Unknown Source)
>        at java.util.zip.Inflater.inflate(Unknown Source)
>        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>        ... 14 more
> 
> Best Regards,
> Wolfgang
> 
> --
> Dipl.-Math.
> Wolfgang Kronberg
> Senior Software Architect
> 
> financial.com AG
> 
> (t) +49 89 318528-75
> (f) +49 89 318528-28
> e-mail: [email protected]
> http://www.financial.com
> 
> 
> financial.com AG
> 
> Munich head office/Hauptsitz München: Georg-Muche-Straße 3 | 80807 München | 
> Germany | Tel. +49 89 318528-0 | Google Maps: http://g.co/maps/4wcz
> Frankfurt branch office/Niederlassung Frankfurt: Messeturm | 
> Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany
> Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer | Dr. 
> Yann Samson | Matthias Wiederwach
> Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden 
> (Chairman/Vorsitzender)
> Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID 
> number/St.Nr.: DE205 370 553
