Hello Wolfgang,

have you tried the NonSequentialParser, which was added in 1.7 and improves the parsing of PDF documents? See https://issues.apache.org/jira/browse/PDFBOX-1199 for details.
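
To see why a non-sequential parser depends on the startxref pointer, here is a minimal, library-free sketch of its first step: read the offset written after the last "startxref" keyword and look at what sits there. This is not PDFBox code. The class StartXrefCheck and its methods are hypothetical names made up for illustration, and the check is deliberately crude. In PDFBox itself the corresponding entry point should be PDDocument.loadNonSeq(File, RandomAccess), but I am quoting that from memory, so please verify it against the 1.7 Javadoc.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical diagnostic helper; not part of PDFBox. */
final class StartXrefCheck {

    /** Returns the offset written after the last "startxref" keyword, or -1 if none is found. */
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    /** Classifies the bytes at the given offset as "xref-table", "indirect-object" or "unknown". */
    static String describeTarget(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return "unknown";
        }
        int from = (int) offset;
        int to = Math.min(pdf.length, from + 32);
        String head = new String(pdf, from, to - from, StandardCharsets.ISO_8859_1);
        if (head.startsWith("xref")) {
            return "xref-table";
        }
        // A cross-reference stream is stored as an indirect object: "<num> <gen> obj".
        if (head.matches("(?s)\\d+\\s+\\d+\\s+obj.*")) {
            return "indirect-object";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = findStartXref(pdf);
        System.out.println("startxref = " + offset + ", target = " + describeTarget(pdf, offset));
    }
}
```

If this prints "unknown" for one of the affected files, the startxref offset in that file does not point at a cross-reference section, which would fit a warning about a missing XRef object at the startxref position.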

With kind regards
Maruan

On 17.07.2012 at 16:09, Wolfgang Kronberg wrote:

> Hello,
>
> I have recently converted some 2500 PDF files to text using PDFBox
> 1.7.0. While doing so, I ran into two problems on a minority of the PDF
> files (some 5% are affected for each problem). Usually, I would now file
> a bug and attach a sample PDF so that the problem can be reproduced.
>
> However, the PDFs in question are not public, and I am not entitled to
> publish them. Is there anyone to whom I could mail two affected PDF
> files, so that that person could nail down the actual bug for a good
> bug description while keeping the actual files secret?
>
> In either case, here is what I see. In all cases, the affected document
> can be displayed with no problems in Adobe Reader.
>
> Problem 1: The document is parsed to be empty (no pages), although it in
> fact contains > 50 pages full of text. Running PDFDebugger on this
> document produces this output (WARNUNG = WARNING):
>
> 17.07.2012 14:01:50 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
> WARNUNG: Did not found XRef object at specified startxref position 116
>
> Problem 2: On attempting to parse the document, I get an IOException.
> PDFDebugger outputs the following on this document (SCHWERWIEGEND = SEVERE):
>
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> PDFDebugger failed with the following exception:
> java.io.IOException
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>     at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>     at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1009)
>     at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:408)
>     at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:388)
>     at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:376)
>     at org.apache.pdfbox.PDFBox.main(PDFBox.java:48)
> Caused by: java.util.zip.DataFormatException: unknown compression method
>     at java.util.zip.Inflater.inflateBytes(Native Method)
>     at java.util.zip.Inflater.inflate(Unknown Source)
>     at java.util.zip.Inflater.inflate(Unknown Source)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>     ... 14 more
>
> Best Regards,
> Wolfgang
>
> --
> Dipl.-Math.
> Wolfgang Kronberg
> Senior Software Architect
>
> financial.com AG
>
> (t) +49 89 318528-75
> (f) +49 89 318528-28
> e-mail: [email protected]
> http://www.financial.com
>
> financial.com AG
>
> Munich head office/Hauptsitz München: Georg-Muche-Straße 3 | 80807 München |
> Germany | Tel. +49 89 318528-0 | Google Maps: http://g.co/maps/4wcz
> Frankfurt branch office/Niederlassung Frankfurt: Messeturm |
> Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany
> Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer |
> Dr. Yann Samson | Matthias Wiederwach
> Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden
> (Chairman/Vorsitzender)
> Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID
> number/St.Nr.: DE205 370 553

