Hi, thanks for the reply. Your advice helped.

Best regards
Juraj Lonc

GI-BÓN, spol. s r.o.
Management Systems
Bratislavská 11
SK - 010 01 Žilina
Tel: +421-41-564 3437-8
Mobile: +421-907-815 147
Fax: +421-41-564 3439
e-mail: [email protected]
homepage: http://www.gi-bon.sk


From: Andreas Lehmkuehler <[email protected]>
To: [email protected],
Date: 23. 08. 2012 18:21
Subject: Re: problems with pdf parsing

Hi,

On 16.08.2012 16:11, [email protected] wrote:
> hi,
>
> i'm trying to load some sample pdf documents but only 1 of 4 is parsed by
> pdfbox without exception.
> adobe reader opens all those pdf documents without any sign of problems.
>
>
> public static void main(String[] args) throws Exception {
>     InputStream ins=TestGetTexts.class.getResourceAsStream(
>         "/034352.pdf"); // sample document
>
>     PDFParser parser=new PDFParser(ins);
>     parser.parse();
>     COSDocument cosDoc=parser.getDocument();
>     PDDocument pdDoc = new PDDocument(cosDoc);
>
> }

First of all, you should use one of the static load methods provided by
PDDocument:

    InputStream ins=TestGetTexts.class.getResourceAsStream("/034352.pdf");
    PDDocument pdDoc = PDDocument.load(ins);

> it throws exceptions at line "parser.parse();"
> what is wrong with that?

Hard to say without getting my hands on one of these PDFs. Have you tried the
new non-sequential parser (use loadNonSeq instead of load)?

> 16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 252 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 34 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> Exception in thread "main" java.io.IOException
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>     at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>     at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>     at test.TestGetTexts.main(TestGetTexts.java:20)
> Caused by: java.util.zip.DataFormatException: incorrect header check
>     at java.util.zip.Inflater.inflateBytes(Native Method)
>     at java.util.zip.Inflater.inflate(Inflater.java:238)
>     at java.util.zip.Inflater.inflate(Inflater.java:256)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>     ... 8 more
>
>
> the other pdf:
>
> 16.8.2012 16:08:44 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 4192 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 576 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 432 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 304 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 480 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 176 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 2096 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 137440 is wrong. Fall back to reading stream until 'endstream'.
> Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException:
> Could not push back 137440 bytes in order to reparse stream. Try
> increasing push back buffer using system property
> org.apache.pdfbox.baseParser.pushBackSize
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>     at test.TestGetTexts.main(TestGetTexts.java:20)
> Caused by: java.io.IOException: Push back buffer is full
>     at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
>     at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
>     at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
>     ... 3 more
>
>
>
> or:
>
> 16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 8 is wrong. Fall back to reading stream until 'endstream'.
> 16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 77788 is wrong. Fall back to reading stream until 'endstream'.
> Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException:
> Could not push back 77788 bytes in order to reparse stream. Try
> increasing push back buffer using system property
> org.apache.pdfbox.baseParser.pushBackSize
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>     at test.TestGetTexts.main(TestGetTexts.java:21)
> Caused by: java.io.IOException: Push back buffer is full
>     at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
>     at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
>     at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
>     ... 3 more
>
> best regards
> Juraj Lonc

BR
Andreas Lehmkühler
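
A side note on the two "Push back buffer is full" traces quoted in this thread: the innermost frame is java.io.PushbackInputStream.unread, which refuses to take back more bytes than its buffer has room for. The sketch below reproduces only that JDK behaviour and does not use PDFBox at all. The class name PushBackDemo and the 65536-byte buffer size are assumptions made for illustration (the size PDFBox actually uses is not stated in the thread); 77788 and 137440 are the byte counts from the quoted error messages.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical demo class; not part of PDFBox.
final class PushBackDemo {

    /**
     * Reads n bytes from a PushbackInputStream created with the given
     * pushback buffer size, then tries to unread all of them again.
     * Returns true if the unread succeeded, false if the JDK rejected it
     * with an IOException (reported as "Push back buffer is full").
     */
    static boolean canPushBack(int bufferSize, int n) throws IOException {
        byte[] data = new byte[n];
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), bufferSize);
        byte[] chunk = new byte[n];
        int read = in.read(chunk, 0, n);
        try {
            in.unread(chunk, 0, read);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed buffer size, smaller than the stream to be re-parsed.
        System.out.println(canPushBack(65536, 77788));
        // Buffer at least as large as the stream to be re-parsed.
        System.out.println(canPushBack(137440, 137440));
    }
}
```

This is why the exception message suggests raising the system property org.apache.pdfbox.baseParser.pushBackSize, for example by passing -Dorg.apache.pdfbox.baseParser.pushBackSize=200000 to the JVM (200000 is just an arbitrary value above 137440). Whether that is enough for a given file, and at what point PDFBox reads the property, would have to be checked against the PDFBox version in use.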

