Hi,

On 16.08.2012 16:11, [email protected] wrote:
Hi,

I'm trying to load some sample PDF documents, but only 1 of 4 is parsed by
PDFBox without an exception.
Adobe Reader opens all of these PDF documents without any sign of problems.


public static void main(String[] args) throws Exception {
    // sample document
    InputStream ins = TestGetTexts.class.getResourceAsStream("/034352.pdf");

    PDFParser parser = new PDFParser(ins);
    parser.parse();
    COSDocument cosDoc = parser.getDocument();
    PDDocument pdDoc = new PDDocument(cosDoc);
}
First of all, you should use one of the static load methods provided by
PDDocument.

        InputStream ins=TestGetTexts.class.getResourceAsStream("/034352.pdf");
        PDDocument pdDoc = PDDocument.load(ins);


It throws exceptions at the line "parser.parse();".
What is wrong with that?
Hard to say without getting my hands on one of these PDFs. Have you tried the new non-sequential parser (use loadNonSeq instead of load)?

16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 252 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 34 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a
DataFormatException
Exception in thread "main" java.io.IOException
         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
         at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
         at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
         at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
         at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
         at test.TestGetTexts.main(TestGetTexts.java:20)
Caused by: java.util.zip.DataFormatException: incorrect header check
         at java.util.zip.Inflater.inflateBytes(Native Method)
         at java.util.zip.Inflater.inflate(Inflater.java:238)
         at java.util.zip.Inflater.inflate(Inflater.java:256)
         at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
         ... 8 more
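
The "incorrect header check" in the cause above is raised by java.util.zip.Inflater when the bytes it receives do not start with a valid zlib header. Given the preceding stream length warnings, one plausible reading is that the filter was handed the wrong byte range, though that cannot be confirmed without the file. The sketch below reproduces only the JDK behaviour, without PDFBox; the class InflateCheck and its methods are hypothetical names used for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class InflateCheck {

    /** Returns null if the data inflates cleanly, otherwise the exception message. */
    static String tryInflate(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: stop instead of looping forever
                }
            }
            return null;
        } catch (DataFormatException e) {
            return String.valueOf(e.getMessage());
        } finally {
            inflater.end();
        }
    }

    /** Compresses a small input into zlib format (buffer sized for short inputs only). */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        byte[] text = "stream data".getBytes(StandardCharsets.US_ASCII);
        // Properly compressed data inflates without complaint.
        System.out.println("zlib data: " + tryInflate(deflate(text)));
        // Bytes that are not zlib data are rejected by the Inflater.
        System.out.println("raw bytes: " + tryInflate(text));
    }
}
```

If a stream extracted from one of the failing PDFs is rejected in the same way, that would suggest the data reaching FlateFilter is not what the stream dictionary describes.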


the other pdf:

16.8.2012 16:08:44 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 4192 is wrong. Fall back to reading
stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 576 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 432 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 304 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 480 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 176 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2096 is wrong. Fall back to reading
stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 137440 is wrong. Fall back to reading
stream until 'endstream'.
Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 137440 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
         at test.TestGetTexts.main(TestGetTexts.java:20)
Caused by: java.io.IOException: Push back buffer is full
         at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
         at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
         at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
         ... 3 more
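
The "Push back buffer is full" cause comes from the JDK's java.io.PushbackInputStream, which cannot unread more bytes than its buffer holds. The sketch below shows that limit in isolation, without PDFBox. The class PushBackDemo is a hypothetical name, and the buffer size of 65536 is an assumption about the parser's default, not something stated in the log.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

class PushBackDemo {

    /**
     * Reads `count` bytes and tries to unread all of them.
     * Returns false if the push back buffer is too small.
     */
    static boolean canPushBack(int bufferSize, int count) throws IOException {
        byte[] source = new byte[count];
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(source), bufferSize)) {
            byte[] chunk = new byte[count];
            int read = in.read(chunk, 0, count);
            try {
                in.unread(chunk, 0, read);
                return true;
            } catch (IOException e) {
                // The JDK signals an exhausted push back buffer with an IOException.
                return false;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int streamLength = 137440;   // length reported in the warning above
        int assumedDefault = 65536;  // assumption: default buffer size, not from the log
        System.out.println("assumed default: " + canPushBack(assumedDefault, streamLength));
        System.out.println("larger buffer:   " + canPushBack(200000, streamLength));
    }
}
```

For PDFBox itself, the error message names the system property to raise, for example by starting the JVM with -Dorg.apache.pdfbox.baseParser.pushBackSize=200000 (the value is illustrative; it only needs to exceed the largest stream length reported in the warnings).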



or:

16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 8 is wrong. Fall back to reading stream
until 'endstream'.
16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 77788 is wrong. Fall back to reading
stream until 'endstream'.
Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 77788 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
         at test.TestGetTexts.main(TestGetTexts.java:21)
Caused by: java.io.IOException: Push back buffer is full
         at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
         at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
         at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
         ... 3 more

best regards
Juraj Lonc


BR
Andreas Lehmkühler
