org.apache.pdfbox.io.PushBackInputStream on some PDFs

Alex Rodriguez Lopez Tue, 07 Dec 2010 08:27:45 -0800

Hi list!

I'm using PDFBox through Apache Tika 0.8, and it gives me an error on some files when parsing. The resulting file (after the exception is raised) is a partial text extraction: text from some pages at the beginning, followed by text from the end of the PDF, with the pages in the middle missing.


It happens only with some files, for example this one (~2 MB):

http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf

It happens when called from a webapp as well as when called from the command line:


java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt

I have tried with different heap sizes too.

I'm getting:


WARN - Parsing Error, Skipping Object

java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)

What does this error mean? Can somebody tell me whether there is something wrong with the PDFs I'm giving PDFBox? I think this is most probably the case, but then what is wrong with the file I linked to above?


Thanks for any clarifications,

Alex
