Hi list!

I'm using PDFBox through Apache Tika 0.8, and it throws an error while parsing some files. The output produced after the exception is raised is only a partial text extraction: text from some pages at the beginning, followed by text from the end of the PDF, with the pages in the middle missing.

It happens only with some files, for example this one (~2 MB):

http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf

It happens when called from a webapp as well as when called from the command line:

java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt

I have tried with different heap sizes too.

I'm getting:


WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)


What does this error mean? Can somebody tell me whether there is something wrong with the PDFs I'm giving PDFBox? I think this is most probably the case, but then what is wrong with the file I linked to above?
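
In case it helps anyone look into this, here is a rough sketch of a sanity check that could be run on such a file. It is not something PDFBox or Tika does; the class name ObjBalanceCheck and its helper methods are made up for illustration, and it assumes Java 7 or later. It counts the "obj" and "endobj" keywords in the raw bytes; a mismatch would hint at a truncated or damaged object, which is what the expected='endobj' message seems to point at. It is only a heuristic, since the same byte sequences can also turn up inside binary stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or Tika.
final class ObjBalanceCheck {

    // Counts non-overlapping occurrences of token in data.
    static int count(byte[] data, byte[] token) {
        int hits = 0;
        int i = 0;
        while (i + token.length <= data.length) {
            boolean match = true;
            for (int j = 0; j < token.length; j++) {
                if (data[i + j] != token[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                hits++;
                i += token.length;
            } else {
                i++;
            }
        }
        return hits;
    }

    // Number of "endobj" keywords found in the raw bytes.
    static int countEndObj(byte[] data) {
        return count(data, "endobj".getBytes(StandardCharsets.US_ASCII));
    }

    // Number of "obj" keywords that are not part of an "endobj" keyword.
    static int countObjOpeners(byte[] data) {
        int all = count(data, "obj".getBytes(StandardCharsets.US_ASCII));
        return all - countEndObj(data);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ObjBalanceCheck input.pdf");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int opened = countObjOpeners(data);
        int closed = countEndObj(data);
        System.out.println("obj: " + opened + ", endobj: " + closed);
        if (opened != closed) {
            System.out.println("Counts differ; the file may contain a damaged object.");
        }
    }
}
```

Equal counts would not prove the file is fine, and unequal counts would not prove it is broken, but a large difference would at least suggest the problem is in the PDF rather than in the parser.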

Thanks for any clarifications,

Alex
