Hi list!
I'm using PDFBox through Apache Tika 0.8, and parsing fails with an error
on some files. The output produced after the exception is raised is only a
partial text extraction: text from some pages at the beginning, followed
by text from the end of the PDF, with the pages in the middle missing.
It happens only with some files, for example this one (~2MB):
http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
It happens when called from a webapp as well as when called from the
command line:
java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt
I have tried with different heap sizes too.
I'm getting:
WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)
What does this error mean? Can somebody tell if there is something
wrong with the PDFs I'm giving PDFBox? I think this is most probably the
case, but then what's wrong with the file I linked to at the beginning?
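
In case it helps anyone look at the file: below is a rough check, independent of PDFBox and Tika, that counts the "N G obj" and "endobj" keywords in the raw bytes to see whether they balance. It is only a sketch under assumptions. The class name PdfObjectCheck and the method countMarkers are made up for illustration, and the counts are a heuristic at best, since binary stream data can contain these byte sequences by chance and objects stored inside object streams carry no such keywords. A mismatch would only hint that the file is malformed; it would not prove it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
class PdfObjectCheck {
    // An indirect object opens with "<number> <generation> obj", e.g. "12 0 obj".
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9])\\d+\\s+\\d+\\s+obj\\b");
    // ...and is expected to close with the keyword "endobj".
    private static final Pattern OBJ_END = Pattern.compile("\\bendobj\\b");

    // Counts non-overlapping matches of a pattern in the text.
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Returns {number of "N G obj" markers, number of "endobj" markers}.
    static int[] countMarkers(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data
        // is not mangled or dropped while decoding.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return new int[] { count(OBJ_START, text), count(OBJ_END, text) };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int[] counts = countMarkers(data);
        System.out.println("obj: " + counts[0] + ", endobj: " + counts[1]);
    }
}
```

It could be run as "java PdfObjectCheck input.pdf" after compiling. If the two numbers differ, that may point at an object that is missing its closing keyword, which is what the exception message seems to complain about.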
Thanks for any clarifications,
Alex