Re: org.apache.pdfbox.io.PushBackInputStream on some PDFs

Adam Thu, 09 Dec 2010 10:57:30 -0800

I haven't had time to look into the specifics, but I can tell from the error 
you posted that your PDF is "non-conforming" (i.e. it violates the PDF 
specification, so it's not a valid PDF).  This isn't to say that some 
programs won't be able to read it anyway, but that's the root cause of 
this error.  Specifically, it looks like an "endobj" line is missing.  
If I have some extra time, I'll debug through it and see exactly which 
section of the PDF is corrupt.
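
For illustration, here is a rough way to look for this kind of damage 
without PDFBox: scan the raw file for "N G obj" headers that are never 
followed by an "endobj" before the next header starts.  This is only a 
sketch under assumptions, not what PDFBox's parser actually does.  The 
class name EndobjCheck and the method findUnclosed are made up for this 
example, and the scan can be fooled by binary stream data that happens 
to contain these tokens.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
public class EndobjCheck {
    // "N G obj" opens an indirect object; "endobj" closes it.
    private static final Pattern TOKEN =
        Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b|\\bendobj\\b");

    /** Returns the "N G" labels of objects that were opened but never closed. */
    public static List<String> findUnclosed(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<String> unclosed = new ArrayList<>();
        String open = null; // label of the object currently open, if any
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // A new object header: the previous one was never closed.
                if (open != null) {
                    unclosed.add(open);
                }
                open = m.group(1) + " " + m.group(2);
            } else {
                // Matched "endobj": the current object is closed.
                open = null;
            }
        }
        if (open != null) {
            unclosed.add(open); // file ended inside an object
        }
        return unclosed;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String label : findUnclosed(data)) {
            System.out.println("object " + label + " has no matching endobj");
        }
    }
}
```

Run it with the path to the PDF as the only argument.  Any object it 
reports is only a candidate for the corrupt section and still needs to 
be checked by hand.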


If you can, go ahead and create an issue on JIRA[1] and attach this file. 
One of our goals is for PDFBox to read non-conforming PDFs as best it 
can without throwing an exception; this PDF could serve as a good 
example of some of the things we need to watch out for.

[1] https://issues.apache.org/jira/browse/PDFBOX

---- 
Thanks,
Adam



From: Alex Rodriguez Lopez <[email protected]>
To: [email protected]
Date: 12/07/2010 08:28
Subject: org.apache.pdfbox.io.PushBackInputStream on some PDFs



Hi list!

I'm using PDFBox through Apache Tika 0.8, and it gives me an error when 
parsing some files. The resulting output (after the exception is raised) 
is a partial text extraction: text from some pages at the beginning 
followed by text from the end of the PDF, with pages in the middle 
missing.

It happens only with some files, for example (~2MB):

http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf


It happens when called from a webapp as well as when called from the 
command line:

java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt

I have tried with different heap sizes too.

I'm getting:


WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)


What does this error mean? Can somebody tell me if there is something 
wrong with the PDFs I'm giving PDFBox? I think this is most probably the 
case, but then what's wrong with the file I linked to above?

Thanks for any clarifications,

Alex





