I haven't had time to look into the specifics, but I can tell from the error you posted that your PDF is "non-conforming" (that is, it violates the PDF specification, so it's not a valid PDF). That isn't to say that some programs won't be able to read it anyway, but it is the root cause of this error. Specifically, it looks like an "endobj" keyword is missing: the parser expected "endobj" at the end of an object and did not find it. If I have some extra time, I'll debug through it and see exactly which section of the PDF is corrupt.
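In the meantime, if you want a rough first check on your side, something like the sketch below might help. It is not how PDFBox parses a file; it only counts "N G obj" headers and "endobj" keywords in the raw bytes and reports the difference. The class and method names (EndobjCheck, missingEndobj) are made up for this example, and binary stream data can contain these tokens by coincidence, so treat the result as a hint only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
public class EndobjCheck {
    // An indirect object starts with "<object number> <generation> obj".
    private static final Pattern OBJ_START = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");
    // ...and is expected to be closed by the keyword "endobj".
    private static final Pattern OBJ_END = Pattern.compile("\\bendobj\\b");

    // Counts non-overlapping matches of the pattern in the text.
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Returns (object headers) minus ("endobj" keywords).
    // A positive value hints that at least one "endobj" is missing.
    static int missingEndobj(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return count(OBJ_START, text) - count(OBJ_END, text);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EndobjCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("object headers minus endobj keywords: " + missingEndobj(data));
    }
}
```

Run it as "java EndobjCheck input.pdf". A result above zero would be consistent with the parser complaining about a missing "endobj", but it won't tell you which object is affected.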
If you can, go ahead and create an issue on JIRA [1] and attach this file. One of our goals is to make PDFBox read non-conforming PDFs as best it can without throwing an exception; this PDF could serve as a good example of some things we need to watch out for.

[1] https://issues.apache.org/jira/browse/PDFBOX

----
Thanks,
Adam

From: Alex Rodriguez Lopez <[email protected]>
To: [email protected]
Date: 12/07/2010 08:28
Subject: org.apache.pdfbox.io.PushBackInputStream on some PDFs

Hi list!

I'm using PDFBox through Apache Tika 0.8, and it gives me an error when parsing some files. The resulting file (after the exception is raised) is a partial text extraction: text from some pages at the beginning, followed by text from the end of the PDF, with the pages in the middle missing.

It happens only with some files, for example (~2MB):
http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf

It happens when called from a webapp as well as when called from the command line:

    java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt

I have tried different heap sizes too.
I'm getting:

WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)

What does this error mean? Can somebody tell me whether there is something wrong with the PDFs I'm giving PDFBox? I think this is most probably the case, but then what's wrong with the file I linked to at the beginning?

Thanks for any clarifications,
Alex

