Thanks Adam,
I was kind of expecting this to be a problem with the PDF file. Is there
a quick and easy way to tell if a PDF is valid before going ahead with the
parsing/text extraction? (Although I'm now thinking it may be better to
assume it is invalid when an exception like this one is thrown...)
I have filed the issue here:
https://issues.apache.org/jira/browse/PDFBOX-917
Tell me if anything is missing; I have never reported a JIRA issue before.
Alex
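
A rough pre-check for the validity question above, as a minimal sketch only. It uses nothing but the Java standard library and tests two markers the PDF format defines: the "%PDF-" header at the start of the file and the "%%EOF" marker near the end. The class name PdfSanityCheck, the method looksLikePdf and the 1024-byte tail window are hypothetical choices for illustration, not part of PDFBox or Tika. Passing this check does not mean the file is conforming: a missing "endobj" like the one in this thread would not be detected. Also note that the trace in this thread is logged as a WARN with "Skipping Object", which suggests extraction carries on after the error, so catching exceptions alone may not flag such files either.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PdfSanityCheck {

    // Assumed size of the trailing window searched for %%EOF (illustrative value).
    private static final int TAIL_WINDOW = 1024;

    /** Checks raw bytes: "%PDF-" must start the head, "%%EOF" must occur in the tail. */
    public static boolean looksLikePdf(byte[] head, byte[] tail) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String h = new String(head, StandardCharsets.ISO_8859_1);
        String t = new String(tail, StandardCharsets.ISO_8859_1);
        return h.startsWith("%PDF-") && t.contains("%%EOF");
    }

    /** Reads only the first 8 bytes and the last TAIL_WINDOW bytes of the file. */
    public static boolean looksLikePdf(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            if (length < 8) {
                return false; // too short to hold even the header
            }
            byte[] head = new byte[8];
            file.readFully(head);

            int tailSize = (int) Math.min(TAIL_WINDOW, length);
            byte[] tail = new byte[tailSize];
            file.seek(length - tailSize);
            file.readFully(tail);

            return looksLikePdf(head, tail);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            String verdict = looksLikePdf(path) ? "plausible PDF" : "not a PDF, or truncated";
            System.out.println(path + ": " + verdict);
        }
    }
}
```

Run it as "java PdfSanityCheck input.pdf" before handing the file to the parser. The check is deliberately strict about the header position; some readers tolerate a few junk bytes before "%PDF-", so a rejected file is not necessarily unreadable.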
[email protected], 09-12-2010 18:56:
I haven't had time to look into the specifics, but I can tell by the error
you posted that your PDF is "non-conforming" (aka it violates the PDF
specification... it's not a valid PDF). This isn't to say that some
programs won't be able to read it anyway, but that's the root cause of
this error. Specifically, it looks like there's an "endobj" line which is
missing. If I have some extra time, I'll debug through it and see exactly
what section of the PDF is corrupt.
If you can, go ahead and create an issue on JIRA[1] and attach this file.
One of our goals is for PDFBox to read non-conforming PDFs as best it can
without throwing an exception; this PDF could serve as a good example of
some things we need to watch out for.
[1] https://issues.apache.org/jira/browse/PDFBOX
----
Thanks,
Adam
From: Alex Rodriguez Lopez <[email protected]>
To: [email protected]
Date: 12/07/2010 08:28
Subject: org.apache.pdfbox.io.PushBackInputStream on some PDFs
Hi list!
I'm using PDFBox through Apache Tika 0.8, and on some files it gives me
an error when parsing. The resulting file (after the exception is raised)
is a partial text extraction: text from some pages at the beginning
followed by text from the end of the PDF, with pages in the middle
missing.
It happens only with some files, for example (~2MB):
http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
It happens when called from a webapp as well as when called from the
command line:
java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt
I have tried with different heap sizes too.
I'm getting:
WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)
What does this error mean? Can somebody tell me if there is something
wrong with the PDFs I'm giving PDFBox? I think this is most probably the
case, but then what's wrong with the file I linked to at the beginning?
Thanks for any clarifications,
Alex