Determining if a PDF is conforming or not is neither quick nor easy.  The 
PDF spec (ISO 32000) is 756 pages long so, as you can imagine, there are a 
lot of things to check.
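
That said, a very cheap sanity check can at least rule out files that are obviously not PDFs or that were truncated in transit. The sketch below is only an illustration and is not part of PDFBox: the class name PdfQuickCheck and the method looksLikePdf are hypothetical. It assumes the "%PDF-" header sits at the very first byte and that the "%%EOF" marker appears within the last 1024 bytes. Passing this check says nothing about conformance; a file with a missing "endobj" would still pass.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a cheap pre-check, NOT a conformance test.
class PdfQuickCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** True if data starts with "%PDF-" and contains "%%EOF" in its last 1024 bytes. */
    static boolean looksLikePdf(byte[] data) {
        if (data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        // Assumption: strict check, header must be at offset 0.
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        // Only search the tail of the file for the end-of-file marker.
        int from = Math.max(0, data.length - 1024);
        return indexOf(data, EOF_MARKER, from) >= 0;
    }

    /** Returns the first index >= from where needle occurs in haystack, or -1. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = from; i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfQuickCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksLikePdf(data)
                ? "plausible PDF (not necessarily conforming)"
                : "not a PDF, or truncated");
    }
}
```

A check like this is only useful as a first filter before handing the file to the parser; anything it accepts can still fail during parsing.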

I see that Martjn already responded with an explanation and a patch.  I 
looked at the patch briefly, and it certainly makes the code simpler, but 
I'll have to see if it handles all the cases that the current code handles 
even when force parsing is off.  At a glance it appears that it does, but 
I want to confirm and add some test cases so these things are tested 
automatically.

---- 
Thanks,
Adam



From:
Alex Rodriguez Lopez <[email protected]>
To:
[email protected]
Date:
12/10/2010 02:23
Subject:
Re: org.apache.pdfbox.io.PushBackInputStream on some PDFs



Thanks Adam,

I was kind of expecting this to be a problem with the PDF file. Is there 
a quick and easy way to tell if a PDF is valid before going ahead with the 
parsing/text extraction? (Although I'm thinking now that maybe it is better 
to assume it is invalid when some exception like this one is thrown...)

https://issues.apache.org/jira/browse/PDFBOX-917
Tell me if anything is missing; I have never reported a JIRA issue before.

Alex

[email protected], 09-12-2010 18:56:
> I haven't had time to look into the specifics, but I can tell by the error
> you posted that your PDF is "non-conforming" (aka it violates the PDF
> specification... it's not a valid PDF).  This isn't to say that some
> programs won't be able to read it anyway, but that's the root cause of
> this error.  Specifically, it looks like there's an "endobj" line which is
> missing.  If I have some extra time, I'll debug through it and see exactly
> what section of the PDF is corrupt.
>
> If you can, go ahead and create an issue on JIRA[1] and attach this file.
> One of our goals is to make PDFBox be able to read non-conforming PDFs as
> best it can without throwing an exception; this PDF could serve as a good
> example on some things we need to watch out for.
>
> [1] https://issues.apache.org/jira/browse/PDFBOX
>
> ----
> Thanks,
> Adam
>
>
>
> From:
> Alex Rodriguez Lopez<[email protected]>
> To:
> [email protected]
> Date:
> 12/07/2010 08:28
> Subject:
> org.apache.pdfbox.io.PushBackInputStream on some PDFs
>
>
>
> Hi list!
>
> I'm using PDFBox through Apache Tika 0.8 and it gives me an error on
> some files when parsing. The resulting file (after the exception is
> raised) is a partial text extraction: text from some pages at the
> beginning followed by text from the end of the PDF, with pages missing
> in the middle...
>
> It happens only in some files, for example (~2MB):
>
> http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
>
>
> It happens when called from a webapp as well as when called from the
> command line:
>
> java -jar tika-app-0.8.jar -t < input.pdf > extracted.txt
>
> I have tried with different heap sizes too.
>
> I'm getting:
>
>
> WARN - Parsing Error, Skipping Object
> java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
>           at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
>           at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
>           at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>           at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
>           at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)
>
>
> What does this error mean? Can somebody tell if there is something
> wrong with the PDFs I'm giving PDFBox? I think this is most probably the
> case, but then what's wrong with the file I linked to at the beginning?
>
> Thanks for any clarifications,
>
> Alex
>
>
>
>
>
> - FHA 203b; 203k; HECM; VA; USDA; Conventional
> - Warehouse Lines; FHA-Authorized Originators
> - Lending and Servicing in over 45 States
> www.swmc.com   -  www.simplehecmcalculator.com
> Visit  www.swmc.com/resources   for helpful links on Training, Webinars, 
Lender Alerts and Submitting Conditions
>
> This email and any content within or attached hereto from Sun West 
Mortgage Company, Inc. is confidential and/or legally privileged. The 
information is intended only for the use of the individual or entity named 
on this email. If you are not the intended recipient, you are hereby 
notified that any disclosure, copying, distribution or taking any action 
in reliance on the contents of this email information is strictly 
prohibited, and that the documents should be returned to this office 
immediately by email. Receipt by anyone other than the intended recipient 
is not a waiver of any privilege. Please do not include your social 
security number, account number, or any other personal or financial 
information in the content of the email. Should you have any questions, 
please call (800) 453 7884.


