Re: Conforming parser

martijn.list Tue, 07 Dec 2010 12:17:20 -0800

I'm sorry I cannot help you with the startxref issue but I have some
thoughts about parsing non-conforming PDFs.


> Any other suggestions, words of warning, etc.?  Like, how should I
> deal with violations of the spec?

I think it's important to gracefully handle non-conforming PDFs. Currently
PDFBox cannot handle certain PDFs that can be read by most PDF
readers. PDFBox should, IMHO, try its best to cope with PDF errors if
forceParsing is enabled.

I have added a JIRA entry
(https://issues.apache.org/jira/browse/PDFBOX-908) which contains some
patches to make PDFBox parse a large batch of commercial ebooks. I have
added a couple of example PDFs to the JIRA entry that try to mimic the
problems I found in real-life ebooks. The example PDFs cannot always be
opened by Acrobat because I created them by hand in a text editor.
The problems they replicate were copied from PDFs that could be
opened by Acrobat.

What I think is important is that in case of an exception, the parser
should not unread the data. If data is unread when an exception occurs,
the parser sees the same bytes again on its next attempt and can get
stuck in an infinite loop (for example, test-integer-too-large.pdf
results in an infinite loop on current PDFBox).
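
To illustrate the point, here is a minimal Java sketch. It is not PDFBox
code: the class and method names (NoUnreadOnError, readInt, parseAll) are
made up for this example, and it only assumes a parser that reads from a
java.io.PushbackInputStream. When the integer is too large, readInt leaves
the digits consumed, so a lenient caller that catches the exception
continues after the bad token instead of hitting it again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of PDFBox.
class NoUnreadOnError {

    /** Reads a run of ASCII digits; the digits stay consumed even if parsing fails. */
    static int readInt(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // push back only the single byte that ended the token
        }
        try {
            return Integer.parseInt(digits.toString());
        } catch (NumberFormatException e) {
            // Deliberately do NOT unread the digits here: the next read must
            // start after the bad token, otherwise the caller sees it again.
            throw new IOException("Invalid integer: " + digits, e);
        }
    }

    /** Lenient loop: skips tokens that fail to parse and returns the rest. */
    static List<Integer> parseAll(String data) throws IOException {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.US_ASCII)));
        List<Integer> values = new ArrayList<>();
        int c;
        while ((c = in.read()) != -1) {
            if (c < '0' || c > '9') {
                continue; // separator or any other byte this sketch ignores
            }
            in.unread(c); // hand the first digit back to readInt
            try {
                values.add(readInt(in));
            } catch (IOException e) {
                // The stream position is already past the bad token, so the
                // loop makes progress instead of retrying the same bytes.
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // The middle token does not fit in an int and is skipped.
        System.out.println(parseAll("12 99999999999999999999 7"));
    }
}
```

If the catch block in readInt pushed the digits back instead, parseAll
would read the same oversized token on every iteration and never finish.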

Kind regards,

Martijn Brinkers


On 12/07/2010 07:14 PM, a...@swmc.com wrote:
> I'm trying to write a conforming parser, which should help out with 
> various issues, and I'm hoping that someone can help me understand the PDF 
> spec so I can get this done exactly to the specifications.
> 
> I noticed in 7.5.5 of ISO 32000-1:2008 it says that the startxref location 
> is the byte-offset from "the decoded stream".  This seems strange that it 
> would be the *decoded* position if the first thing to do is to skip to the 
> end of the file and read the EOF flag, xref location and trailer info. 
> Does this mean that the expected process would be to read and decode the 
> entire stream and write it to a temp file (or hold it in memory) before 
> skipping to the end, reading the EOF flag, etc.?
> 
> If this is correct, I'll just read in the File/InputStream/URL/URI/etc. 
> and decode/write it to a RandomAccess object.  This should keep memory 
> usage low since I'll be working off the RandomAccess object, so a 500MB 
> PDF won't require 500MB of memory (and I have dealt with PDFs this large).
> 
> Finally, as a test, I ran WriteDecodedDoc on my test document and then I 
> expected the xref table to match up, but it still wasn't pointing to the 
> location I expected.  Is there any existing code in PDFBox which would 
> help me read/decode/write a PDF?
> 
> Any other suggestions, words of warning, etc.?  Like, how should I deal 
> with violations of the spec?  Log and ignore, throw exception, have an 
> object which deals with exceptions on a case-by-case basis?  It'd be 
> pretty cool to have an object which would be smart enough to look and see 
> "Read: '%%EO'; Expected: '%%EOF'" and not throw an exception, but if it 
> were "Read: 'obj 49 0'; Expected: '%%EOF'" it might throw an exception. 
> But I'm not going to go through the work of doing all that unless people 
> will actually find it useful.  Maybe the conforming PDF parser could just 
> throw an exception for non-conforming documents and then fall back to the 
> PDFParser?  I'm looking for input from the community here.  Let me know 
> what you think.
> 
> ---- 
> Thanks,
> Adam
> 
> 
> 


-- 
Djigzo open source email encryption
