[ 
https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Nichols updated PDFBOX-798:
--------------------------------

    Attachment: PDFBOX-798.patch

If anyone has a better idea than a series of nested IFs, please let me know.  I 
just didn't want to read the whole line in case it's not end[obj|stream].
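
One possible alternative to nested IFs, sketched here only as an illustration and 
not taken from the attached patch: read at most nine bytes (the length of 
"endstream"), compare them against both keywords, and push them back so nothing 
is consumed.  The class and method names below are hypothetical and are not part 
of PDFBox; the sketch assumes a PushbackInputStream whose pushback buffer holds 
at least nine bytes, and it uses StandardCharsets, which needs Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox or of the patch.
class EndMarkerPeek {
    private static final byte[] ENDOBJ =
            "endobj".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Returns true if the upcoming bytes start with "endobj" or "endstream".
     * Every byte read is pushed back, so the stream position is unchanged.
     */
    static boolean isAtEndMarker(PushbackInputStream in) throws IOException {
        byte[] buf = new byte[ENDSTREAM.length];
        int n = 0;
        // Read up to 9 bytes; stop early at end of stream.
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        if (n > 0) {
            in.unread(buf, 0, n); // restore exactly what was read
        }
        return startsWith(buf, n, ENDOBJ) || startsWith(buf, n, ENDSTREAM);
    }

    // Prefix comparison only: it does not check what follows the keyword.
    private static boolean startsWith(byte[] buf, int len, byte[] keyword) {
        if (len < keyword.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (buf[i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience wrapper for trying the check on an in-memory string. */
    static boolean startsWithEndMarker(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(bytes), ENDSTREAM.length);
        try {
            return isAtEndMarker(in);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the check is a plain prefix comparison, a token such as "endobjX" would 
also match; a real parser would probably want to verify that whitespace or a 
delimiter follows the keyword.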

> Better handle out of spec PDFs
> ------------------------------
>
>                 Key: PDFBOX-798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-798
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: 32-bit Windows Vista, Java 1.5, HEAD tag of PDFBox
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>             Fix For: 1.3.0
>
>         Attachments: PDFBOX-798.patch
>
>
> I came across another out-of-spec issue which causes PDFBox to crash.  Here's 
> the object:
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endstream
> endobj
> There are numerous issues here.  The mediabox doesn't have a closing right 
> square bracket, there's no ">>" to end the dictionary, and there's an 
> "endstream" stuck in there for no apparent reason.  This is something I 
> actually found in the wild, but I do not know whether it comes from a bug in 
> the creation program, data corruption, or something else.  I do know, however, 
> that Adobe Reader parses it without crashing.  Since this is not a 
> conforming PDF, the result is undefined, so crashing (which is what PDFBox 
> will eventually do, when trying to process the next object in the file) is a 
> perfectly acceptable thing to do.
> However, I'd like PDFBox to treat the array as complete when it sees 
> endstream, ignore the rogue endstream, and then recognize that the object 
> has ended when it sees "endobj".  I'm actually going to 
> go one step further and also accept the same object even if endstream or 
> endobj is missing.  In addition to the above object, I also tested it with 
> these objects:
> % ends with endobj, without the endstream
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endobj
> % ends with endstream, without the endobj
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endstream
> % properly ended array, dictionary and object (aka conforming PDF)
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360 ]
> >>
> endobj
> Although this change will only affect PDFs which do not conform to the spec, 
> I want to put the patch up for review before committing it to SVN since it is 
> a modification to BaseParser.java.  If I do not hear any objections/concerns 
> in the next few days, I'll go ahead and commit it.
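
The recovery described above can be sketched roughly as follows.  This is a 
simplified illustration, not the code in PDFBOX-798.patch and not 
BaseParser.java: the class, method, and field names are hypothetical, it works 
on whitespace-separated tokens in a string rather than on a stream, and it 
assumes the array holds only numbers.  An array ends at "]" as usual, but also 
ends (unclosed) when "endobj" or "endstream" shows up, or when the input runs 
out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration; not the actual PDFBox parser.
class LenientArraySketch {

    /** Outcome of a lenient parse of a numeric array. */
    static final class Result {
        final List<Double> values = new ArrayList<Double>();
        boolean closed;   // true only if a "]" token was seen
        String stoppedAt; // "endobj" or "endstream" if one ended the array
    }

    /**
     * Parses tokens such as "[ 0 0 610.560 783.360" and stops at "]",
     * at "endobj"/"endstream", or at the end of the input.
     * Simplification: "[" and "]" must be separated by whitespace.
     */
    static Result parseNumberArray(String text) {
        Result result = new Result();
        String[] tokens = text.trim().split("\\s+");
        int i = 0;
        if (tokens.length > 0 && tokens[0].equals("[")) {
            i++; // skip the opening bracket
        }
        for (; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.isEmpty()) {
                continue; // happens only for empty input
            }
            if (token.equals("]")) {
                result.closed = true; // conforming case
                break;
            }
            if (token.equals("endobj") || token.equals("endstream")) {
                result.stoppedAt = token; // out-of-spec case: array never closed
                break;
            }
            result.values.add(Double.parseDouble(token));
        }
        return result;
    }
}
```

With the first broken object from the description, the four MediaBox numbers 
are collected, closed stays false, and stoppedAt is "endstream"; the caller 
would then skip that keyword and look for "endobj".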

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.