[ https://issues.apache.org/jira/browse/PDFBOX-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Nichols updated PDFBOX-798:
--------------------------------

    Attachment: PDFBOX-798.patch

If anyone has a better idea than a series of nested IFs, please let me know. I just didn't want to read the whole line in case it's not end[obj|stream].

> Better handle out of spec PDFs
> ------------------------------
>
>                 Key: PDFBOX-798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-798
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: 32-bit Windows Vista, Java 1.5, HEAD tag of PDFBox
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>             Fix For: 1.3.0
>
>         Attachments: PDFBOX-798.patch
>
>
> I came across another out-of-spec issue which causes PDFBox to crash. Here's
> the object:
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endstream
> endobj
> There are numerous issues here. The MediaBox doesn't have a closing right
> square bracket, there's no ">>" to end the dictionary, and there's an
> "endstream" stuck in there for no apparent reason. This is something I
> actually found in the wild; however, I do not know whether it's a bug in the
> creation program, some data corruption, or how else this happened. I do
> know that Adobe Reader parses it without crashing. Since this is not a
> conforming PDF, the result is undefined, so crashing (which is what PDFBox
> will eventually do, when trying to process the next object in the file) is a
> perfectly acceptable thing to do.
> However, I'd like to make PDFBox able to detect that the array is
> completed when it sees endstream, then ignore the rogue endstream, and then
> know that the object has ended when it sees "endobj". I'm actually going to
> go one step further and also accept the same object even if endstream or
> endobj is missing. In addition to the above object, I also tested it with
> these objects:
> % endobj, without the endstream
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endobj
> % endstream, without the endobj
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360
> endstream
> % properly ended array, dictionary and object (aka conforming PDF)
> 5 0 obj
> <</Type /Page
> /Parent 6 0 R
> /MediaBox [ 0 0 610.560 783.360 ]
> >>
> endobj
> Although this change will only affect PDFs which do not conform to the spec,
> I want to put the patch up for review before committing it to SVN since it is
> a modification to BaseParser.java. If I do not hear any objections/concerns
> in the next few days, I'll go ahead and commit it.
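
One possible alternative to a series of nested IFs is a bounded lookahead: read at most as many bytes as the longest marker ("endstream"), compare, and push everything back so the stream position is unchanged. The sketch below is only an illustration of that idea and is not the attached patch or the actual BaseParser code; the class `EndMarkerPeek` and its methods `wrap`, `peekEndMarker`, and `readArrayLeniently` are hypothetical names, and it assumes the parser reads from a `java.io.PushbackInputStream` whose pushback buffer holds at least nine bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): bounded lookahead for end markers. */
final class EndMarkerPeek {
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Pushback capacity needed: the length of the longest marker. */
    static final int LOOKAHEAD = ENDSTREAM.length;

    /** Wraps raw bytes in a stream with enough pushback room for peekEndMarker. */
    static PushbackInputStream wrap(byte[] data) {
        return new PushbackInputStream(new ByteArrayInputStream(data), LOOKAHEAD);
    }

    /**
     * Returns "endstream", "endobj", or null. Every byte read is pushed back,
     * so the stream position is unchanged. A real parser would probably also
     * check that a delimiter follows the marker; this sketch only tests the prefix.
     */
    static String peekEndMarker(PushbackInputStream in) throws IOException {
        byte[] buf = new byte[LOOKAHEAD];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        if (n > 0) {
            in.unread(buf, 0, n);
        }
        if (startsWith(buf, n, ENDSTREAM)) {
            return "endstream";
        }
        if (startsWith(buf, n, ENDOBJ)) {
            return "endobj";
        }
        return null;
    }

    /**
     * Reads whitespace-separated tokens (the text after '[') until ']' is found,
     * the input ends, or an end marker appears at the start of a token.
     * The marker itself is left unread for the caller.
     */
    static List<String> readArrayLeniently(PushbackInputStream in) throws IOException {
        List<String> items = new ArrayList<String>();
        StringBuilder token = new StringBuilder();
        while (true) {
            if (token.length() == 0 && peekEndMarker(in) != null) {
                break; // the array was never closed; treat the marker as its end
            }
            int c = in.read();
            boolean done = c < 0 || c == ']';
            if (done || Character.isWhitespace(c)) {
                if (token.length() > 0) {
                    items.add(token.toString());
                    token.setLength(0);
                }
                if (done) {
                    break;
                }
            } else {
                token.append((char) c);
            }
        }
        return items;
    }

    private static boolean startsWith(byte[] buf, int len, byte[] marker) {
        if (len < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (buf[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Fed the text that follows `[` in the broken object, `readArrayLeniently` should stop at the rogue `endstream` and leave it in the stream, so the caller can skip it and then look for `endobj`. The lookahead never reads more than nine bytes, which keeps the original goal of not consuming the whole line when it is not an end marker.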