[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145743#comment-13145743 ]
Adam Nichols commented on PDFBOX-1000:
--------------------------------------

There were a few reasons why I wanted to rewrite the parser:

1.) I was tired of tweaking hacks in our parser to deal with non-conforming PDFs. Some of the issues have been resolved, but not all of them (e.g. parsing invalid objects which are never referenced).

2.) We should comply with the ISO 32000 standard. This makes sure we're handling things in the proper manner; being part of the solution, not part of the problem.

3.) The ISO way of parsing is more efficient. Its worst-case performance is as good as our best case. It generally uses less memory (which is especially important for mobile devices); it shouldn't need to parse all the objects in every case, so it'll use less CPU; and it doesn't always need to read all the bytes of the file, reducing disk I/O.

While this doesn't completely solve all of our problems (especially when it comes to non-conforming documents), it is a step in the right direction.

Also, I don't have any uncommitted code for the non-conforming parser. I've been very busy lately and haven't had a chance to go back and dig into it.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java,
> ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java,
> conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until
> it has read the EOF marker, the xref location, and trailer[1].  Once this is
> read, it will read in the xref table so it can locate other objects and
> revisions.  This also allows skipping objects which have been rendered
> obsolete (per the xref table)[2].
> It also allows the minimum amount of
> information to be read when the file is loaded, and then subsequent
> information will be loaded if and when it is requested.  This is all laid out
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new
> classes in order to accommodate the lazy reading which is a very different
> paradigm from the existing parser.  Using separate classes will also
> eliminate the possibility of regression bugs from making their way into the
> PDDocument or BaseParser classes.  Changes to existing classes will be kept
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular
> object"
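
For illustration only, the tail-first lookup described in the issue could be sketched roughly as below. This is not the code from the attached patch: the class name XrefLocator, the method findStartXref, and the idea of reading a fixed-size tail are all assumptions made for the example. It also assumes a well-formed tail in which "startxref", a decimal offset, and "%%EOF" appear in that order.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or the attached patch.
final class XrefLocator {

    // Returns the offset written after the last "startxref" keyword that
    // precedes the last "%%EOF" marker, or -1 if the tail does not contain one.
    static long findStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes stay aligned.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = text.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1; // non-conforming tail; a real parser would need a fallback
        }
    }

    // Reads only the last tailSize bytes of the file instead of the whole file.
    static long findStartXref(RandomAccessFile file, int tailSize) throws IOException {
        long length = file.length();
        int size = (int) Math.min(tailSize, length);
        byte[] tail = new byte[size];
        file.seek(length - size);
        file.readFully(tail);
        return findStartXref(tail);
    }
}
```

A caller would then seek to the returned offset to read the xref table, and load individual objects only when they are requested. The tail size to use (for example 1024 bytes) is a guess that a real implementation would have to tune or retry with a larger window.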