[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182446#comment-13182446 ]
Timo Boehme commented on PDFBOX-1000:
-------------------------------------

a) While I think that 2 parsing modes are ok, it is important to distinguish between
   1) not strictly conforming, but parseable without loss/change of information (e.g. disallowed whitespace) and
   2) recovering from/working around an error with possible information change.
Thus we would have two states for relaxed parsing. Case 1 may be hidden but case 2 needs to be signaled to the user of an application.

b) Putting the logic into the objects sounds like a clean OO approach. Nevertheless I would keep it in the parser, because parsing needs access to environment settings (encryption) and other objects (e.g. object streams), which would be more complex if the objects had to know about this. Furthermore, classes of COS objects are easier to maintain if they are not cluttered by parsing code (in my opinion).

c) Absolutely fine with me. Maybe by looking at the methods in COSDocument one can find which information is needed, e.g. MediaBox.

d) A clear separation of workaround code paths with the possibility of extension/overwriting is a good idea.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java,
> ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java,
> conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until
> it has read the EOF marker, the xref location, and trailer[1]. Once this is
> read, it will read in the xref table so it can locate other objects and
> revisions. This also allows skipping objects which have been rendered
> obsolete (per the xref table)[2].
> It also allows the minimum amount of
> information to be read when the file is loaded, and then subsequent
> information will be loaded if and when it is requested. This is all laid out
> in the official PDF specification, ISO 32000-1:2008.
>
> Existing code will be re-used where possible, but this will require new
> classes in order to accommodate the lazy reading which is a very different
> paradigm from the existing parser. Using separate classes will also
> eliminate the possibility of regression bugs from making their way into the
> PDDocument or BaseParser classes. Changes to existing classes will be kept
> to a minimum in order to prevent regression bugs.
>
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular
> object"
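
For illustration, the "read from the end" step in the issue description could be sketched roughly as below. This is only a minimal sketch and not code from the attached patch: the class name TailScanner, its method names, and the 1024-byte tail window are assumptions made for this example. It relies only on the file layout given in ISO 32000-1 section 7.5.5, where the last lines of a file are the startxref keyword, the byte offset of the xref section, and the %%EOF marker.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: finds the xref offset by
// looking only at the last bytes of the file.
final class TailScanner {
    // Assumed window size; the spec does not mandate this value.
    private static final int TAIL_SIZE = 1024;

    /**
     * Extracts the number between the last "startxref" keyword and the
     * last "%%EOF" marker in the given tail bytes.
     * Returns -1 if either marker is missing or the offset is not numeric.
     */
    public static long parseStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes match.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = text.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Reads at most TAIL_SIZE bytes from the end of the file and parses them. */
    public static long readStartXref(RandomAccessFile file) throws IOException {
        long length = file.length();
        int size = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[size];
        file.seek(length - size);
        file.readFully(tail);
        return parseStartXref(tail);
    }
}
```

With the offset in hand, a parser along the lines proposed in the issue would seek to it, read the xref table, and load individual objects only when they are requested. A return value of -1 marks the point where the strict path ends and any workaround logic (points a and d in the comment) would have to take over.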