[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025092#comment-13025092 ]
Thomas Chojecki commented on PDFBOX-1000:
-----------------------------------------

Hi,

1) You're right, in some cases it would be easier to separate values with spaces. The object reader does the following (some abstract code):

    readValue {
        char c1 = readChar;
        if (c1 == SPACE) {
            readValue; // maybe something like skipSpaces at the beginning would be better, because
                       // someone can overflow the stack with whitespace due to the recursion.
        } else if (c1 == '/') {
            // unreadByte; // Not needed, because readCOSName can read until whitespace or '/'
            //             // and remove the leading '/' if it exists.
            readCOSName;
        } else if (c1 == INTEGER) {
            unreadByte;
            readCOSIntOrRef;
        } else if (c1 == '(') {
            // unreadByte; // see readCOSName
            readCOSString;
        }
        ... // and so on. '<' is tricky: it can start a dictionary or a hex string, so we need
            // to read one more byte to see whether it is a new dict or a string.
    }

    readCOSIntOrRef {
        buffer b1;
        char c1;
        while ((c1 = readChar) != '/' or '>') {
            if (c1 == SPACE) {
                new COSRef(buffer, readChar);
                readChar; readChar; // to skip the R
            }
            writeBuffer(c1);
        }
        new COSInteger(writeBuffer);
    }

    readObj {
        readCOSName
        readValue
    }

Writing a parser for PDF is one of the hardest things. The spec is inaccurate and leaves the developer room to interpret it in whatever way he believes is spec-conformant.

2) One way is to read the last xxx bytes (maybe 100) and search for "startxref". After finding it, we can jump to the xref table / stream and parse to the end. Alternatively, we read the last xxx bytes and try to find the "trailer". I would prefer the first approach. After parsing the first xref table and the trailer, we should check whether there is another one in the document, parse it as well, and skip references that have already been parsed.

3) A good parser needs time, and we should keep the old implementation in case the user can't parse all of his documents with the new one. The next thing is seek time. I can't imagine that parsing a document lazily is quicker than parsing it completely from the beginning.
If the parser needs to jump between objects, this costs a lot of time on a hard disk. The last question is how much of the document the user needs to parse to get the information he needs to work with it. If he needs to read 50% or 70%, we might as well parse the whole document.

a) That idea is good. We can grab minimal information without parsing the document completely and then, on the first request for e.g. a page, parse the information that is needed.

My plan is to take a look after work, debug some documents, show which documents may fail, and fix them.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: ConformingPDDocument.java, ConformingPDFParser.java,
> ConformingPDFParserTest.java, XrefEntry.java, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until
> it has read the EOF marker, the xref location, and trailer[1].  Once this is
> read, it will read in the xref table so it can locate other objects and
> revisions.  This also allows skipping objects which have been rendered
> obsolete (per the xref table)[2].  It also allows the minimum amount of
> information to be read when the file is loaded, and then subsequent
> information will be loaded if and when it is requested.  This is all laid out
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new
> classes in order to accommodate the lazy reading which is a very different
> paradigm from the existing parser.  Using separate classes will also
> eliminate the possibility of regression bugs from making their way into the
> PDDocument or BaseParser classes.  Changes to existing classes will be kept
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular
> object"

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
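
For illustration, the first approach from point 2 of the comment (read the last bytes of the file and search for "startxref") could look roughly like the sketch below. It is only a sketch under assumptions: the class and method names (StartXrefSketch, parseStartXref, findStartXref) are hypothetical and are not part of PDFBox or of the attached ConformingPDFParser, and the size of the tail to read is left to the caller. It only locates the offset; parsing the xref table or stream at that offset is not shown.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not a PDFBox API.
final class StartXrefSketch {

    private static final String KEYWORD = "startxref";

    // Returns the number that follows the LAST "startxref" keyword in the
    // given tail of a file, or -1 if no keyword or no digits are found.
    static long parseStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe here.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf(KEYWORD);
        if (idx < 0) {
            return -1;
        }
        int i = idx + KEYWORD.length();
        // Skip the whitespace / end-of-line between the keyword and the offset.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no digits after it
        }
        return Long.parseLong(text.substring(start, i));
    }

    // Reads at most tailSize bytes from the end of the file and looks for
    // the startxref offset in them.
    static long findStartXref(RandomAccessFile file, int tailSize) throws IOException {
        long length = file.length();
        int n = (int) Math.min(tailSize, length);
        byte[] tail = new byte[n];
        file.seek(length - n);
        file.readFully(tail);
        return parseStartXref(tail);
    }
}
```

A caller would open the document with new RandomAccessFile(path, "r"), call findStartXref(file, 100) (or a larger tail if the keyword is not found), and seek to the returned offset if it is not -1. Searching for the last occurrence is a deliberate choice in this sketch, so that a tail containing more than one "startxref" yields the most recent one.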