[jira] [Commented] (PDFBOX-1000) Conforming parser

Maruan Sahyoun (Commented) (JIRA) Fri, 13 Jan 2012 14:07:06 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185902#comment-13185902
 ]


Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------

thanks for the review and the effort taken. 

# I will fix the while loops - thanks for the hint. 
# I'm more in favor of putting the structure validation into the parser. 
The reason is that to check for compliance I need the 'raw' data read by the 
lexer rather than the 'parsed' data, e.g. to check that the offset field of an 
xref entry is exactly 10 digits. If the lexer parses the 'raw' number and 
returns, say, a COSInteger, that information is gone. In addition, 
reading/skipping the stream data can be done more efficiently after parsing the 
dictionary's Length entry, and the lexer doesn't know about that. So my current 
favorite is that the lexer only creates tokens and does not ensure validity, 
create COSObjects etc. - WDYT? 
# I fully agree that JUnit test cases will be needed, and I'm about to create 
some basic cases. 
# I'm very interested in ensuring that parsing is done as quickly as possible 
without compromising the goal of ensuring/validating conformance to the spec. I 
don't think the current implementation will offer the best performance, simply 
because there will be a lot of unbuffered read() calls. I think this should be 
improved by reading more data into a small buffer and then working on that 
buffer. Because of the random-access nature of PDFs we might read too many 
bytes into the buffer, but the overall performance would still benefit, as I 
think it's very rare that only single bytes are needed before doing another 
seek to a completely different location. WDYT?
# There will be code which handles PDFs that are not in line with the ISO 
spec, and I do trust that the new parser will offer better results than the 
current one, but putting all current workarounds in will take some time, as one 
needs to scan through the sources to identify them. What I'm planning to do is 
to provide some exit points within the code that parses individual sections, 
where the workarounds can be put. This way they stand out and are separated 
from the 'clean' parsing. In addition, one might also override them. 
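
To illustrate the second point, here is a minimal sketch of a parser-side check 
that works on raw tokens. The names RawToken and RawXrefEntryCheck are 
hypothetical and are not taken from the attached patch; the only fixed parts 
are the field widths from ISO 32000-1, 7.5.4 (10-digit offset, 5-digit 
generation number, keyword n or f). Once the lexer has turned "0000000017" into 
a number, the width can no longer be checked, so the conversion happens only 
after validation.

```java
/** Hypothetical raw token: the text exactly as the lexer read it from the file. */
final class RawToken {
    final String text;

    RawToken(String text) {
        this.text = text;
    }

    /** True if the token consists of exactly expectedLength ASCII digits. */
    boolean isDigits(int expectedLength) {
        if (text.length() != expectedLength) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical parser-side validation of one xref entry (illustration only). */
final class RawXrefEntryCheck {

    /** Checks the field widths and the in-use/free keyword on the raw tokens. */
    static boolean isConformingEntry(RawToken offset, RawToken generation, RawToken type) {
        return offset.isDigits(10)
                && generation.isDigits(5)
                && (type.text.equals("n") || type.text.equals("f"));
    }

    /** Converts to a number only after the raw form has been validated. */
    static long parseOffset(RawToken offset) {
        if (!offset.isDigits(10)) {
            throw new IllegalArgumentException("offset is not 10 digits: " + offset.text);
        }
        return Long.parseLong(offset.text);
    }
}
```

A lenient mode could still accept an offset such as "17" by skipping the width 
check, which is where one of the workaround exit points could go.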
                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, 
> XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
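
The first step of the description above (reading from the end of the file to 
find the xref location) can be sketched as follows. This is only an 
illustration under assumptions, not the code from the attached patch: the class 
name TailReader and the 1024-byte tail size are hypothetical, and a real parser 
would also have to handle files where the marker lies further from the end.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that locates the startxref offset at the end of a PDF. */
final class TailReader {

    /** Assumed number of bytes to inspect at the end of the file. */
    static final int TAIL_SIZE = 1024;

    /** Reads at most TAIL_SIZE bytes from the end of the file with a single seek. */
    static byte[] readTail(RandomAccessFile file) throws IOException {
        long length = file.length();
        int size = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[size];
        file.seek(length - size);
        file.readFully(tail);
        return tail;
    }

    /**
     * Returns the byte offset that follows the last "startxref" keyword before
     * the last "%%EOF" marker, or -1 if the tail does not have that shape.
     */
    static long findStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes match byte positions.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String between = text.substring(keyword + "startxref".length(), eof).trim();
        if (!between.matches("[0-9]+")) {
            return -1;
        }
        try {
            return Long.parseLong(between);
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a usable offset
        }
    }
}
```

With the offset in hand, the parser can seek directly to the xref table and 
load other objects only when they are requested.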

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
