[jira] [Commented] (PDFBOX-1000) Conforming parser

Adam Nichols (Commented) (JIRA) Wed, 11 Jan 2012 18:38:14 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184677#comment-13184677
 ]


Adam Nichols commented on PDFBOX-1000:
--------------------------------------

First and foremost, I like the good documentation, comments and references to 
the PDF spec.  Bugs are easy to fix, a lack of documentation is not, so it's 
good to have this up front.

Secondly, I like the design.  Being able to read just one or two bytes and know 
what the next object will be is great.  I don't really care for "while(true)" 
loops such as is seen in processKeyword() (it could simply be 
"while(!isDelimiter(ch) && !isWhitespace(ch))" with the unread(ch); after the 
loop).  It's of no functional difference, but putting the stop condition in the 
loop just makes sense.

As for structure validation, it seems like it'd make sense to do that here, 
since this is what's dealing with the structure of objects.  The parser may 
also validate structure, but it will be looking for different things (for 
example, an indirect object which refers to an object that doesn't exist; or an 
series of indirect objects which form a loop, such as where a -> b -> c -> a; 
or other logical errors).  As discussed, this would only be enforced if it was 
in "strict mode".  I can see you've already taken care to deal with 
non-conforming PDFs (e.g. processStream(boolean keepData) where it checks to 
see if the end of line marker is there and makes sure that rawData is set 
properly in either case.

The PDFLexer looks very good.  About the only suggestion I can think of would 
be to add some JUnit test cases.  I remember that the current parser has some 
strange code for detecting the endstream, but it made a huge performance 
difference, so I'd suggest testing the Lexer with a file which contains a lot 
of streams in it to make sure that everything is okay.  Also, I know there are 
some PDFs in the JUnit tests that are non-conforming (sometimes in very major 
ways, not just missing newlines, but things like "[" with no matching "]").  As 
absurd as these may seem, these are things which I've personally seen in the 
wild and things which Adobe Reader is able to recover from, so it'd be 
preferable to deal with them at least as well as the current implementation.  
If I remember correctly, there's also some code in the current parser about 
dealing with missing/malformed end of file marker (i.e. "%%EOF").  I can't 
recall if there's an example PDF & JUnit test for that one, but it not it's 
easy to mangle/remove the "%%EOF" at the end (or in the middle of a file in the 
case of a PDF which has been incrementally updated).
                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, 
> XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PDFBOX-1000) Conforming parser

Reply via email to