[jira] [Commented] (PDFBOX-1000) Conforming parser

Maruan Sahyoun (Commented) (JIRA) Sun, 08 Jan 2012 23:16:18 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182398#comment-13182398
 ]


Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------

I'm starting the work on the ConformingPDFParser now and there are some 
questions/ideas I would like to discuss:

a) As discussed earlier, there will be two parsing modes, of which the strict 
one will conform to the ISO spec. In strict mode I'm planning to check full 
compliance with the spec for the areas I'm touching, e.g. make sure that the 
(text-based) xref table entries are really 20 bytes long. Is that fine?
b) When constructing COS objects such as COSString, the parser can verify that 
the data is according to the spec and complain if it is not. The alternative 
would be to put that check into the COS object, e.g. COSxxx.newInstance(). Both 
have their benefits. Putting it into the parser means that all parsing is done 
in a central place. Putting it into the COS object would mean that we have the 
reading and writing logic in the object itself, so it's fully aware of its 
lifecycle. I tend to put it into the parser initially but think that it should 
be moved into the COS object at a later stage. WDYT?
c) I would like to defer the parsing of an object until the moment it is 
requested. This will apply to most objects, except for the very basic PDF 
objects needed to provide some very basic information, e.g. number of pages, 
metadata, encryption... Is that fine? Which information would need to be 
available from the start?
d) I'm thinking about putting code which is a workaround for buggy PDFs into 
some special methods - recoverXXXError. E.g. the current PDFParser has code for 
the case where the xref table entries have three numbers instead of two 
(PDFBOX-474). The benefit will be that workarounds are clearly visible and not 
hidden within the main parsing code, and that we are offering a solution which 
can be extended. WDYT? Initially some exits will be made available; the code 
itself will come at a later date.
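
To make points a) and d) concrete, here is a minimal sketch of how a strict 
20-byte xref entry check and a separate recovery method could fit together. 
Everything in it is an assumption made for illustration: the class name 
XrefEntrySketch, the Entry holder and the method recoverXrefEntryError are 
hypothetical and are not taken from the attached patch or from the existing 
PDFParser. The recovery shown only tolerates missing padding; it is not the 
PDFBOX-474 workaround.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not PDFBox code: strict xref entry check plus a recovery exit.
final class XrefEntrySketch {

    /** Hypothetical holder for one parsed xref table entry. */
    static final class Entry {
        final long offset;
        final int generation;
        final boolean inUse;

        Entry(long offset, int generation, boolean inUse) {
            this.offset = offset;
            this.generation = generation;
            this.inUse = inUse;
        }
    }

    /** Parses one raw entry; in strict mode anything non-conforming is rejected. */
    static Entry parse(byte[] raw, boolean strict) {
        if (isConforming(raw)) {
            String s = new String(raw, StandardCharsets.US_ASCII);
            return new Entry(Long.parseLong(s.substring(0, 10)),
                    Integer.parseInt(s.substring(11, 16)),
                    s.charAt(17) == 'n');
        }
        if (strict) {
            throw new IllegalArgumentException("xref entry is not a conforming 20-byte entry");
        }
        // Workarounds live in a clearly named method, outside the main parsing path.
        return recoverXrefEntryError(raw);
    }

    /** Layout: 10-digit offset, SP, 5-digit generation, SP, 'n' or 'f', 2-byte EOL. */
    static boolean isConforming(byte[] raw) {
        if (raw.length != 20) {
            return false;
        }
        for (int i = 0; i < 10; i++) {
            if (!isDigit(raw[i])) {
                return false;
            }
        }
        if (raw[10] != ' ' || raw[16] != ' ') {
            return false;
        }
        for (int i = 11; i < 16; i++) {
            if (!isDigit(raw[i])) {
                return false;
            }
        }
        if (raw[17] != 'n' && raw[17] != 'f') {
            return false;
        }
        // Allowed end-of-line sequences: SP CR, SP LF, or CR LF.
        return (raw[18] == ' ' && (raw[19] == '\r' || raw[19] == '\n'))
                || (raw[18] == '\r' && raw[19] == '\n');
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }

    /** Hypothetical recovery exit: accepts entries whose zero padding or EOL is wrong. */
    static Entry recoverXrefEntryError(byte[] raw) {
        String[] tokens = new String(raw, StandardCharsets.US_ASCII).trim().split("\\s+");
        if (tokens.length < 3 || !(tokens[2].equals("n") || tokens[2].equals("f"))) {
            throw new IllegalArgumentException("xref entry could not be recovered");
        }
        try {
            return new Entry(Long.parseLong(tokens[0]),
                    Integer.parseInt(tokens[1]),
                    tokens[2].equals("n"));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("xref entry could not be recovered", e);
        }
    }
}
```

With this split, the strict path never runs recovery code, and each new 
workaround would get its own recoverXXXError method that can be reviewed or 
overridden separately.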
                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, 
> conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
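
The first step of the description above, reading from the end of the file to 
find the xref location, could be sketched roughly as follows. This is an 
illustration under stated assumptions, not the attached ConformingPDFParser: 
the class name StartXrefSketch, the method findStartXref and the 1024-byte 
tail window are all hypothetical, and the file is taken as a byte array only 
to keep the example self-contained (a real parser would seek in a random 
access file instead).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not PDFBox code: locate the xref offset from the file's tail.
final class StartXrefSketch {

    // Assumed size of the tail window that is searched; not mandated by the source.
    static final int TAIL_SIZE = 1024;

    /** Returns the byte offset that follows the last "startxref" keyword. */
    static long findStartXref(byte[] pdf) {
        int from = Math.max(0, pdf.length - TAIL_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so indexes stay aligned.
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IllegalArgumentException("missing %%EOF marker");
        }
        // Search backward from the EOF marker for the keyword that precedes it.
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IllegalArgumentException("missing startxref keyword");
        }
        String number = tail.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("startxref is not followed by an offset", e);
        }
    }
}
```

Only the tail of the file is inspected here; the returned offset is where the 
xref table would then be read, and all other objects can stay unread until 
they are requested.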

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
