[jira] [Commented] (PDFBOX-1000) Conforming parser

Maruan Sahyoun (Commented) (JIRA) Sat, 07 Jan 2012 01:18:18 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181905#comment-13181905
 ]


Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------

I think I didn't do a good job of describing what I'm heading for. It's clear 
that PDFs need random access to get to the portions one is interested in, and 
it will be up to the parser to make sure that this is done. The lexer is only 
a helper to the parser when a certain section should be parsed. I think 
something like hasNext and next would be helpful there. 
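
To illustrate the hasNext/next idea, here is a minimal sketch of a forward-only lexer that reads from a plain InputStream and needs no random access itself. The class name SketchLexer is hypothetical and not part of PDFBox or the attached patch, and the tokenizing is deliberately simplified: it only splits on the six PDF whitespace characters and ignores delimiters, strings, and comments.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch: a forward-only lexer yielding whitespace-separated tokens. */
class SketchLexer implements Iterator<String> {
    private final InputStream in;
    private String pending; // token read ahead by hasNext(), or null

    SketchLexer(InputStream in) {
        this.in = in;
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            pending = readToken();
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String token = pending;
        pending = null;
        return token;
    }

    /** PDF whitespace: NUL, HT, LF, FF, CR, SP. */
    private static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Skips leading whitespace, then reads up to the next whitespace or end of stream. */
    private String readToken() {
        try {
            int c = in.read();
            while (c != -1 && isWhitespace(c)) {
                c = in.read();
            }
            if (c == -1) {
                return null; // no more tokens
            }
            StringBuilder sb = new StringBuilder();
            while (c != -1 && !isWhitespace(c)) {
                sb.append((char) c);
                c = in.read();
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The parser would decide where the stream starts; the lexer only moves forward from there.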

For example when parsing the xref table the parser will seek to the start and 
the lexer will start creating events/tokens from there which the parser can 
inspect - in this case until the parser gets to a token signaling the end of 
the trailer. Parsing the PDF header will be done in a similar manner. The 
parser seeks to the start of the file and then inspects the events/tokens 
delivered by the lexer. For an object the parser seeks to the start of the 
object using the information in the xref table and again inspects the 
events/tokens delivered by the lexer.
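
The division of labour described above could look roughly like the following sketch: the parser owns the RandomAccessFile and does the seek, then hands a forward-only stream to a lexer and inspects tokens until it sees the one it is waiting for. The class and method names are hypothetical, and java.util.Scanner is only a stand-in for the real lexer (it splits on Java whitespace, which is not exactly PDF whitespace).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical sketch: the parser seeks, a stand-in lexer delivers tokens. */
class SeekThenLexSketch {

    /**
     * Seeks to the given offset and collects tokens up to and including
     * stopToken, or until the end of the file if stopToken never appears.
     */
    static List<String> tokensFrom(RandomAccessFile raf, long offset, String stopToken)
            throws IOException {
        raf.seek(offset); // random access stays in the parser
        // The channel shares the file pointer, so this stream starts at offset.
        InputStream in = Channels.newInputStream(raf.getChannel());
        // Stand-in lexer. Not closed here, because closing it would close the file.
        Scanner lexer = new Scanner(in, "US-ASCII");
        List<String> tokens = new ArrayList<>();
        while (lexer.hasNext()) {
            String token = lexer.next();
            tokens.add(token);
            if (token.equals(stopToken)) {
                break;
            }
        }
        return tokens;
    }
}
```

The same method would serve for the header (offset 0) or for an object whose offset comes from the xref table; only the offset and the stop token change.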

Removing the dependency on RandomAccessFile was only meant for the lexer. The 
parser still needs the ability for random access. What I discussed with Timo 
Boehme was the possibility of using an InputStream as an input to the parser in 
addition to a file. If I understood him correctly he already implemented 
something which can be extended. But that's a different topic. For now the 
parser relies on RandomAccess and it will need a RandomAccess capability in the 
future.

I have to admit that writing such a parser is an ambitious project for me and 
I'm certain that there will be lots of ways to improve the code. But I do 
hope the general approach is better understood now and seems to be the right 
approach. That's why I wrote about the status. On the other hand I do know the 
PDF spec very well so at least I know what PDF is about :-)

                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, 
> conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
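
As a rough illustration of the "read from the end" step in the quoted description, the sketch below reads only the tail of the file, locates the last %%EOF marker, and parses the byte offset that follows the startxref keyword. The class name, method name, and tailSize parameter are assumptions made for this example and are not taken from the attached patch; a real parser would also have to cope with malformed files.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: locate the xref offset by reading the end of the file. */
class StartXrefSketch {

    /**
     * Returns the offset written after the last "startxref" keyword,
     * or -1 if the last tailSize bytes do not contain a usable one.
     */
    static long findStartXref(RandomAccessFile raf, int tailSize) throws IOException {
        long length = raf.length();
        int n = (int) Math.min(length, tailSize);
        byte[] tail = new byte[n];
        raf.seek(length - n); // jump straight to the tail
        raf.readFully(tail);
        // ISO-8859-1 maps every byte to one char, so indexes match byte positions.
        String text = new String(tail, StandardCharsets.ISO_8859_1);

        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = text.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

With the offset in hand, the parser can seek to the xref section and load other objects only when they are requested.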
