[ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181228#comment-13181228
 ] 

Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------

Just before the weekend, here is another update on my progress and a short 
description of my approach.

There will be a new (PDF) lexer which works similarly to the StAX XML stream 
reader, going through the PDF and producing events. One can walk through them 
using hasNext() and next(). Events are produced only for very basic PDF objects 
such as comments, string literals, keywords and numbers. Using getData(), the 
content of the token belonging to the event can be retrieved in its raw format. 
The lexer uses lazy loading, so the data making up the token is only 
constructed when getData() is called; otherwise next() skips to the next 
event without keeping the data. Cursor movement is always forward.
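
To make the pull model concrete, here is a minimal sketch of such a lexer. 
It is not the code from the upcoming patch: the class name PullLexerSketch, 
the Event enum and the choice to return a String from getData() are 
assumptions made for illustration, and the token rules are heavily 
simplified (every delimiter other than '(' and '%' becomes a one-character 
keyword). What it shows is the lazy part: next() only records offsets, and 
getData() builds the token content on demand.

```java
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;

// Hypothetical sketch of a StAX-style pull lexer; not the actual PDFBox code.
public class PullLexerSketch {
    public enum Event { COMMENT, STRING_LITERAL, NUMBER, KEYWORD }

    private final byte[] input;
    private int pos = 0;        // cursor, only ever moves forward
    private int tokenStart = 0; // offsets of the current token
    private int tokenEnd = 0;

    public PullLexerSketch(byte[] input) { this.input = input; }

    public boolean hasNext() {
        while (pos < input.length && isWhitespace(input[pos])) pos++;
        return pos < input.length;
    }

    // Advances to the next token and records its offsets; no data is copied.
    public Event next() {
        if (!hasNext()) throw new NoSuchElementException();
        tokenStart = pos;
        byte c = input[pos];
        Event event;
        if (c == '%') {                      // comment runs to end of line
            while (pos < input.length && input[pos] != '\n' && input[pos] != '\r') pos++;
            event = Event.COMMENT;
        } else if (c == '(') {               // literal string, parentheses may nest
            int depth = 0;
            do {
                byte b = input[pos++];
                if (b == '\\') pos++;        // skip the escaped character
                else if (b == '(') depth++;
                else if (b == ')') depth--;
            } while (depth > 0 && pos < input.length);
            event = Event.STRING_LITERAL;
        } else if (isDelimiter(c)) {         // simplification: one-char keyword
            pos++;
            event = Event.KEYWORD;
        } else {                             // number or keyword
            boolean numeric = true;
            while (pos < input.length && !isWhitespace(input[pos]) && !isDelimiter(input[pos])) {
                byte b = input[pos++];
                numeric &= (b >= '0' && b <= '9') || b == '+' || b == '-' || b == '.';
            }
            event = numeric ? Event.NUMBER : Event.KEYWORD;
        }
        tokenEnd = Math.min(pos, input.length);
        pos = tokenEnd;
        return event;
    }

    // Lazy: the token content is only materialised when somebody asks for it.
    public String getData() {
        return new String(input, tokenStart, tokenEnd - tokenStart, StandardCharsets.ISO_8859_1);
    }

    private static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static boolean isDelimiter(byte b) {
        return "()<>[]{}/%".indexOf(b) >= 0;
    }
}
```

A caller loops with while (lexer.hasNext()) { lexer.next(); ... } and calls 
getData() only for the events it cares about, so skipped tokens cost no 
allocation.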

I'm now working on the next component, SimpleParser (perhaps it should be 
called BaseParser later), which will extend the lexer. Taking the same approach 
as the lexer, this component will handle complex PDF objects such as 
dictionaries and arrays.
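
As a rough illustration of how composite objects can be built on top of a 
stream of basic tokens, the following sketch assembles arrays and 
dictionaries recursively. It is an assumption-laden simplification, not the 
SimpleParser itself: it takes already-separated token strings from an 
Iterator instead of extending a lexer, the class name CompositeParserSketch 
is made up, and simple objects are kept as plain strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: builds nested arrays/dictionaries from basic tokens.
public class CompositeParserSketch {
    private final Iterator<String> tokens;

    public CompositeParserSketch(Iterator<String> tokens) { this.tokens = tokens; }

    // Reads one complete object starting at the next token.
    public Object parseObject() {
        return parseFrom(tokens.next());
    }

    private Object parseFrom(String token) {
        if (token.equals("[")) {             // array: objects until "]"
            List<Object> array = new ArrayList<>();
            for (String t = tokens.next(); !t.equals("]"); t = tokens.next()) {
                array.add(parseFrom(t));
            }
            return array;
        }
        if (token.equals("<<")) {            // dictionary: key/value pairs until ">>"
            Map<String, Object> dict = new LinkedHashMap<>();
            for (String key = tokens.next(); !key.equals(">>"); key = tokens.next()) {
                dict.put(key, parseObject());
            }
            return dict;
        }
        return token;                        // simple object, kept as raw text
    }
}
```

Given the tokens <<, /Type, /Catalog, /Kids, [, 1, 2, ], >> this yields a 
map whose /Kids entry is a two-element list. Truncated input ends in a 
NoSuchElementException from the iterator; a real parser would need proper 
error handling.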

ConformingParser will then extend SimpleParser to deal with Streams and all 
other PDF structures such as Xrefs ...

The lexer is feature complete. There will be some refinements as I work on the 
SimpleParser, especially removing the dependency on java.io.RandomAccessFile. 
Timo Boehme offered some help here. Once the SimpleParser is ready I will 
submit the code for review.


                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, 
> conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
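
The "read from the end" step in the description above can be sketched as 
follows. This is only an illustration, not code from the attached patch: 
the class and method names are made up, the whole file is assumed to be in 
a byte array, and the 1024-byte tail window is an arbitrary choice for the 
example. It locates the last %%EOF marker, then the startxref keyword 
before it, and returns the byte offset written between the two.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: find the xref offset by scanning the end of the file.
public class StartXrefSketch {
    public static long findStartXref(byte[] pdf) {
        // Only look at the tail of the file (window size is an assumption).
        int tailLength = Math.min(pdf.length, 1024);
        String tail = new String(pdf, pdf.length - tailLength, tailLength,
                StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IllegalArgumentException("no %%EOF marker near end of file");
        }
        // Search backward from the EOF marker for the startxref keyword.
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IllegalArgumentException("no startxref keyword before %%EOF");
        }
        // The decimal offset of the xref section sits between the two.
        String offset = tail.substring(keyword + "startxref".length(), eof).trim();
        return Long.parseLong(offset);
    }
}
```

With the returned offset a reader can jump straight to the xref section 
and load further objects only when they are requested.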

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
