[ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025339#comment-13025339
 ] 

Adam Nichols commented on PDFBOX-1000:
--------------------------------------

I updated readWord as described above (ending a "word" on characters like '/', 
']', etc.) and was able to remove all the ugly hacks.  I confirmed that it 
worked on my test PDF.
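
For anyone following along, here is a minimal sketch of the delimiter-aware word reading described above. It is not the actual readWord code from the attached parser: the class name WordReaderSketch, the helper methods, and the use of PushbackReader are assumptions made for illustration. The delimiter set is the one listed in ISO 32000-1, section 7.2.2.

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical illustration only; not the code attached to this issue.
class WordReaderSketch {
    // PDF delimiter characters per ISO 32000-1, section 7.2.2.
    private static final String DELIMITERS = "()<>[]{}/%";

    // PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    static boolean isDelimiter(int c) {
        return c != -1 && DELIMITERS.indexOf(c) >= 0;
    }

    // Skips leading white space, then reads characters until white space,
    // a delimiter, or end of input. A terminating delimiter is pushed back
    // so the caller can still see it.
    static String readWord(PushbackReader in) throws IOException {
        int c = in.read();
        while (c != -1 && isWhitespace(c)) {
            c = in.read();
        }
        StringBuilder word = new StringBuilder();
        while (c != -1 && !isWhitespace(c) && !isDelimiter(c)) {
            word.append((char) c);
            c = in.read();
        }
        if (isDelimiter(c)) {
            in.unread(c);
        }
        return word.toString();
    }
}
```

With this approach, input such as "Type/Catalog]" yields "Type" and leaves the '/' in the stream, so the caller does not need special cases for names or array ends that directly follow a token. Note that the sketch returns an empty string when the first non-white-space character is itself a delimiter.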

I've started work on the lazy evaluation by creating a COSUnread object, which is 
just a placeholder to let us know that the object hasn't been read yet.  
That'll allow reading an indirect reference as a COSObject consisting of: an 
objectNumber, generation, and COSUnread.  Later, when we need the data, the 
COSUnread will be replaced with the actual object.  Or at least that's how I 
imagine it working...
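
A rough sketch of how the placeholder idea could look, under some assumptions: only COSUnread is a name from this comment, while LazyCOSObject, its fields, and the Supplier-based loader are hypothetical stand-ins and not PDFBox's real COSObject API. In a real parser the loader would seek to the offset given by the xref table and parse the object there.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; not PDFBox's actual classes.
class LazyObjectSketch {
    // Marker meaning "this object has not been read from the file yet".
    static final class COSUnread {
        static final COSUnread INSTANCE = new COSUnread();
        private COSUnread() { }
    }

    // An indirect reference: object number, generation, and a value that
    // starts out as COSUnread and is replaced on first access.
    static final class LazyCOSObject {
        final long objectNumber;
        final int generation;
        private final Supplier<Object> loader; // assumed: parses the object on demand
        private Object value = COSUnread.INSTANCE;

        LazyCOSObject(long objectNumber, int generation, Supplier<Object> loader) {
            this.objectNumber = objectNumber;
            this.generation = generation;
            this.loader = loader;
        }

        boolean isRead() {
            return value != COSUnread.INSTANCE;
        }

        // Replaces the placeholder with the real object the first time
        // the data is needed; later calls reuse the loaded value.
        Object getObject() {
            if (value == COSUnread.INSTANCE) {
                value = loader.get();
            }
            return value;
        }
    }
}
```

The point of the placeholder is that building the reference costs nothing beyond storing two numbers; the file is only touched when getObject() is first called.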

I'll post the code again once I'm at least able to read the trailer in a lazy 
way, and am able to retrieve the info by automagically reading the data when a 
COSUnread is found.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: ConformingPDDocument.java, ConformingPDFParser.java, 
> ConformingPDFParserTest.java, XrefEntry.java, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
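
To make the "read from the end" step in the description concrete, here is a minimal sketch that finds the xref offset from the tail of a file. It is not the attached ConformingPDFParser code: the class and method names are hypothetical, the 1024-byte tail window is an arbitrary assumption, and the whole file is taken as a byte array only to keep the example short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the code attached to this issue.
class StartXrefSketch {
    // Looks at the last bytes of the file for the %%EOF marker, then for
    // the startxref keyword before it, and returns the byte offset that
    // sits between the two.
    static long findStartXref(byte[] pdf) {
        int tailLen = Math.min(pdf.length, 1024); // assumed window size
        String tail = new String(pdf, pdf.length - tailLen, tailLen,
                StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IllegalArgumentException("no %%EOF marker found");
        }
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }
        String offset = tail.substring(keyword + "startxref".length(), eof).trim();
        return Long.parseLong(offset);
    }
}
```

The returned offset is where the xref section begins; reading the xref table and trailer from there is what lets everything else be loaded on demand.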

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
