[jira] [Commented] (PDFBOX-1000) Conforming parser

Maruan Sahyoun (Commented) (JIRA) Sun, 01 Jan 2012 03:31:55 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178143#comment-13178143 ]


Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------

Thanks for your valuable feedback. I'll try to post a status update from time 
to time to keep you informed of the progress. 

Regarding startxref: my mistake, it is the EOF marker that is required [PDF 1.7 
App. H 18]. The idea behind the Acrobat parsing mode was to implement the notes 
in App. H. But I think you are right: two modes, Strict and Relaxed, should be 
enough.
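
To illustrate how the two modes might differ for the EOF marker, here is a 
minimal sketch. It is not code from the patch: the class and method names are 
hypothetical, the 1024-byte window is my reading of the App. H note cited 
above, and treating trailing whitespace as acceptable in strict mode is an 
assumption.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox: looks for the %%EOF marker. */
final class EofMarkerCheck {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.ISO_8859_1);
    // Assumed window size for the relaxed mode (last 1024 bytes of the file).
    private static final int RELAXED_WINDOW = 1024;

    /** Strict: %%EOF must end the file, ignoring trailing whitespace only. */
    static boolean hasEofStrict(byte[] pdf) {
        int end = pdf.length;
        while (end > 0 && isWhitespace(pdf[end - 1])) {
            end--;
        }
        return end >= EOF_MARKER.length
                && matchesAt(pdf, end - EOF_MARKER.length);
    }

    /** Relaxed: %%EOF may appear anywhere in the last RELAXED_WINDOW bytes. */
    static boolean hasEofRelaxed(byte[] pdf) {
        int lowest = Math.max(0, pdf.length - RELAXED_WINDOW);
        for (int i = pdf.length - EOF_MARKER.length; i >= lowest; i--) {
            if (matchesAt(pdf, i)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int i = 0; i < EOF_MARKER.length; i++) {
            if (data[offset + i] != EOF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n'
                || b == '\f' || b == 0;
    }
}
```

A file with junk after the marker would fail hasEofStrict but still pass 
hasEofRelaxed, which is the kind of difference I would expect between the 
two modes.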

For the documentation, I'm putting references to the spec into the code 
wherever structures defined by the spec are handled, to describe what is going 
on or where assumptions are made. A small sample:

                case DelimiterChars.OpeningAngleBracket: // Dictionary or Hex String
                        // This could be either the start of a 
                        // Dictionary [PDF 1.7: 3.2.6] or a 
                        // Hexadecimal String [PDF 1.7: 3.2.3],
                        // so we need to read the next character
                        // to make a decision
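
The decision described in that comment could look roughly like the following. 
This is a standalone sketch and not the parser's actual code: the class, enum 
and method names are hypothetical, and it works on a byte array instead of the 
parser's input stream to keep the example short.

```java
/** Hypothetical helper: decides what an opening '<' introduces. */
final class AngleBracketLookahead {
    enum Kind { DICTIONARY, HEX_STRING }

    /**
     * Looks one byte past the '<' at openBracketPos.
     * "<<" starts a dictionary [PDF 1.7: 3.2.6]; a single '<' followed by
     * anything else starts a hexadecimal string [PDF 1.7: 3.2.3].
     */
    static Kind classify(byte[] data, int openBracketPos) {
        if (data[openBracketPos] != '<') {
            throw new IllegalArgumentException(
                    "expected '<' at offset " + openBracketPos);
        }
        int next = openBracketPos + 1;
        if (next >= data.length) {
            throw new IllegalArgumentException(
                    "unexpected end of input after '<'");
        }
        return data[next] == '<' ? Kind.DICTIONARY : Kind.HEX_STRING;
    }
}
```

Note that an empty hex string "<>" is classified as HEX_STRING here, since 
only a second '<' selects the dictionary branch.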

At the moment I'm trying to get the code to a state where I can submit it and 
it does something useful. There will be TODOs, which I'm documenting within 
the code. At that point I'll be looking for feedback and help. One area that is 
still lacking is formal unit tests, although I'm testing individual functions 
against some PDFs I have as development moves forward. So I'm glad that you can 
commit your PDFs for unit testing.
                
> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, XrefEntry.java, 
> conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
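
The first step in the quoted description, reading from the end of the file to 
find the xref location, can be sketched as below. This is only an illustration 
under assumptions: the class and method names are hypothetical, the 1024-byte 
search window is an arbitrary choice, and a real lazy parser would seek to the 
tail of the file instead of taking the whole file as a byte array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: finds the offset that follows "startxref". */
final class StartXrefLocator {
    private static final String KEYWORD = "startxref";
    // Assumed size of the tail region to search.
    private static final int TAIL_WINDOW = 1024;

    /**
     * Returns the byte offset given after the last "startxref" keyword
     * in the tail of the file, or -1 if no keyword or number is found.
     */
    static long findStartXref(byte[] pdf) {
        int from = Math.max(0, pdf.length - TAIL_WINDOW);
        String tail = new String(pdf, from, pdf.length - from,
                StandardCharsets.ISO_8859_1);
        // The last occurrence belongs to the most recent revision.
        int keyword = tail.lastIndexOf(KEYWORD);
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + KEYWORD.length();
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < tail.length()
                && tail.charAt(i) >= '0' && tail.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no offset after it
        }
        return Long.parseLong(tail.substring(start, i));
    }
}
```

The returned offset is where the parser would then jump to read the xref 
table, without touching the rest of the file.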

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
