[ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024985#comment-13024985
 ] 

Thomas Chojecki commented on PDFBOX-1000:
-----------------------------------------

Hi Adam,

First of all, thanks for publishing the code. I think you forgot one class:
"org.apache.pdfbox.pdmodel.common.XrefEntry".

@ 1)
I took a look at [1] and can't find an error.
Indirect object 31 is a dictionary object with four key-value pairs, as
follows:

The first entry has the name object "Length" and refers to indirect
object 45. So you need to look up object 45 in the xref table to get the
value (e.g. 45 0 obj 500 endobj).
The other three entries, "Length1", "Length2" and "Length3", have the
integer values 568, 1017 and 0.

To parse the key-value pairs: each key is a name object beginning with /
(0x2F), immediately followed by the name, without whitespace. After the key you
will find a blank (0x20) and the related value. If the value is itself a name
object, the blank may be omitted, because / is a delimiter.

So if you want to read the whole of object 31, you also need to resolve
object 45.

For more information about the objects, see sections 7.3 and 7.3.7 of
the spec.
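
To make the key-value rules above concrete, here is a minimal sketch in Java.
It is not PDFBox code: the class DictSketch and its methods are hypothetical,
and it only handles names, integers and "n g R" references (no strings, arrays
or nested dictionaries).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of PDFBox.
final class DictSketch {

    /** Splits a dictionary body into tokens; '/' starts a new token even without a blank before it. */
    static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (Character.isWhitespace(c) || c == '/') {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (c == '/') {
                    current.append(c);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Maps each key (without the slash) to its raw value: "568", "45 0 R" or "/Name". */
    static Map<String, String> parse(String body) {
        List<String> t = tokenize(body);
        Map<String, String> dict = new LinkedHashMap<>();
        int i = 0;
        while (i < t.size()) {
            String key = t.get(i++);
            if (!key.startsWith("/")) {
                throw new IllegalArgumentException("expected a name, got: " + key);
            }
            if (i >= t.size()) {
                throw new IllegalArgumentException("missing value for " + key);
            }
            String value = t.get(i++);
            // Three tokens "n g R" together form one indirect reference.
            if (i + 1 < t.size() && "R".equals(t.get(i + 1))) {
                value = value + " " + t.get(i) + " R";
                i += 2;
            }
            dict.put(key.substring(1), value);
        }
        return dict;
    }
}
```

Calling DictSketch.parse("/Length 45 0 R /Length1 568 /Length2 1017 /Length3 0")
gives four entries; the value "45 0 R" for "Length" still has to be resolved
through the xref table before the stream length is known.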

Have you taken a look at the current parser? It splits the work into small
parts, like parsing objects and parsing the trailer, and each object type has
its own parsing rules. For example, when you find an indirect object you parse
the prefix first (number generation obj), then you parse the object itself
(parseObject()). The next byte will be whitespace or a delimiter, such as a
space, a line feed or maybe a "less-than sign"; you will find more in section
7.2.2, tables 1 and 2. Then you know you will find the key, beginning with a /
and followed by the name. After the name you need to parse an object again.
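
As an illustration of the character classes from section 7.2.2 (tables 1 and
2), a small sketch in Java follows. The class PdfCharsSketch and its method
names are made up for this example and are not the existing parser's API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class PdfCharsSketch {

    /** White-space characters: NUL, HT, LF, FF, CR, SP. */
    static boolean isWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /** Delimiter characters: ( ) < > [ ] { } / % */
    static boolean isDelimiter(int b) {
        return "()<>[]{}/%".indexOf(b) >= 0;
    }

    /** A regular token ends at end of data (b < 0), white space or a delimiter. */
    static boolean endsToken(int b) {
        return b < 0 || isWhitespace(b) || isDelimiter(b);
    }

    /** Skips leading white space at pos, then reads regular characters up to the next token end. */
    static String readToken(byte[] data, int pos) {
        int i = pos;
        while (i < data.length && isWhitespace(data[i] & 0xFF)) {
            i++;
        }
        int start = i;
        while (i < data.length && !endsToken(data[i] & 0xFF)) {
            i++;
        }
        return new String(data, start, i - start, StandardCharsets.ISO_8859_1);
    }
}
```

With something like this, reading the prefix of "31 0 obj" is three calls to
readToken, and a name such as "Length" ends correctly even when the next
character is a / with no blank in between.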

It is hard to explain properly how it works. The current parser does a good job
and should not be replaced completely; maybe some parts can be copied.

String objects start and end with parentheses. If the text itself contains
parentheses, they shall be balanced; if not, you need to escape them. See
section 7.3.4.2.
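
A rough sketch of the balanced-parentheses rule in Java, just to show the
idea. LiteralStringSketch and findEnd are hypothetical names, and the method
only finds where the string ends; it does not decode the escape sequences.

```java
// Hypothetical sketch, not part of PDFBox.
final class LiteralStringSketch {

    /**
     * Returns the index of the ')' that closes the literal string opened at
     * index 'open', or -1 if the string is never closed.
     */
    static int findEnd(String s, int open) {
        if (s.charAt(open) != '(') {
            throw new IllegalArgumentException("no '(' at index " + open);
        }
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;            // skip the escaped character, e.g. \( or \)
            } else if (c == '(') {
                depth++;        // nested, balanced parenthesis
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i;   // this one closes the string
                }
            }
        }
        return -1;
    }
}
```

For the input "(a (b) c) rest" the closing parenthesis is the one after "c",
not the one after "b"; an unbalanced parenthesis inside the text only works
when it is written with a backslash.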

@ 2)
Is the dictionary parsed before the xref table? If you want to do it in a
spec-conforming way, the first thing is to find the whole trailer with the
startxref. Then you know where to find the root dictionary and the xref table,
so you can parse the xref table first.
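
Finding the startxref value could look roughly like this in Java. This is only
a sketch under assumptions: StartXrefSketch and findStartXref are made-up
names, and the caller is expected to pass in the last bytes of the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of PDFBox.
final class StartXrefSketch {

    /**
     * Looks for the last "startxref" keyword in the tail of a file and returns
     * the byte offset written between it and the following "%%EOF" marker.
     * Returns -1 if the tail does not contain a well-formed entry.
     */
    static long findStartXref(byte[] tail) {
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int marker = text.lastIndexOf("startxref");
        if (marker < 0) {
            return -1;
        }
        int eof = text.indexOf("%%EOF", marker);
        if (eof < 0) {
            return -1;
        }
        String number = text.substring(marker + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

The tail could be read with java.io.RandomAccessFile by seeking to roughly
length() - 1024 and reading to the end; the returned offset is where the xref
table starts.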

Most information about the document can be extracted from the trailer and
the root dictionary. Inside the root dict you can find the page dictionary (I
hope this can be parsed lazily), and you can also find the AcroForm entry with
forms and annotations. I think there is more information, but I haven't studied
all of it.

Parsing the page dictionary gives you the page structure as a tree, which
refers to most of the objects in the PDF. But I don't know exactly how this
works. To create a lazy parser, someone needs to study this part of the
spec.
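
One possible shape for lazy page lookup, as a sketch in Java. It assumes that
every page tree node carries a /Count (number of pages below it) and a /Kids
list, as described in the spec; the in-memory Node class and all names here
are hypothetical stand-ins for objects that a real parser would load on demand.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
final class PageTreeSketch {

    static final class Node {
        final String name;      // label for the demo only
        final int count;        // stands in for /Count; 1 for a single page
        final List<Node> kids;  // stands in for /Kids; empty for a page

        Node(String name, int count, List<Node> kids) {
            this.name = name;
            this.count = count;
            this.kids = kids;
        }
    }

    static Node page(String name) {
        return new Node(name, 1, Collections.<Node>emptyList());
    }

    static Node pages(String name, Node... kids) {
        int count = 0;
        for (Node kid : kids) {
            count += kid.count;
        }
        return new Node(name, count, Arrays.asList(kids));
    }

    /** Finds the page with the given zero-based index, descending only into the subtree that holds it. */
    static Node findPage(Node node, int index) {
        if (index < 0 || index >= node.count) {
            throw new IndexOutOfBoundsException("page " + index);
        }
        if (node.kids.isEmpty()) {
            return node;
        }
        for (Node kid : node.kids) {
            if (index < kid.count) {
                return findPage(kid, index);
            }
            index -= kid.count;  // skip this whole subtree without visiting it
        }
        throw new IllegalStateException("/Count does not match /Kids");
    }
}
```

The point of the sketch is that subtrees which do not contain the requested
page are skipped using only their count, so their objects would never need to
be read from the file.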

@ 3)
I will take a look at the classes in the next few days and also try to work on
them. Is there an easier way to commit changes to it, like an extra repository?
I can provide a CVS repository if this helps.

Otherwise I will try to do the RandomAccessFile-like structure for PDFBox.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: ConformingPDDocument.java, ConformingPDFParser.java, 
> ConformingPDFParserTest.java, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
