On 21.04.2011 at 01:50, Kevin Jackson wrote:
Basing the parser on RandomAccess would significantly improve the
accuracy of the parser.  I have found many files which can be read by

The accuracy will be better, of course, but it can also bring problems. I think tests with a lot of PDF documents will show that.

Adobe Reader and fail with PDFBox.  The problem was that bytes that are
unreachable from the xref table can cause parsing errors. Using the

Adobe also employs ordinary people who try to write a good parser but make mistakes too. Still, Adobe Reader is the reference for reading PDF documents, and it is the first thing people use to test their own product. The Reader is anything but spec-conformant, and people think that if Adobe Reader can read a document, the document is good enough.

Yeah, you are right. Parsing only via the xref table can cause problems, and so can parsing the whole file from the beginning :) It is a never-ending story.

This was originally Andreas's project, and his opinion interests me. Did you try something similar in the past? Or what problems did you have while programming this parser?

force flag helped but didn't solve everything.  I also found documents
where the unreachable stuff was just valid enough to cause reachable
valid data to be skipped.  This resulted in null pointer exceptions
later.

These documents should be collected in one place and analyzed to improve PDFBox. Even better would be to use them for JUnit tests.

Reading PDF files sequentially like PDFBox currently does works fine
most of the time but basing the parse off of the offsets in the xref
table would be more reliable.
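
As a rough illustration of offset-based lookup (not PDFBox code; the class and method names are hypothetical), a classic xref section can be turned into a map from object number to byte offset. This sketch assumes a plain-text xref table and ignores xref streams and incremental updates:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class XrefTableSketch {

    /** Maps each in-use object number to its byte offset in the file. */
    static Map<Integer, Long> parse(String section) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        int nextObject = 0;
        for (String raw : section.split("\\r?\\n|\\r")) {
            String line = raw.trim();
            if (line.isEmpty() || line.equals("xref")) {
                continue;
            }
            if (line.startsWith("trailer")) {
                break; // the trailer dictionary follows the table
            }
            String[] parts = line.split("\\s+");
            if (parts.length == 2) {
                // Subsection header: "<first object number> <entry count>"
                nextObject = Integer.parseInt(parts[0]);
            } else if (parts.length == 3) {
                // Entry: "<10-digit offset> <5-digit generation> <n|f>"
                if (parts[2].equals("n")) {
                    offsets.put(nextObject, Long.parseLong(parts[0]));
                }
                nextObject++; // free ("f") entries still consume a number
            }
        }
        return offsets;
    }
}
```

With such a map, a parser could seek straight to an object instead of scanning past unreachable bytes. Malformed tables would still need a fallback, which this sketch does not attempt.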

Yeah, reliable and spec-conformant, but I think it will also cause problems when parsing malformed documents. I can't imagine that Adobe Reader uses only the xref table to find the proper objects.


Kevin Jackson

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Wednesday, April 20, 2011 11:57 AM
To: [email protected]
Subject: Re: RandomAccessFile for PDFBox

Yeah, I'll create a JIRA issue for the conforming parser, which uses a
RandomAccess, and I'll attach what I have done so far.  Right now the new
parser can read in the trailer and xref table, which allows it to jump
directly to any object which needs to be loaded.  Per the PDF spec, it
starts at the end of the file and goes backward to get the EOF flag, xref
location, and trailer information.
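
A minimal sketch of that backward step, assuming the last 1024 bytes of the file contain both the `startxref` keyword and the `%%EOF` marker. The class name, method names, and tail size are illustrative assumptions, not the parser attached to the JIRA issue:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class StartXrefLocator {

    // Assumed tail size; real files may need a larger window.
    private static final int TAIL_SIZE = 1024;

    /** Reads the end of the file and returns the xref offset, or -1. */
    static long findStartXref(RandomAccessFile file) throws IOException {
        long length = file.length();
        int tailLength = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[tailLength];
        file.seek(length - tailLength);
        file.readFully(tail);
        // ISO-8859-1 maps every byte to one char, so indexes stay aligned.
        return parseStartXref(new String(tail, StandardCharsets.ISO_8859_1));
    }

    /** Extracts the number between the last "startxref" and "%%EOF". */
    static long parseStartXref(String tail) {
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String number = tail.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            return -1; // malformed tail; a caller could fall back to a scan
        }
    }
}
```

Taking the last `%%EOF` means the most recent xref section is found first when a file has been updated incrementally; following older sections is left out here.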

Great news :) I hope this also works for incremental updates and xref streams. ;)

Thomas has the right idea, read in the minimum amount of objects possible
and then read the rest if/when the user requests them.  Last I remember I
was still working on parsing all the different types of objects so I could
read the Root entry.
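
The load-on-request idea could look roughly like the following. This is only a sketch under assumptions: `LazyObjectPool` and its loader callback are hypothetical names, and the loader stands in for whatever code parses an object at a given byte offset:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical helper, not part of PDFBox.
final class LazyObjectPool<T> {

    private final Map<Integer, Long> offsets;   // object number -> byte offset
    private final LongFunction<T> loader;       // parses the object at an offset
    private final Map<Integer, T> cache = new HashMap<>();
    private int loadCount = 0;

    LazyObjectPool(Map<Integer, Long> offsets, LongFunction<T> loader) {
        this.offsets = offsets;
        this.loader = loader;
    }

    /** Returns the object, parsing it only on first request; null if unknown. */
    T get(int objectNumber) {
        T cached = cache.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            return null; // not listed in the xref data
        }
        T loaded = loader.apply(offset);
        loadCount++;
        cache.put(objectNumber, loaded);
        return loaded;
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loadCount;
    }
}
```

Only the objects a caller asks for are ever parsed, so opening a document would cost little more than reading the xref data and the Root entry.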

Implementing such a parser is a lot of work. I think the work should be done in two steps: first, create a storage layer for the document; if that works, an alternative parser would be a good next step.


BR
Thomas
