Re: RandomAccessFile for PDFBox

Thomas Chojecki Thu, 21 Apr 2011 12:07:34 -0700

On 21.04.2011 01:50, Kevin Jackson wrote:

Basing the parser on RandomAccess would significantly improve the
accuracy of the parser.  I have found many files which can be read by

Accuracy will improve, of course, but it can also bring problems. I think tests with a lot of PDF documents will show it.

Adobe Reader and fail with PDFBox.  The problem was that bytes that are
unreachable from the xref table can cause parsing errors. Using the

Adobe also employs ordinary people who try to write a good parser but also make mistakes. Still, Adobe Reader is the reference for reading PDF documents and the first thing people test their own product against. The Reader is anything but spec conformant, and people think that if Adobe Reader can read a document, the document is good enough.

Yeah, you are right, parsing only the xref table can cause problems. So can parsing the whole file from the beginning :) It is a never-ending story.

This was originally Andreas' project, and his opinion interests me. Did you try something similar in the past? Or what problems did you have while programming this parser?

force flag helped but didn't solve everything.  I also found documents
where the unreachable stuff was just valid enough to cause reachable
valid data to be skipped.  This resulted in null pointer exceptions
later.

These documents should be collected in one place and analyzed to improve PDFBox. Better still, use them for JUnit tests.

Reading PDF files sequentially like PDFBox currently does works fine
most of the time but basing the parse off of the offsets in the xref
table would be more reliable.

Yeah, reliable and spec conformant, but I think it will also cause problems when parsing malformed documents. I can't imagine that Adobe Reader uses only the xref table to find the proper objects.

Kevin Jackson

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Wednesday, April 20, 2011 11:57 AM
To: [email protected]
Subject: Re: RandomAccessFile for PDFBox

Yeah, I'll create a JIRA issue for the conforming parser, which uses a
RandomAccess, and I'll attach what I have done so far.  Right now the
new parser can read in the trailer and xref table, which allows it to
jump directly to any object which needs to be loaded.  Per the PDF
spec, it starts at the end of the file and goes backward to get the EOF
flag, xref location, and trailer information.

Great news :) I hope this also works for incremental updates and xref streams. ;)
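
A minimal sketch of the backward read described in the quote above: it reads only the tail of the file, finds the %%EOF marker, and parses the offset that follows the startxref keyword. The class name StartXrefLocator, the method findStartXref, and the 1024-byte tail window are assumptions for illustration; none of this is PDFBox API or the parser attached to the JIRA issue.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class StartXrefLocator {

    // Assumed size of the window at the end of the file that is searched.
    private static final int TAIL_SIZE = 1024;

    // Returns the byte offset written after the last "startxref" keyword.
    static long findStartXref(RandomAccessFile raf) throws IOException {
        long length = raf.length();
        int tailSize = (int) Math.min(TAIL_SIZE, length);

        // Read only the end of the file instead of parsing from the start.
        byte[] tail = new byte[tailSize];
        raf.seek(length - tailSize);
        raf.readFully(tail);

        // ISO-8859-1 maps every byte to exactly one char, so indexes match.
        String text = new String(tail, StandardCharsets.ISO_8859_1);

        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("missing %%EOF marker");
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IOException("missing startxref keyword");
        }

        String number = text.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            throw new IOException("invalid startxref offset: " + number, e);
        }
    }
}
```

The returned offset is where a caller would seek() to read the xref section. The sketch does not handle incremental updates (earlier xref sections chained through the trailer's /Prev entry) or xref streams, and a malformed file with a wrong offset would still need a fallback.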

Thomas has the right idea, read in the minimum amount of objects
possible and then read the rest if/when the user requests them.  Last I
remember I was still working on parsing all the different types of
objects so I could read the Root entry.

Implementing such a parser is a lot of work. I think this work should be done in two steps. First, create a storage for the document. If this works, an alternative parser will be a good next step.
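
The idea quoted above, reading as few objects as possible and loading the rest on request, could be sketched roughly as follows. LazyObjectStore, its methods, and the offset map are hypothetical names for illustration and not PDFBox classes; the sketch assumes the object offsets have already been read from the xref table and returns the raw object text rather than a parsed object.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage sketch, not part of PDFBox.
final class LazyObjectStore {

    private final RandomAccessFile raf;
    // Object number -> byte offset, assumed to come from the xref table.
    private final Map<Integer, Long> offsets;
    // Objects that have already been read from the file.
    private final Map<Integer, String> cache = new HashMap<>();
    private int fileReads = 0;

    LazyObjectStore(RandomAccessFile raf, Map<Integer, Long> offsets) {
        this.raf = raf;
        this.offsets = offsets;
    }

    // Returns the raw text of an object, reading it from the file on first use only.
    String get(int objectNumber) throws IOException {
        String cached = cache.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            throw new IOException("object " + objectNumber + " is not in the xref table");
        }

        raf.seek(offset);
        fileReads++;
        StringBuilder sb = new StringBuilder();
        int b;
        // Byte-wise read up to the first "endobj"; simple, but slow for large objects.
        while ((b = raf.read()) != -1) {
            sb.append((char) (b & 0xFF));
            int start = sb.length() - "endobj".length();
            if (start >= 0 && sb.indexOf("endobj", start) == start) {
                String raw = sb.toString();
                cache.put(objectNumber, raw);
                return raw;
            }
        }
        throw new IOException("object " + objectNumber + " has no endobj keyword");
    }

    // Number of times the file was actually accessed.
    int getFileReads() {
        return fileReads;
    }
}
```

Stopping at the first "endobj" is a simplification: stream data may contain those bytes, so a real parser would use the stream's /Length entry instead of scanning.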



BR
Thomas
