Hi,

On 21.04.2011 21:09, Thomas Chojecki wrote:
On 21.04.2011 01:50, Kevin Jackson wrote:
Basing the parser on RandomAccess would significantly improve the
accuracy of the parser. I have found many files which can be read by

The accuracy will be better, of course, but it can also bring problems. I think
tests with a lot of PDF documents will show it.

Adobe Reader and fail with PDFBox. The problem was that bytes that are
unreachable from the xref table can cause parsing errors. Using the

Adobe also employs ordinary people who try to build a good parser but make
mistakes too. Still, Adobe Reader is the reference for reading PDF documents, and
it is also the first thing people use to test their own product. The reader is
anything but spec-conformant, and people think that if Adobe Reader can read a
document, the document is good enough.

Yeah, you are right, parsing only the xref table can cause problems. So can parsing
the whole file from the beginning :) This is a never-ending story.
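
To make the two strategies discussed above concrete, here is a minimal sketch of the random-access idea: read `startxref` from the end of the file, parse the xref table, and jump straight to each object. It is written in Python for brevity and is not PDFBox code. The helper names (`build_toy_file`, `find_startxref`, `read_xref_offsets`) and the toy file are assumptions made up for this illustration, and the xref reader only handles a single classic subsection (no xref streams, no incremental updates).

```python
import re


def build_toy_file() -> bytes:
    """Build a tiny PDF-like byte string with junk that no xref entry points to."""
    header = b"%PDF-1.4\n"
    # Hypothetical unreachable bytes: a sequential parser would trip over these.
    garbage = b"<< /Broken (unterminated\n"
    obj = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    obj_offset = len(header) + len(garbage)
    body = header + garbage + obj
    xref_offset = len(body)
    xref = b"xref\n0 2\n0000000000 65535 f \n" + b"%010d 00000 n \n" % obj_offset
    trailer = (
        b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_offset
    )
    return body + xref + trailer


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the offset that follows the last 'startxref' near the end of the file."""
    tail = data[-tail_size:]
    pos = tail.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found near end of file")
    match = re.match(rb"\s*(\d+)", tail[pos + len(b"startxref"):])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def read_xref_offsets(data: bytes, xref_offset: int) -> dict:
    """Map object number -> byte offset for one classic xref subsection."""
    lines = data[xref_offset:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("expected 'xref' keyword at the given offset")
    first, count = (int(x) for x in lines[1].split())
    offsets = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":  # 'n' marks an in-use object, 'f' a free entry
            offsets[first + i] = int(offset)
    return offsets


data = build_toy_file()
offsets = read_xref_offsets(data, find_startxref(data))
for number, offset in offsets.items():
    # Jump directly to each object; the junk bytes are never visited.
    print(number, data[offset:offset + 7])
```

The flip side, as noted above, is that this approach relies entirely on the xref table being correct; a file with a damaged or missing table would still need some sequential fallback.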

This is originally Andreas' project, and his opinion interests me. Did you try
something similar in the past? Or what problems did you have while programming
this parser?
No offense, I'd just like to clarify Thomas' statement.
PDFBox isn't my project. Ben Litchfield started the development in 2002, and later on Daniel Wilson and Philip Koch joined the team. It was first hosted on SourceForge and entered the Apache Incubator after a software grant in 2008. In 2009 PDFBox finally became a top-level project. I got my invitation as a committer in December 2008. As PMC Chair I'm the spokesperson of the project and the interface to the ASF board. I don't have more power than any other committer or PMC member.

This should also answer your question about the implementation of the parser. I didn't do that.

However, I like the approach to improving the parser. I haven't yet had the time to look into the details, but what I have read about it so far sounds good.


BR
Andreas Lehmkühler
