Hi,
On 21.04.2011 21:09, Thomas Chojecki wrote:
On 21.04.2011 01:50, Kevin Jackson wrote:
Basing the parser on RandomAccess would significantly improve the
accuracy of the parser. I have found many files which can be read by
The accuracy will be better, of course, but it can also bring problems. I think
tests with a lot of PDF documents will show it.
Adobe Reader and fail with PDFBox. The problem was that bytes that are
unreachable from the xref table can cause parsing errors. Using the
Adobe also employs ordinary people who try to write a good parser but also make
mistakes. Still, Adobe Reader is the reference for reading PDF documents and is
also the first thing people use to test their own product. The reader is
anything but spec-conformant, and people think that if Adobe Reader can read a
document, the document is good enough.
Yes, you are right, parsing only the xref table can cause problems. So can
parsing the whole file from the beginning :) This is a never-ending story.
This is originally Andreas's project and his opinion interests me. Did you try
something similar in the past? Or what problems did you have while programming
this parser?
No offense, I'd just like to clarify Thomas's statement.
PDFBox isn't my project. Ben Litchfield started the development in 2002, and
later Daniel Wilson and Philip Koch joined the team. It was first hosted on
SourceForge and entered the Apache Incubator after a software grant in 2008. In
2009 PDFBox finally became a top-level project. I received my invitation as a
committer in December 2008. As PMC Chair I'm the spokesperson of the project and
the interface to the ASF board. I don't have more power than any other committer
or PMC member.
This should also answer your question about the implementation of the parser: I
didn't write it.
However, I like the approach to improving the parser. I haven't yet had the time
to look into the details, but what I have read about it so far sounds good.
BR
Andreas Lehmkühler