[
https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178405#comment-13178405
]
Maruan Sahyoun commented on PDFBOX-1199:
----------------------------------------
I had a quick look at the changes and I think this is a very good step
forward. The new xref parsing should resolve a lot of current issues, as
should many of the other changes. As I'm currently working on PDFBOX-1000,
maybe we could have a quick chat about how to combine our efforts.
> Non-sequential PDF parser + PATCH
> ---------------------------------
>
> Key: PDFBOX-1199
> URL: https://issues.apache.org/jira/browse/PDFBOX-1199
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 1.6.0
> Reporter: Timo Boehme
> Attachments: 2012-01-02_NonSequentialParser.patch,
> NonSequentialPDFParser.java, RandomAccessBufferedFileInputStream.java
>
>
> Currently PDF parsing is done in a sequential manner, resulting in problems
> with stream parsing and skipping unused content. The solution is a conforming
> parser which first reads the XREF tables, uses this information to parse only
> the required objects, and uses the length information for stream parsing. A
> completely new implementation of such a parser is currently being worked on
> in PDFBOX-1000. While this parser will be the long-term solution, a
> short-term solution based on existing code would be desirable. A first
> incomplete solution was presented in PDFBOX-1104.
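A minimal Java sketch of the first step of the XREF-first approach described above: before parsing any object, read the byte offset that follows the last `startxref` keyword near the end of the file. The class name `StartXrefLocator` and the 2048-byte search window are assumptions made for illustration; this is not code from the attached patch.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of PDFBox): locates the XREF offset at the end of a PDF. */
final class StartXrefLocator {
    // Assumed search window; a tolerant parser may need to look further back.
    private static final int TAIL_SIZE = 2048;
    private static final String KEYWORD = "startxref";

    /** Returns the byte offset written after the last "startxref" keyword. */
    static long findStartXref(RandomAccessFile file) throws IOException {
        long length = file.length();
        int tailLength = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[tailLength];
        file.seek(length - tailLength);
        file.readFully(tail);

        // ISO-8859-1 maps every byte to exactly one char, so indices match byte positions.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keywordPos = text.lastIndexOf(KEYWORD);
        if (keywordPos < 0) {
            throw new IOException("Missing 'startxref' keyword");
        }

        // Skip whitespace after the keyword, then collect the decimal digits.
        int start = keywordPos + KEYWORD.length();
        while (start < text.length() && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        int end = start;
        while (end < text.length() && text.charAt(end) >= '0' && text.charAt(end) <= '9') {
            end++;
        }
        if (end == start) {
            throw new IOException("No offset after 'startxref'");
        }
        return Long.parseLong(text.substring(start, end));
    }
}
```

The returned offset is where a parser would seek to read the XREF table or XREF stream; following `/Prev` entries and handling damaged files are deliberately left out of this sketch.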
> Starting from PDFBOX-1104 I have implemented an 'as much as possible'
> conforming parser, called 'non-sequential parser', which handles all PDF
> documents (even inlined, with object streams etc.). The parser can be used as
> a drop-in replacement for PDFParser (it is a subclass of PDFParser). It
> overrides the parse and getPage methods. The only restriction currently is
> the need to specify a file instead of an input stream. In order to read the
> file efficiently and use it with the existing object parsing code I developed
> a RandomAccessBufferedFileInputStream which allows InputStream operations in
> combination with seek operations and caches the data it has read.
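To illustrate the idea of an InputStream that also supports seeking and keeps already-read data cached, here is a rough Java sketch. It is not the attached RandomAccessBufferedFileInputStream; the class name `SeekableCachedInputStream`, the 4096-byte page size and the 64-page LRU cache are all assumptions chosen for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an InputStream over a file with seek() and a small page cache. */
final class SeekableCachedInputStream extends InputStream {
    private static final int PAGE_SIZE = 4096; // assumed page size
    private static final int MAX_PAGES = 64;   // assumed cache capacity

    private final RandomAccessFile file;
    private final long length;
    private long position = 0;

    // Access-ordered LinkedHashMap used as an LRU cache of file pages.
    private final Map<Long, byte[]> pages =
        new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > MAX_PAGES;
            }
        };

    SeekableCachedInputStream(File source) throws IOException {
        file = new RandomAccessFile(source, "r");
        length = file.length();
    }

    /** Moves the logical read position; no disk access happens until the next read. */
    void seek(long newPosition) throws IOException {
        if (newPosition < 0) {
            throw new IOException("Negative seek offset: " + newPosition);
        }
        position = newPosition;
    }

    long getPosition() {
        return position;
    }

    /** Returns the cached page, loading it from disk on a cache miss. */
    private byte[] page(long index) throws IOException {
        byte[] data = pages.get(index);
        if (data == null) {
            long start = index * PAGE_SIZE;
            data = new byte[(int) Math.min(PAGE_SIZE, length - start)];
            file.seek(start);
            file.readFully(data);
            pages.put(index, data);
        }
        return data;
    }

    @Override
    public int read() throws IOException {
        if (position >= length) {
            return -1; // end of file
        }
        byte[] data = page(position / PAGE_SIZE);
        int value = data[(int) (position % PAGE_SIZE)] & 0xFF;
        position++;
        return value;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Only the single-byte `read()` is overridden here, so bulk reads fall back to the default `InputStream` implementation; a production class would also override `read(byte[], int, int)` to copy directly from the cached pages.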
> In order to use NonSequentialPDFParser, small changes and additions to
> existing classes are needed. These include changing some methods/fields from
> private to protected in PDFParser, adding parsing of stream object
> information from XREF streams, storing this information in and retrieving it
> from XrefTrailerResolver (object ids are stored negated in order to
> distinguish them from offsets), and allowing the offset in
> PushBackInputStream to be reset. None of these changes alter the behavior of
> the current parser. Another requirement is the long offset patch
> (PDFBOX-1196), which is excluded from the patch set provided here.
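The remark about storing object ids negated can be pictured with the following Java sketch: a single map holds either a non-negative file offset or the negated number of the object stream that contains the object. The class `XrefLocationTable` and its method names are hypothetical and do not reflect the actual XrefTrailerResolver API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of one map holding both file offsets and object stream numbers. */
final class XrefLocationTable {
    // Value >= 0: byte offset in the file. Value < 0: negated object stream number.
    private final Map<Long, Long> entries = new HashMap<Long, Long>();

    void putOffset(long objectNumber, long byteOffset) {
        if (byteOffset < 0) {
            throw new IllegalArgumentException("Offset must not be negative");
        }
        entries.put(objectNumber, byteOffset);
    }

    void putObjectStream(long objectNumber, long objectStreamNumber) {
        // Object number 0 is never an object stream, so negation is unambiguous.
        if (objectStreamNumber <= 0) {
            throw new IllegalArgumentException("Object stream number must be positive");
        }
        entries.put(objectNumber, -objectStreamNumber);
    }

    boolean isInObjectStream(long objectNumber) {
        Long value = entries.get(objectNumber);
        return value != null && value < 0;
    }

    /** Returns the number of the object stream that contains the given object. */
    long objectStreamNumber(long objectNumber) {
        Long value = entries.get(objectNumber);
        if (value == null || value >= 0) {
            throw new IllegalStateException("Object " + objectNumber + " is not in an object stream");
        }
        return -value;
    }

    /** Returns the byte offset of an object that is stored directly in the file. */
    long offset(long objectNumber) {
        Long value = entries.get(objectNumber);
        if (value == null || value < 0) {
            throw new IllegalStateException("Object " + objectNumber + " has no file offset");
        }
        return value;
    }
}
```

The benefit of this encoding is that existing code which only expects a map of offsets keeps working, while the new parser can check the sign to decide whether it has to load an object stream first.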
> The provided parser currently works in a forceParsing=false mode, resulting
> in an IOException if a parsing error occurs. In most cases this shouldn't be
> a problem since, in my use cases, exceptions typically occur when trying to
> parse unused content or streams, which are no longer a problem with this new
> parser. In my setup I use the new parser first and, if a parsing error
> occurs, fall back to the sequential parser (a bit like Acrobat does if the
> XREF information is buggy):
> try {
>     // ---- try first with (mostly) standards-conforming parsing
>     doc = PDDocument.loadNonSeq( PDF_FILE, raBuf );
>     handleDocument(doc);
> } catch ( IOException ioe ) {
>     // ---- retry with sequential parser and force parsing
>     doc = PDDocument.load( new FileInputStream(PDF_FILE), raBuf, true );
>     handleDocument(doc);
> }
> For me this new parser works very well on large document collections and is a
> large step towards parsing all documents that common PDF tools also accept.
> While its behavior is nearly conforming, there is nevertheless a need for a
> clean 'real' conforming parser. For instance, since the underlying object
> structure has no access to the parser, it is necessary to parse all objects
> before they can be used. This includes objects that might not be needed at
> all. Another step that is normally unnecessary is copying the content of a
> stream: since we work on a file with random access, there would be no need
> for it. However, this parser should fill the gap until a full-featured and
> clean conforming parser is available.