[ https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timo Boehme reassigned PDFBOX-1199:
-----------------------------------

    Assignee: Timo Boehme

> Non-sequential PDF parser + PATCH
> ---------------------------------
>
>                 Key: PDFBOX-1199
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1199
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>            Assignee: Timo Boehme
>         Attachments: 2012-01-02_NonSequentialParser.patch,
> NonSequentialPDFParser.java, RandomAccessBufferedFileInputStream.java
>
>
> Currently PDF parsing is done sequentially, which causes problems with
> stream parsing and with skipping unused content. The solution is a conforming
> parser which first reads the XREF tables, uses this information to parse only
> the required objects, and uses the length information for stream parsing. A
> completely new implementation of such a parser is currently being worked on
> in PDFBOX-1000. While that parser will be the long-term solution, a
> short-term solution based on existing code would be desirable. A first,
> incomplete solution was presented in PDFBOX-1104.
> Starting from PDFBOX-1104 I have implemented an 'as much as possible'
> conforming parser, called 'non-sequential parser', which handles all PDF
> documents (even inlined, with object streams etc.). The parser can be used as
> a drop-in replacement for PDFParser (it is a subclass of PDFParser). It
> overrides the parse and getPage methods. The only restriction is currently
> the need to specify a file instead of an input stream. In order to read the
> file efficiently and use it with the existing object parsing code I developed
> a RandomAccessBufferedFileInputStream, which allows InputStream operations in
> combination with seek operations and caches the data it has read.
> In order to use NonSequentialPDFParser, small changes and additions to
> existing classes are needed.
> This includes changing some methods/fields from private to protected in
> PDFParser, adding parsing of stream object information from XREF streams,
> storing this information in and retrieving it from XrefTrailerResolver
> (object ids are stored negated in order to distinguish them from offsets),
> and allowing the offset to be reset in PushBackInputStream. None of these
> changes alters the behavior of the current parser. Another requirement is the
> long offset patch (PDFBOX-1196), which is excluded from the patch set
> provided here.
> The provided parser currently works in a forceParsing=false mode, resulting
> in an IOException if a parsing error occurs. In most cases this shouldn't be
> a problem, since in my use cases exceptions typically occur when trying to
> parse unused content or streams, which are no longer a problem with this new
> parser. In my setup I use the new parser first and, if a parsing error
> occurs, fall back to the sequential parser (a bit like Acrobat does if the
> XREF information is buggy):
> try {
>     // ---- try first with (mostly) standard-conforming parsing
>     doc = PDDocument.loadNonSeq( PDF_FILE, raBuf );
>     handleDocument(doc);
> } catch ( IOException ioe ) {
>     // ---- retry with sequential parser and force parsing
>     doc = PDDocument.load( new FileInputStream(PDF_FILE), raBuf, true );
>     handleDocument(doc);
> }
> For me this new parser works very well on large document collections and is a
> large step towards parsing all documents that common PDF tools also accept.
> While its behavior is nearly 'conforming', there is nevertheless a need for a
> clean 'real' conforming parser. For instance, since the underlying object
> structure has no access to the parser, it is necessary to parse all objects
> before they can be used. This includes objects that might not be needed at
> all. Another step that is normally not needed is copying the content of a
> stream. Since we work on a file with random access there would be no need for
> it.
> However, this parser should fill the gap until a full-featured and clean
> conforming parser is available.
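
The description mentions an InputStream that also supports seek operations and caches the data it has read. As a rough illustration of that idea only, here is a minimal sketch. It is not the attached RandomAccessBufferedFileInputStream; the class name SeekableBufferedInputStream, the single-page cache and the 4096-byte page size are all assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

/**
 * Hypothetical sketch: an InputStream over a file that can be repositioned
 * with seek() and keeps the most recently read page in memory.
 * Not the class attached to this issue.
 */
final class SeekableBufferedInputStream extends InputStream {
    private static final int PAGE_SIZE = 4096; // assumed page size

    private final RandomAccessFile file;
    private final long length;                 // file length, read once
    private final byte[] page = new byte[PAGE_SIZE];
    private long pageStart = -1;               // file offset of page[0]; -1 = nothing cached
    private int pageLength = 0;                // number of valid bytes in page
    private long position = 0;                 // file offset of the next byte to return

    SeekableBufferedInputStream(File f) throws IOException {
        file = new RandomAccessFile(f, "r");
        length = file.length();
    }

    /** Moves the read position. The cached page is reused if it covers the new offset. */
    void seek(long newPosition) throws IOException {
        if (newPosition < 0) {
            throw new IOException("negative seek offset: " + newPosition);
        }
        position = newPosition;
    }

    long getPosition() {
        return position;
    }

    @Override
    public int read() throws IOException {
        if (position >= length) {
            return -1; // end of file
        }
        long wantedStart = position - (position % PAGE_SIZE);
        if (wantedStart != pageStart) {
            loadPage(wantedStart);
        }
        int index = (int) (position - pageStart);
        if (index >= pageLength) {
            return -1; // file was shorter than expected
        }
        position++;
        return page[index] & 0xFF;
    }

    /** Fills the page buffer starting at the given file offset. */
    private void loadPage(long start) throws IOException {
        file.seek(start);
        int filled = 0;
        while (filled < PAGE_SIZE) {
            int n = file.read(page, filled, PAGE_SIZE - filled);
            if (n < 0) {
                break; // reached end of file
            }
            filled += n;
        }
        pageStart = start;
        pageLength = filled;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

With something along these lines, a parser could call seek() with an offset taken from the XREF table and then hand the stream to existing InputStream-based parsing code. A real implementation would probably cache several pages and override read(byte[], int, int) for speed.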