Non-sequential PDF parser + PATCH
---------------------------------

                 Key: PDFBOX-1199
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1199
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 1.6.0
            Reporter: Timo Boehme


Currently PDF parsing is done sequentially, which causes problems with stream 
parsing and with skipping unused content. The solution is a conforming parser 
that first reads the XREF tables, uses this information to parse only the 
required objects, and uses the length information for stream parsing. A 
completely new implementation of such a parser is currently being worked on in 
PDFBOX-1000. While that parser will be the long-term solution, a short-term 
solution based on existing code would be desirable. A first, incomplete 
solution was presented in PDFBOX-1104.
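
As an illustration of the first step a conforming parser takes (this is a 
minimal sketch, not code from the patch): the position of the XREF 
information is found by looking for the last 'startxref' keyword near the end 
of the file and reading the decimal offset that follows it. The class name 
StartXrefLocator, the method findStartXref and the 1024 byte tail size are 
assumptions made for this example only.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: finds the offset announced by the last "startxref" keyword. */
final class StartXrefLocator {
    // Assumption for this sketch: the keyword lies within the last 1024 bytes.
    private static final int TAIL_SIZE = 1024;
    private static final String KEYWORD = "startxref";

    /** Returns the offset following the last "startxref", or -1 if none is found. */
    static long findStartXref(byte[] pdf) {
        int from = Math.max(0, pdf.length - TAIL_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so indices stay aligned.
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);
        int kw = tail.lastIndexOf(KEYWORD);
        if (kw < 0) {
            return -1;
        }
        int i = kw + KEYWORD.length();
        // Skip whitespace between the keyword and the number.
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < tail.length() && tail.charAt(i) >= '0' && tail.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no digits after it
        }
        return Long.parseLong(tail.substring(start, i));
    }
}
```

From that offset the parser can read the XREF table or XREF stream and then 
jump directly to the objects it needs instead of scanning the whole file.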

Starting from PDFBOX-1104 I have implemented an 'as much as possible' 
conforming parser, called 'non-sequential parser', which handles all PDF 
documents (even inlined, with object streams etc.). The parser is a subclass 
of PDFParser and can be used as a drop-in replacement for it. It overrides the 
parse and getPage methods. The only restriction currently is that a file must 
be specified instead of an input stream. In order to read the file efficiently 
and use it with the existing object parsing code I developed a 
RandomAccessBufferedFileInputStream, which combines InputStream operations 
with seek operations and caches the data it has read.
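
To make the idea concrete, here is a minimal sketch of an InputStream with 
seek support and a page cache. It is not the RandomAccessBufferedFileInputStream 
from the patch; the class name SeekableCachedFileStream, the 4096 byte page 
size and the limit of 256 cached pages are assumptions chosen for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an InputStream over a file with seek() and an LRU page cache. */
class SeekableCachedFileStream extends InputStream {
    private static final int PAGE_SIZE = 4096; // assumed page size
    private static final int MAX_PAGES = 256;  // assumed cache capacity

    private final RandomAccessFile file;
    private final long length;
    private long position = 0;

    // Access-ordered map that evicts the least recently used page when full.
    private final Map<Long, byte[]> pages =
        new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > MAX_PAGES;
            }
        };

    SeekableCachedFileStream(File f) throws IOException {
        file = new RandomAccessFile(f, "r");
        length = file.length();
    }

    /** Moves the logical read position; the file itself is touched only on a cache miss. */
    void seek(long newPosition) {
        position = newPosition;
    }

    long getFilePointer() {
        return position;
    }

    /** Returns the cached page with the given index, loading it from disk if necessary. */
    private byte[] page(long index) throws IOException {
        byte[] p = pages.get(index);
        if (p == null) {
            long start = index * PAGE_SIZE;
            p = new byte[(int) Math.min(PAGE_SIZE, length - start)];
            file.seek(start);
            file.readFully(p);
            pages.put(index, p);
        }
        return p;
    }

    @Override
    public int read() throws IOException {
        if (position < 0 || position >= length) {
            return -1; // outside the file: behave like end of stream
        }
        byte[] p = page(position / PAGE_SIZE);
        int b = p[(int) (position % PAGE_SIZE)] & 0xFF;
        position++;
        return b;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Because the class extends InputStream, code that expects a plain stream keeps 
working, while a parser that knows object offsets can call seek before reading.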

In order to use NonSequentialPDFParser, small changes and additions to existing 
classes are needed. These include changing some methods/fields in PDFParser 
from private to protected, adding parsing of object stream information from 
XREF streams, storing this information in and retrieving it from 
XrefTrailerResolver (object ids are stored negated in order to distinguish 
them from offsets) and allowing the offset in PushBackInputStream to be reset. 
None of these changes alters the behavior of the current parser. Another 
requirement is the long offset patch (PDFBOX-1196), which is not included in 
the patch set provided here.
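
The negation trick can be sketched as follows. This is an illustration only, 
not the actual XrefTrailerResolver code: the class XrefIndex, its method names 
and the use of a plain object number as key are assumptions made for the 
example. A non-negative value is a byte offset in the file; a negative value 
is the negated number of the object stream that contains the object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an XREF index that stores object stream ids negated. */
class XrefIndex {
    // object number -> file offset (>= 0) or negated object stream number (< 0)
    private final Map<Long, Long> entries = new HashMap<Long, Long>();

    /** Registers an object that is stored directly in the file at the given offset. */
    void putOffset(long objNr, long fileOffset) {
        entries.put(objNr, fileOffset);
    }

    /** Registers an object that lives inside the object stream with the given number. */
    void putInObjectStream(long objNr, long objStreamNr) {
        entries.put(objNr, -objStreamNr);
    }

    /** True if the object is known and stored inside an object stream. */
    boolean isCompressed(long objNr) {
        Long v = entries.get(objNr);
        return v != null && v < 0;
    }

    /** Returns the file offset of a directly stored object. */
    long offsetOf(long objNr) {
        Long v = entries.get(objNr);
        if (v == null || v < 0) {
            throw new IllegalStateException("object " + objNr + " has no file offset");
        }
        return v;
    }

    /** Returns the number of the object stream that contains the object. */
    long objectStreamOf(long objNr) {
        Long v = entries.get(objNr);
        if (v == null || v >= 0) {
            throw new IllegalStateException("object " + objNr + " is not in an object stream");
        }
        return -v;
    }
}
```

Keeping both kinds of entries in one map means the lookup code that already 
deals with offsets needs only a sign check to find out which case applies.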

The provided parser currently works in a forceParsing=false mode, so a parsing 
error results in an IOException. In most cases this shouldn't be a problem: in 
my use cases exceptions typically occur while parsing unused content or 
streams, and the new parser no longer has trouble with either. In my setup I 
use the new parser first and, if a parsing error occurs, fall back to the 
sequential parser (a bit like Acrobat does if the XREF information is buggy):

try {
    // ---- first try (mostly) standard-conforming parsing
    doc = PDDocument.loadNonSeq( PDF_FILE, raBuf );
    handleDocument(doc);
} catch ( IOException ioe ) {
    // ---- retry with the sequential parser and forced parsing
    doc = PDDocument.load( new FileInputStream(PDF_FILE), raBuf, true );
    handleDocument(doc);
}

For me this new parser works very well on large document collections and is a 
large step towards parsing all documents that common PDF tools accept. While 
its behavior is nearly conforming, there is still a need for a clean, 'real' 
conforming parser. For instance, since the underlying object structure has no 
access to the parser, all objects must be parsed before they can be used, 
including objects that might not be needed at all. Another step that should 
not be necessary is copying the content of a stream: since we work on a file 
with random access, there would be no need for it. However, this parser should 
fill the gap until a full-featured and clean conforming parser is available.


        
