[jira] [Commented] (PDFBOX-1199) Non-sequential PDF parser + PATCH

Maruan Sahyoun (Commented) (JIRA) Mon, 02 Jan 2012 06:16:58 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178405#comment-13178405
 ]


Maruan Sahyoun commented on PDFBOX-1199:
----------------------------------------

I had a quick look at the changes and I think this is a very good 
step forward. The new xref parsing should resolve many current issues, 
as should many of the other changes. As I'm currently working on 
PDFBOX-1000, maybe we could have a quick chat about how to combine our efforts.
                
> Non-sequential PDF parser + PATCH
> ---------------------------------
>
>                 Key: PDFBOX-1199
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1199
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>         Attachments: 2012-01-02_NonSequentialParser.patch, 
> NonSequentialPDFParser.java, RandomAccessBufferedFileInputStream.java
>
>
> Currently PDF parsing is done in a sequential manner, which causes problems 
> with stream parsing and with skipping unused content. The solution is a 
> conforming parser which first reads the XREF tables, uses this information 
> to parse only the required objects, and uses length information for stream 
> parsing. A completely new implementation of such a parser is currently being 
> worked on in PDFBOX-1000. While that parser will be the long-term solution, 
> a short-term solution based on existing code would be desirable. A first, 
> incomplete solution was presented in PDFBOX-1104.
> Starting from PDFBOX-1104, I have implemented an 'as much as possible' 
> conforming parser, called the 'non-sequential parser', which handles all PDF 
> documents (even inlined, with object streams etc.). The parser is a subclass 
> of PDFParser and can be used as a drop-in replacement for it. It overrides 
> the parse and getPage methods. The only restriction currently is that a file 
> must be specified instead of an input stream. In order to read the file 
> efficiently and use it with the existing object parsing code, I developed a 
> RandomAccessBufferedFileInputStream, which allows InputStream operations in 
> combination with seek operations and caches the data read.
> In order to use NonSequentialPDFParser, small changes and additions to 
> existing classes are needed. These include changing some methods/fields from 
> private to protected in PDFParser, adding parsing of stream object 
> information from XREF streams, storing this information in and retrieving it 
> from XrefTrailerResolver (object ids are stored negated in order to 
> distinguish them from offsets), and allowing the offset to be reset in 
> PushBackInputStream. None of these changes alters the behavior of the 
> current parser. Another requirement is the long offset patch (PDFBOX-1196), 
> which is not included in the patch set provided here.
> The provided parser currently works in a forceParsing=false mode, so an 
> IOException is thrown if a parsing error occurs. In most cases this 
> shouldn't be a problem: in my use cases exceptions typically occur when 
> trying to parse unused content or streams, which are no longer a problem 
> with the new parser. In my setup I use the new parser first and, if a 
> parsing error occurs, fall back to the sequential parser (a bit like Acrobat 
> does if the XREF information is buggy):
> try {
>     // ---- try first with (mostly) standard-conforming parsing
>     doc = PDDocument.loadNonSeq( PDF_FILE, raBuf );
>     handleDocument(doc);
> } catch ( IOException ioe ) {
>     // ---- retry with sequential parser and force parsing
>     doc = PDDocument.load( new FileInputStream(PDF_FILE), raBuf, true );
>     handleDocument(doc);
> }
> For me this new parser works very well on large document collections and is a 
> large step towards parsing all documents that common PDF tools also accept. 
> While its behavior is nearly 'conforming', there is nevertheless a need for a 
> clean, 'real' conforming parser. For instance, since the underlying object 
> structure has no access to the parser, it is necessary to parse all objects 
> before they can be used. This includes objects that might not be needed at 
> all. Another step that would normally be unnecessary is copying the content 
> of a stream: since we work on a file with random access, there would be no 
> need for it. However, this parser should fill the gap until a full-featured 
> and clean conforming parser is available.
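
The description above mentions that object ids are stored negated in order to distinguish them from offsets. A minimal sketch of that idea follows. It assumes the negated value is the number of the object stream containing a compressed object. The class XrefEntrySketch and all of its methods are hypothetical names chosen for illustration; they are not the XrefTrailerResolver API and are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration only, not PDFBox code.
 * One map holds both kinds of xref entries:
 *   value > 0 : byte offset of the object in the file
 *   value < 0 : negated number of the object stream containing the object
 */
class XrefEntrySketch {
    private final Map<Long, Long> entries = new HashMap<Long, Long>();

    /** Registers an object stored directly in the file at the given byte offset. */
    void putOffset(long objNr, long byteOffset) {
        if (byteOffset <= 0) {
            throw new IllegalArgumentException("offset must be positive");
        }
        entries.put(objNr, byteOffset);
    }

    /** Registers an object stored inside an object stream; the stream number is negated. */
    void putInObjectStream(long objNr, long objStreamNr) {
        if (objStreamNr <= 0) {
            throw new IllegalArgumentException("object stream number must be positive");
        }
        entries.put(objNr, -objStreamNr);
    }

    /** True if the object is known and lives inside an object stream. */
    boolean isInObjectStream(long objNr) {
        Long value = entries.get(objNr);
        return value != null && value < 0;
    }

    /** Byte offset of a directly stored object. */
    long byteOffset(long objNr) {
        Long value = entries.get(objNr);
        if (value == null || value < 0) {
            throw new IllegalStateException("object " + objNr + " has no direct offset");
        }
        return value;
    }

    /** Number of the object stream that contains the given object. */
    long objectStreamNumber(long objNr) {
        Long value = entries.get(objNr);
        if (value == null || value >= 0) {
            throw new IllegalStateException("object " + objNr + " is not in an object stream");
        }
        return -value;
    }
}
```

Because valid byte offsets and object numbers are both positive, the sign alone tells the two cases apart, so a single map can hold both kinds of entry. A parser using this sketch would seek to byteOffset(n) for direct objects, and would first load the object stream given by objectStreamNumber(n) for compressed ones.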

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
