Hi,

> Martin Tappler <[email protected]> wrote on 7 August 2014 at 11:44:
>
>
> Hi,
>
> I am looking at PDF files at COS level and I found that the current
> implementation of the non-sequential parser does not provide support for
> hybrid cross references (which was discussed before in this mailing list).
>
> Look at the PDF file at [1] for instance. It contains a structure tree,
> which is hidden in the hybrid-reference file (actually such an example
> is also described in the PDF reference section 3.4.7. under
> "Compatibility with applications that do not support PDF 1.5."). The
> root of the structure tree is the object with object number 28 and
> generation number 0 and is contained in an object stream, which is only
> referenced in the cross reference stream, which is not parsed by the
> current implementation.
>
> I used version 1.8.6 from the Maven repository and also the latest
> source version from the trunk to reproduce this behavior.
>
> However, I came up with a fix which works for me and which should not
> break anything. After parsing the cross reference table and the trailer,
> the trailer should be checked for an "XrefStm" entry. If this entry is
> present, the stream at the given offset should be parsed using
> parseXrefObjStream, but with the offset of the cross reference table as
> argument (this is done to ensure that the resolving process works as
> expected). This replaces the recently parsed information (table and
> trailer) in the XrefTrailerResolver, which should be stored in temporary
> variables. After this is done, the information contained in the cross
> reference stream is updated with the old trailer and the cross reference
> table information. According to the PDF spec, this should not be needed,
> but makes the parsing more robust, since there might be files which
> store information in the table, but not in the stream. So this ensures
> that no information is lost.
>
> Please find patches for the fix attached.
> I hope they are useful.

Thanks for the contribution!
I haven't had a deeper look yet, but yes, I guess they'll be useful, as I
already stumbled upon that missing feature in conjunction with
PDFBOX-2250 [1]. That said, I'll take care of it.

> Best regards,
> Martin Tappler

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-2250
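
For readers following the thread, the merge step described in the quoted
mail could look roughly like the sketch below. It is not the attached patch
and it does not use any PDFBox classes: HybridXrefSketch, XrefSection and
merge are hypothetical names chosen for illustration, entries are keyed by
object number only, compressed-object entries are simplified to plain
offsets, and all offsets are made up. It also assumes that table entries
simply overwrite stream entries for the same object number.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HybridXrefSketch {

    /** Hypothetical stand-in for one parsed xref section (not a PDFBox type). */
    static final class XrefSection {
        // object number -> byte offset (simplified; generation numbers omitted)
        final Map<Integer, Long> offsets = new LinkedHashMap<>();
        // trailer dictionary entries, e.g. "XRefStm" -> offset of the xref stream
        final Map<String, Object> trailer = new LinkedHashMap<>();
    }

    /**
     * Combines a classic xref table section with the xref stream section
     * that its trailer points to. The stream is used as the base and the
     * table and its trailer are applied on top, so that entries present in
     * only one of the two sources are kept.
     */
    static XrefSection merge(XrefSection table, XrefSection stream) {
        if (stream == null) {
            return table; // no xref stream referenced: nothing to merge
        }
        XrefSection merged = new XrefSection();
        merged.offsets.putAll(stream.offsets);
        merged.trailer.putAll(stream.trailer);
        // Assumption: table information overwrites stream information.
        merged.offsets.putAll(table.offsets);
        merged.trailer.putAll(table.trailer);
        return merged;
    }

    public static void main(String[] args) {
        XrefSection table = new XrefSection();
        table.offsets.put(1, 15L);
        table.trailer.put("XRefStm", 900L);

        XrefSection stream = new XrefSection();
        stream.offsets.put(28, 5000L); // object only listed in the stream

        XrefSection merged = merge(table, stream);
        System.out.println(merged.offsets);
        System.out.println(merged.trailer);
    }
}
```

With this ordering, object 28 from the quoted example stays reachable even
though only the cross reference stream lists it, while anything stored only
in the table is kept as well.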
