Re: Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Andreas Lehmkühler Thu, 07 Aug 2014 02:59:21 -0700

Hi

> Martin Tappler <[email protected]> wrote on 7 August 2014 at 11:44:
>
>
> Hi,
>
> I am looking at PDF files at COS level and I found that the current
> implementation of the non-sequential parser does not provide support for
> hybrid cross references (which was discussed before in this mailing list).
>
> Look at the PDF file at [1], for instance. It contains a structure tree
> that is hidden in the hybrid-reference file (such an example is also
> described in the PDF Reference, section 3.4.7, under "Compatibility
> with applications that do not support PDF 1.5"). The root of the
> structure tree is object 28, generation 0. It is contained in an
> object stream that is referenced only from the cross-reference stream,
> which the current implementation does not parse.
>
> I used version 1.8.6 from the Maven repository and also the latest
> source from trunk to reproduce this behavior.
>
> However, I came up with a fix that works for me and should not
> break anything. After parsing the cross-reference table and the trailer,
> the trailer should be checked for an "XRefStm" entry. If this entry is
> present, the stream at the given offset should be parsed using
> parseXrefObjStream, but with the offset of the cross-reference table as
> argument (this ensures that the resolving process works as expected).
> Parsing the stream replaces the just-parsed information (table and
> trailer) in the XrefTrailerResolver, so that information should be
> saved in temporary variables beforehand. Once the stream has been
> parsed, the information it contains is updated with the saved trailer
> and cross-reference table information. According to the PDF spec, this
> should not be needed, but it makes the parsing more robust, since there
> might be files that store information in the table but not in the
> stream. This ensures that no information is lost.
>
> Please find patches for the fix attached. I hope they are useful.
Thanks for the contribution!


I haven't had a deeper look yet, but yes, I guess it'll be useful, as I
already stumbled upon that missing feature myself in conjunction with
PDFBOX-2250 [1].

That said, I'll take care of it.

> Best regards,
> Martin Tappler

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-2250
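
For readers following the thread, the merge step described in the quoted
message could be sketched roughly as below. This is only an illustration
under stated assumptions, not the attached patch and not PDFBox code: the
class HybridXrefSketch, the method merge, and the plain maps from object
number to byte offset are all hypothetical stand-ins for whatever the
parser really keeps. It also assumes one particular reading of "updated
with", namely that the stream entries form the base and the table only
fills in object numbers the stream does not cover.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HybridXrefSketch {

    /**
     * Hypothetical helper: combines the entries of a classic xref table with
     * those of the xref stream named by the trailer's XRefStm entry.
     * Both maps go from object number to byte offset (an assumed, simplified
     * representation; generation numbers and free entries are ignored).
     */
    static Map<Integer, Long> merge(Map<Integer, Long> tableEntries,
                                    Map<Integer, Long> streamEntries) {
        // Start from the stream, since it is what exposes the hidden objects.
        Map<Integer, Long> merged = new LinkedHashMap<>(streamEntries);
        // Then add table entries only for objects the stream does not list,
        // so nothing that exists solely in the table is lost.
        for (Map.Entry<Integer, Long> entry : tableEntries.entrySet()) {
            merged.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up offsets, purely for illustration.
        Map<Integer, Long> table = new LinkedHashMap<>();
        table.put(1, 15L);
        table.put(2, 120L);

        Map<Integer, Long> stream = new LinkedHashMap<>();
        stream.put(28, 4321L); // e.g. a structure tree root visible only via the stream

        System.out.println(merge(table, stream));
    }
}
```

Whether table entries should instead override stream entries is a design
choice the quoted description leaves open; the sketch picks the variant
that never discards a stream entry.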
