PDFBOX-2523 still present (or variation of it still present)

Andreas Lehmkühler Mon, 16 Feb 2015 03:37:04 -0800

Hi,

> Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
> 
> 
> 
> Hi Tilman and Andreas--
Please don't contact developers directly; use our mailing lists instead. I've
put the users list back in the loop...


> I am working with Krasimir on this issue.
> 
> Although we asked, we were denied permission to send the document out.
:-(

> The failure is being triggered when we attempt to use the Encrypt() class to
> password protect the pdf.
> We end up with the "Expected a long type at offset 113884174, instead got
> 'xref'" failure.
> 
> I have debugged into the PDFBox code and found the offending parts.
> 
> PdfBox is trying to parse an xref table located at 113884174.
> 
> The problem we are seeing is that inside the trailer it finds the /XRefStm
> label, and its offset value is returned as 112085940 (which is what is given
> in the file).
> However, the checkXRefOffset() call made to verify it doesn't find the xref
> stream there, so it goes searching and ends up returning the closest xref
> offset it can find, which happens to be its own offset at
> 113884174.
> 
> 
> I believe that there is an error in PdfBox with respect to this fixup logic,
> even if it had found the 'correct' xref stream.
> That is because the fixup offset can NEVER work.  Every time it fixes up the
> location, it lands on a section which begins with "xref".
> The next call is to skip the whitespace, but since there is never any there
> (it's already proven to be 'xref'),  it does not advance the input stream. 
> Then, the first call to parse that xrefstm always calls readObjectID(), which
> always will throw the exception because the bytes are always 'xref'.
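To make the failure mode described above concrete: before handing a (possibly fixed-up) /XRefStm offset to the stream parser, one could first check what kind of data actually sits there. The following is only a rough sketch under my own assumptions. The class XrefTarget and its method classify() are hypothetical and not part of PDFBox, and it works on an in-memory byte array instead of PDFBox's input source.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox. */
final class XrefTarget {
    enum Kind { TABLE, OBJECT_HEADER, UNKNOWN }

    /** Classifies what the bytes at the given offset look like. */
    static Kind classify(byte[] data, int offset) {
        if (offset < 0 || offset >= data.length) {
            return Kind.UNKNOWN;
        }
        int pos = skipWhitespace(data, offset);
        // A classic xref table starts with the keyword "xref";
        // feeding this position to an xref stream parser cannot work.
        if (startsWith(data, pos, "xref")) {
            return Kind.TABLE;
        }
        // An xref stream is an ordinary object: "<number> <generation> obj".
        int afterNum = skipDigits(data, pos);
        int afterGap1 = skipWhitespace(data, afterNum);
        int afterGen = skipDigits(data, afterGap1);
        int afterGap2 = skipWhitespace(data, afterGen);
        boolean shapeOk = afterNum > pos && afterGap1 > afterNum
                && afterGen > afterGap1 && afterGap2 > afterGen;
        return shapeOk && startsWith(data, afterGap2, "obj")
                ? Kind.OBJECT_HEADER : Kind.UNKNOWN;
    }

    private static int skipWhitespace(byte[] d, int i) {
        while (i < d.length && isPdfWhitespace(d[i])) {
            i++;
        }
        return i;
    }

    private static int skipDigits(byte[] d, int i) {
        while (i < d.length && d[i] >= '0' && d[i] <= '9') {
            i++;
        }
        return i;
    }

    // The six whitespace characters defined by the PDF syntax.
    private static boolean isPdfWhitespace(byte b) {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    private static boolean startsWith(byte[] d, int i, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        if (i + k.length > d.length) {
            return false;
        }
        for (int j = 0; j < k.length; j++) {
            if (d[i + j] != k[j]) {
                return false;
            }
        }
        return true;
    }
}
```

With a check like this, a caller could skip the xref stream parsing whenever the result is TABLE instead of running into the exception. Whether skipping is the right recovery for a given file is a separate question.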
> 
> So, my questions are:
> 
> 1) Are these docs fixable or are they truly corrupt?
Without having the PDF itself at hand, it's hard to give a 100% answer. But I
guess there has to be a fix, as Adobe is able to open that PDF. I'll try to find
one, following your description of the PDF.

> 2) Is this xref issue a known issue with PdfBox?  I would try to create a
> document that displays the error but I honestly don't know how to do so (beyond
> sending the ones that we have that DO display it).
Not until now

> 3) Do you have any idea how these documents end up in this state if they are
> being edited by tools such as InDesign, Acrobat, etc? Is there something I can
> do to identify them?  
There are a lot of more or less corrupt files in the wild. Those are created
using different tools.
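
As a rough way to at least flag files of the same kind as yours (a trailer that carries /XRefStm), one could scan the trailer bytes for that key. This is only a sketch: the class HybridTrailerCheck and its method are made up for illustration and are not PDFBox API, and it assumes you have already read the trailer section of the file into a byte array. It does not tell you whether the file is corrupt, only that the key is there and what offset it names.

```java
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox. */
final class HybridTrailerCheck {
    // "/XRefStm" followed by whitespace and an integer offset.
    // The digit count is capped so Long.parseLong cannot overflow.
    private static final Pattern XREF_STM = Pattern.compile("/XRefStm\\s+(\\d{1,18})");

    /** Returns the /XRefStm offset named in the trailer bytes, if any. */
    static OptionalLong findXRefStmOffset(byte[] trailerBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String text = new String(trailerBytes, StandardCharsets.ISO_8859_1);
        Matcher m = XREF_STM.matcher(text);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }
}
```

For the trailer quoted further down in this mail, this would return 112085940, which can then be compared with what is really stored at that position in the file.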

> 4) If this is a truly corrupted document, why would Acrobat be able to open
> these files but pdfBox cannot?  Are these streams somehow ignorable?  I ask
> this because I saw this statement on a web page
>  (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/) when
> I was searching for answers on this:
Adobe implements a lot of self-healing mechanisms to repair broken files, and we
try to do so too, but it's complicated.

> – /XrefStm [integer]: specifies the offset from the beginning of the file to
> the cross-reference stream in the decoded stream. This is only present in
> hybrid-reference files, which is specified if we would also like to open
> documents even if the applications  don’t support compressed reference
> streams.
> 
> Any light you can shed on this is appreciated.
> 
> Thanks-
> Steve
> 
> 
> See below for the pertinent data and the code which is marked with the values
> as I traced through.
> 
> I have confirmed that the byte offset of the word xref below is exactly at
> 113884174.

Does the xref stream start at 112085940 (the stream offset from the trailer
dictionary)? If not, what did you find at that offset?
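
Since the document can't be shared, a small stand-alone tool may be the easiest way to look at that offset. A minimal sketch using only the JDK; the class name OffsetPeek and the file name in the usage note are placeholders of mine, not anything from PDFBox.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper, not part of PDFBox. */
final class OffsetPeek {
    /**
     * Returns up to {@code length} bytes starting at {@code offset} as text,
     * with every non-printable byte shown as '.'.
     */
    static String peek(String path, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (length <= 0 || offset < 0 || offset >= raf.length()) {
                return "";
            }
            raf.seek(offset);
            // Never read past the end of the file.
            byte[] buf = new byte[(int) Math.min(length, raf.length() - offset)];
            raf.readFully(buf);
            StringBuilder sb = new StringBuilder(buf.length);
            for (byte b : buf) {
                int c = b & 0xFF;
                sb.append(c >= 0x20 && c < 0x7F ? (char) c : '.');
            }
            return sb.toString();
        }
    }
}
```

Something like OffsetPeek.peek("document.pdf", 112085940L, 32) would show whether an object header ("<number> <generation> obj") starts there or something else entirely.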


> xref
> 0 53641
> 0000000000 65535 f
> 0000000017 00000 n
> 
> <massive snip/>
> 
> 
> trailer
> <<
> /Size 53641
> /Root 1 0 R
> /XRefStm 112085940
> /Info 8 0 R
> /ID [<19790A83488211E283B50017F203355C> <E3DF7097A16969B08238787F19E7E219>]
> >>
> startxref
> 113884174
> %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> endobj
> 
> 
>     protected COSDictionary parseXref(long startXRefOffset) throws IOException
>     {
>         pdfSource.seek(startXRefOffset);
>         long startXrefOffset = parseStartXref();
>         // check the startxref offset
>         long fixedOffset = checkXRefOffset(startXrefOffset);
>         if (fixedOffset > -1)
>         {
>             startXrefOffset = fixedOffset;
>         }
>         document.setStartXref(startXrefOffset);
>         long prev = startXrefOffset;
>         // ---- parse whole chain of xref tables/object streams using PREV
> reference
>         while (prev > -1)  <== prev here is 113884174.
>         {
>             // seek to xref table
>             pdfSource.seek(prev);
> 
>             // skip white spaces
>             skipSpaces();
>             // -- parse xref
>             if (pdfSource.peek() == X)
>             {
>                 // xref table and trailer
>                 // use existing parser to parse xref table
>                 parseXrefTable(prev);
>                 // parse the last trailer.
>                 trailerOffset = pdfSource.getOffset();
>                 // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
>                 while (isLenient && pdfSource.peek() != 't')
>                 {
>                     if (pdfSource.getOffset() == trailerOffset)
>                     {
>                         // warn only the first time
>                         LOG.warn("Expected trailer object at position " +
> trailerOffset
>                                 + ", keep trying");
>                     }
>                     readLine();
>                 }
>                 if (!parseTrailer())
>                 {
>                     throw new IOException("Expected trailer object at
> position: "
>                             + pdfSource.getOffset());
>                 }
>                 COSDictionary trailer =
> xrefTrailerResolver.getCurrentTrailer();
>                 // check for a XRef stream, it may contain some object ids of
> compressed objects
>                 if(trailer.containsKey(COSName.XREF_STM))  <== YES - but falue
>                 {
>                     int streamOffset = trailer.getInt(COSName.XREF_STM);  <==
> This returns 112085940, which is the value from the trailer
>                     // check the xref stream reference
>                     fixedOffset = checkXRefOffset(streamOffset);          <==
> checks it and returns 113884174 instead
>                     if (fixedOffset > -1 && fixedOffset != streamOffset)
>                     {
>                         streamOffset = (int)fixedOffset;
>                         trailer.setInt(COSName.XREF_STM, streamOffset);
>                     }
>                     pdfSource.seek(streamOffset);  <== Seeks to 113884174
>                     //readExpectedString(XREF_TABLE, false); 
>                     skipSpaces();    <===      It's ON "xref", so it doesn't
> skip anything
>                     parseXrefObjStream(prev, false); <== goes in here, first
> thing it tries to do is readObjectNumber(), which can't work because it's
> 'xref' -- BOOM
>                 }
>                 prev = trailer.getInt(COSName.PREV);
>                 if (prev > -1)
>                 {
>                     // check the xref table reference
>                     fixedOffset = checkXRefOffset(prev);
>                     if (fixedOffset > -1 && fixedOffset != prev)
>                     {
>                         prev = fixedOffset;
>                         trailer.setLong(COSName.PREV, prev);
>                     }
>                 }
>             }
>             else
>             {
>                 // parse xref stream
>                 prev = parseXrefObjStream(prev, true);
>                 if (prev > -1)
>                 {
>                     // check the xref table reference
>                     fixedOffset = checkXRefOffset(prev);
>                     if (fixedOffset > -1 && fixedOffset != prev)
>                     {
>                         prev = fixedOffset;
>                         COSDictionary trailer =
> xrefTrailerResolver.getCurrentTrailer();
>                         trailer.setLong(COSName.PREV, prev);
>                     }
>                 }
>             }
>         }
>         // ---- build valid xrefs out of the xref chain
>         xrefTrailerResolver.setStartxref(startXrefOffset);
>         COSDictionary trailer = xrefTrailerResolver.getTrailer();
>         document.setTrailer(trailer);
>         document.setIsXRefStream(XRefType.STREAM ==
> xrefTrailerResolver.getXrefType());
>         // check the offsets of all referenced objects
>         checkXrefOffsets();
>         // copy xref table
>         document.addXRefTable(xrefTrailerResolver.getXrefTable());
>         return trailer;
>     }


BR
Andreas Lehmkühler

