PDFBOX-2523 still present (or variation of it still present)

Andreas Lehmkühler Tue, 24 Feb 2015 03:32:00 -0800

Hi Steve,

> Steve Antoch <[email protected]> wrote on 23 February 2015 at 19:42:
> 
> 
> @Andreas-
> 
> I have downloaded the latest trunk and came close (it got much further) before
> failing.
> However, I think I may have a fix for that failure:
Thanks for testing.


> The code is returning 0 when the xrefstm fixedOffset is not found.  However,
> the code still tries to load and parse from xref 0, resulting in a null
> reference exception later in parser.parse().
Your analysis is correct, but I hope that my latest improvements eliminate
such cases; see PDFBOX-2572 for details. Could you give the latest trunk
(r1661747) a try?

> However, thinking about this, I came up with this:
> 
>                 // check for a XRef stream, it may contain some object ids of compressed objects
>                 if(trailer.containsKey(COSName.XREF_STM))
>                 {
>                     int streamOffset = trailer.getInt(COSName.XREF_STM);
>                     // check the xref stream reference
>                     fixedOffset = checkXRefStreamOffset(streamOffset, false);
>                     // <== fixedOffset comes back as 0 => not found
>                     if (fixedOffset > -1 && fixedOffset != streamOffset)
>                     {
>                         streamOffset = (int)fixedOffset;
>                         // <== streamOffset gets set to 0 here
>                         trailer.setInt(COSName.XREF_STM, streamOffset);
>                     }
>
>                     // <== I added this test because an xref stream starting at
>                     //     offset 0 can never happen, so we should simply skip it
>                     if (streamOffset > 0)
>                     {
>                         pdfSource.seek(streamOffset);
>                         skipSpaces();
>                         // <== this call ultimately throws a null ref exception if streamOffset == 0 on entry
>                         parseXrefObjStream(prev, false);
>                     }
>                 }
> 
> Adding that, the file successfully parses.
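
The decision made in the quoted snippet can be isolated into a small,
self-contained sketch. The class and method names are hypothetical and not
part of PDFBox. It assumes, as described in this thread, that the offset check
returns 0 when no xref stream is found, and that a negative result means no
correction should be applied.

```java
import java.util.OptionalLong;

final class XRefStmGuard {
    /**
     * Hypothetical helper, not PDFBox code.
     *
     * @param declared the /XRefStm value read from the trailer
     * @param fixed    the result of the offset check; assumed to be 0 when
     *                 nothing was found and negative when no correction applies
     * @return the offset to parse, or empty if the xref stream should be skipped
     */
    static OptionalLong offsetToParse(long declared, long fixed) {
        long effective = declared;
        if (fixed > -1 && fixed != declared) {
            // the check proposed a different offset (possibly 0 = "not found")
            effective = fixed;
        }
        // an xref stream can never start at offset 0, so skip it in that case
        return effective > 0 ? OptionalLong.of(effective) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // values taken from the thread: declared offset 112085940, check returns 0
        System.out.println(offsetToParse(112085940L, 0L).isPresent());
        // no correction: the declared offset is used unchanged
        System.out.println(offsetToParse(112085940L, -1L).getAsLong());
    }
}
```

Returning an empty value instead of 0 keeps the "skip it" case explicit for the
caller, which is the same effect as the added `streamOffset > 0` test.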
> 
> Also, there is this proposal that I put up on GitHub, in a repo forked
> directly from pdfbox (it is the only change).
> It relaxes the looping a bit to allow limited recursion.  I would appreciate
> your thoughts on it.
Is this change related to the issue discussed above?

> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
> 
> Thank you so much!  You have been tremendously helpful.  I wish I could have
> given you the files, but unfortunately, they are proprietary and we cannot
> release them.  :-(
No need to worry, you are not the only one who is not allowed to share a
specific pdf.

> Best regards-
> Steve

BR
Andreas Lehmkühler

> 
> ________________________________________
> From: Andreas Lehmkühler <[email protected]>
> Sent: Monday, February 23, 2015 3:43 AM
> To: [email protected]
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
> 
> Hi,
> 
> I've improved the self-repair mechanism of the trunk based on Steve's report.
> 
> @Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
> still persist?
> 
> BR
> Andreas Lehmkühler
> 
> > Steve Antoch <[email protected]> wrote on 17 February 2015 at 00:05:
> >
> >
> >
> > Andreas-
> > Thanks for the response.
> > Sorry for sending directly.
> >
> > Yes, it tries to read from offset 112085940, but does not find the xrefstm
> > there, so that's when it goes searching.  It seems to be landing in the
> > middle of something else (perhaps an image?)
> >
> > I tried running the preflight command on the file, and this is what it found
> > there.
> > This is in the middle of a whole series of repetitive byte patterns like
> > these, which are interspersed with other sections of content that are also
> > binary only.
> >
> > <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> > <preflight name="file.pdf">
> >   <executionTimeMS>2646</executionTimeMS>
> >   <isValid type="">false</isValid>
> >   <errors count="1">
> >     <error count="1">
> >       <code>1.0</code>
> >       <details>Syntax error, Error: Expected a long type at offset 112085940, instead got
> > '6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ±¯Ó"z·C&#156;3Í}&#14;y&#11;ó&#3;£g&#130;?1º·Ó&#158;-ó&#143;VÏ:ë½NsË&#142;¸&#31;6lÙ³fÅ#ë&#147;&#29;&#31;¨Î÷å.£=&#137;ù}ÕsÞÿ'</details>
> >     </error>
> >   </errors>
> > </preflight>
> >
> > The patterns seem to be:
> >
> > lots of these: 6lÙ³fÍ&#155;
> > interspersed between blocks that are similar to this:
> > ±¯Ó"z·C&#156;3Í}&#14;y&#11;ó&#3;£g&#130;?1º·Ó&#158;-ó&#143;VÏ:ë½NsË&#142;¸&#31;6lÙ³fÅ#ë&#147;&#29;&#31;¨Î÷å.£=&#137;ù}ÕsÞÿ'
> >
> > It just so happens that the offset 112085940 falls right in the middle of a
> > big block of those 6lÙ³fÍ&#155; repetitive blocks.
> >
> > Not sure if that's any help.
> >
> > Steve
> >
> > ________________________________________
> > From: Andreas Lehmkühler <[email protected]>
> > Sent: Monday, February 16, 2015 3:34 AM
> > To: [email protected]
> > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> > (or variation of it still present)
> >
> > Hi,
> >
> > > Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
> > >
> > >
> > >
> > > Hi Tilman and Andreas--
> > Please don't contact developers directly, use our mailing lists instead.
> > I've put the users list back in the loop...
> >
> > > I am working with Krasimir on this issue.
> > >
> > > Although we asked, we were denied permission to send the document out.
> > :-(
> >
> > > The failure is being triggered when we attempt to use the Encrypt() class
> > > to
> > > password protect the pdf.
> > > We end up with the "Expected a long type at offset 113884174, instead got
> > > 'xref'" failure.
> > >
> > > I have debugged into the PDFBox code and found the offending parts.
> > >
> > > PdfBox is  trying to parse an xref table located at 113884174.
> > >
> > > The problem we are seeing is that inside the trailer it finds the /XRefStm
> > > label, and its offset value is returned as 112085940 (which is what is given
> > > in the file).
> > > However, the checkXRefOffset() call made to verify it doesn't find the xref
> > > stream there, so it goes searching and ends up returning the closest xref
> > > offset it can find, which happens to be the table's own offset at
> > > 113884174.
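
A minimal sketch of why a nearest-offset search can hand back the very table
that is being parsed. The names are hypothetical and this is not the PDFBox
implementation. It assumes the search only knows the offsets at which an xref
section was found, and it adds an exclusion parameter as one possible guard.

```java
import java.util.Arrays;
import java.util.List;

final class ClosestXrefSearch {
    /**
     * Hypothetical helper, not PDFBox code.
     *
     * @param candidates offsets at which an xref section was found
     * @param wanted     the offset declared in the file (e.g. /XRefStm)
     * @param skip       an offset to ignore, or -1 to ignore nothing
     * @return the candidate closest to wanted, or -1 if none remains
     */
    static long closest(List<Long> candidates, long wanted, long skip) {
        long best = -1;
        long bestDistance = Long.MAX_VALUE;
        for (long candidate : candidates) {
            if (candidate == skip) {
                continue; // never return the section we are already parsing
            }
            long distance = Math.abs(candidate - wanted);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // assumed candidate list: only the xref table at 113884174 is found
        List<Long> found = Arrays.asList(113884174L);
        // without exclusion the search lands on the table itself
        System.out.println(closest(found, 112085940L, -1L));
        // excluding the current table leaves no usable candidate
        System.out.println(closest(found, 112085940L, 113884174L));
    }
}
```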
> > >
> > >
> > > I believe that there is an error in PdfBox with respect to this fixup logic,
> > > even if it had found the 'correct' xref stream.
> > > That is because the fixup offset can NEVER work.  Every time it fixes up the
> > > location, it lands on a section which begins with "xref".
> > > The next call is to skip the whitespace, but since there is never any there
> > > (it's already proven to be 'xref'), it does not advance the input stream.
> > > Then, the first call to parse that xrefstm always calls readObjectID(), which
> > > will always throw the exception because the bytes are always 'xref'.
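
The mismatch described here (an xref stream parser pointed at the keyword
"xref") can be detected before parsing by looking at the bytes at the target
offset. The following is a sketch with hypothetical names, not PDFBox code. It
assumes an xref stream starts with an indirect object header of the form
"<number> <generation> obj", while a classic table starts with "xref".

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class XRefTargetProbe {
    enum Kind { XREF_TABLE, INDIRECT_OBJECT, OTHER }

    // "<object number> <generation> obj", the header an xref stream object starts with
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");

    /** Classifies what the file contains at the given offset (hypothetical helper). */
    static Kind probe(byte[] file, int offset) {
        if (offset < 0 || offset >= file.length) {
            return Kind.OTHER;
        }
        // a short window is enough to recognise either header
        int end = Math.min(file.length, offset + 32);
        String head = new String(file, offset, end - offset, StandardCharsets.ISO_8859_1);
        if (head.startsWith("xref")) {
            return Kind.XREF_TABLE;      // a stream parser would fail here
        }
        if (OBJ_HEADER.matcher(head).lookingAt()) {
            return Kind.INDIRECT_OBJECT; // plausible start of an xref stream
        }
        return Kind.OTHER;               // e.g. binary image data
    }

    public static void main(String[] args) {
        byte[] table = "xref\n0 53641\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(probe(table, 0));
    }
}
```

A caller could skip the xref stream whenever the probe does not report
`INDIRECT_OBJECT`, instead of handing the offset to the stream parser.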
> > >
> > > So, my questions are:
> > >
> > > 1) Are these docs fixable or are they truly corrupt?
> > Without having the pdf itself at hand it's hard to give a 100% answer. But I
> > guess there has to be a fix, as Adobe is able to open that pdf. I'll try to
> > find one, following your description of the pdf.
> >
> > > 2) Is this xref issue a known issue with PdfBox?  I would try to create a
> > > document that displays the error but I honestly don't know how to do so
> > > (beyond sending the ones that we have that DO display it).
> > Not until now
> >
> > > 3) Do you have any idea how these documents end up in this state if they
> > > are
> > > being edited by tools such as InDesign, Acrobat, etc? Is there something I
> > > can
> > > do to identify them?
> > There are a lot of more or less corrupt files in the wild. Those are created
> > using different tools.
> >
> > > 4) If this is a truly corrupted document, why would Acrobat be able to
> > > open
> > > these files but pdfBox cannot?  Are these streams somehow ignorable?  I
> > > ask
> > > this because I saw this statement on a web page
> > >  (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > > when
> > > I was searching for answers on this:
> > Adobe implements a lot of self-healing mechanisms to repair broken files and
> > we try to do so too, but it's complicated.
> >
> > > – /XrefStm [integer]: specifies the offset from the beginning of the file
> > > to
> > > the cross-reference stream in the decoded stream. This is only present in
> > > hybrid-reference files, which is specified if we would also like to open
> > > documents even if the applications  don’t support compressed reference
> > > streams.
> > >
> > > Any light you can shed on this is appreciated.
> > >
> > > Thanks-
> > > Steve
> > >
> > >
> > > See below for the pertinent data and the code which is marked with the
> > > values
> > > as I traced through.
> > >
> > > I have confirmed that the byte offset of the word xref below is exactly at
> > > 113884174.
> >
> > Does the xref stream start at 112085940 (stream offset from the trailer
> > dictionary) or what did you find at that offset?
> >
> >
> > > xref
> > > 0 53641
> > > 0000000000 65535 f
> > > 0000000017 00000 n
> > >
> > > <massive snip/>
> > >
> > >
> > > trailer
> > > <<
> > > /Size 53641
> > > /Root 1 0 R
> > > /XRefStm 112085940
> > > /Info 8 0 R
> > > /ID [<19790A83488211E283B50017F203355C>
> > > <E3DF7097A16969B08238787F19E7E219>]
> > > >>
> > > startxref
> > > 113884174
> > > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> > > R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > > endobj
> > >
> > >
> > >     protected COSDictionary parseXref(long startXRefOffset) throws
> > > IOException
> > >     {
> > >         pdfSource.seek(startXRefOffset);
> > >         long startXrefOffset = parseStartXref();
> > >         // check the startxref offset
> > >         long fixedOffset = checkXRefOffset(startXrefOffset);
> > >         if (fixedOffset > -1)
> > >         {
> > >             startXrefOffset = fixedOffset;
> > >         }
> > >         document.setStartXref(startXrefOffset);
> > >         long prev = startXrefOffset;
> > >         // ---- parse whole chain of xref tables/object streams using PREV reference
> > >         while (prev > -1)  // <== prev here is 113884174.
> > >         {
> > >             // seek to xref table
> > >             pdfSource.seek(prev);
> > >
> > >             // skip white spaces
> > >             skipSpaces();
> > >             // -- parse xref
> > >             if (pdfSource.peek() == X)
> > >             {
> > >                 // xref table and trailer
> > >                 // use existing parser to parse xref table
> > >                 parseXrefTable(prev);
> > >                 // parse the last trailer.
> > >                 trailerOffset = pdfSource.getOffset();
> > >                 // PDFBOX-1739 skip extra xref entries in RegisSTAR
> > > documents
> > >                 while (isLenient && pdfSource.peek() != 't')
> > >                 {
> > >                     if (pdfSource.getOffset() == trailerOffset)
> > >                     {
> > >                         // warn only the first time
> > >                         LOG.warn("Expected trailer object at position " +
> > > trailerOffset
> > >                                 + ", keep trying");
> > >                     }
> > >                     readLine();
> > >                 }
> > >                 if (!parseTrailer())
> > >                 {
> > >                     throw new IOException("Expected trailer object at
> > > position: "
> > >                             + pdfSource.getOffset());
> > >                 }
> > >                 COSDictionary trailer =
> > > xrefTrailerResolver.getCurrentTrailer();
> > >                 // check for a XRef stream, it may contain some object ids of compressed objects
> > >                 if(trailer.containsKey(COSName.XREF_STM))  // <== YES - but falue
> > >                 {
> > >                     int streamOffset = trailer.getInt(COSName.XREF_STM);
> > >                     // <== This returns 112085940, which is the value from the trailer
> > >                     // check the xref stream reference
> > >                     fixedOffset = checkXRefOffset(streamOffset);
> > >                     // <== checks it and returns 113884174 instead
> > >                     if (fixedOffset > -1 && fixedOffset != streamOffset)
> > >                     {
> > >                         streamOffset = (int)fixedOffset;
> > >                         trailer.setInt(COSName.XREF_STM, streamOffset);
> > >                     }
> > >                     pdfSource.seek(streamOffset);  // <== Seeks to 113884174
> > >                     //readExpectedString(XREF_TABLE, false);
> > >                     skipSpaces();  // <== It's ON "xref", so it doesn't skip anything
> > >                     // <== goes in here, first thing it tries to do is readObjectNumber(),
> > >                     //     which can't work because it's 'xref' -- BOOM
> > >                     parseXrefObjStream(prev, false);
> > >                 }
> > >                 prev = trailer.getInt(COSName.PREV);
> > >                 if (prev > -1)
> > >                 {
> > >                     // check the xref table reference
> > >                     fixedOffset = checkXRefOffset(prev);
> > >                     if (fixedOffset > -1 && fixedOffset != prev)
> > >                     {
> > >                         prev = fixedOffset;
> > >                         trailer.setLong(COSName.PREV, prev);
> > >                     }
> > >                 }
> > >             }
> > >             else
> > >             {
> > >                 // parse xref stream
> > >                 prev = parseXrefObjStream(prev, true);
> > >                 if (prev > -1)
> > >                 {
> > >                     // check the xref table reference
> > >                     fixedOffset = checkXRefOffset(prev);
> > >                     if (fixedOffset > -1 && fixedOffset != prev)
> > >                     {
> > >                         prev = fixedOffset;
> > >                         COSDictionary trailer =
> > > xrefTrailerResolver.getCurrentTrailer();
> > >                         trailer.setLong(COSName.PREV, prev);
> > >                     }
> > >                 }
> > >             }
> > >         }
> > >         // ---- build valid xrefs out of the xref chain
> > >         xrefTrailerResolver.setStartxref(startXrefOffset);
> > >         COSDictionary trailer = xrefTrailerResolver.getTrailer();
> > >         document.setTrailer(trailer);
> > >         document.setIsXRefStream(XRefType.STREAM ==
> > > xrefTrailerResolver.getXrefType());
> > >         // check the offsets of all referenced objects
> > >         checkXrefOffsets();
> > >         // copy xref table
> > >         document.addXRefTable(xrefTrailerResolver.getXrefTable());
> > >         return trailer;
> > >     }
> >
> >
> > BR
> > Andreas Lehmkühler
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >