PDFBOX-2523 still present (or variation of it still present)

Andreas Lehmkühler Thu, 26 Feb 2015 23:51:07 -0800

Hi,

> Steve Antoch <[email protected]> wrote on 25 February 2015 at 00:04:
> 
> 
> Hi Andreas-
> 
> Thanks again.
> 
> I downloaded and built the latest from trunk.
> There was no change for the book I was testing.  I first tried it after taking
> out my "if (streamOffset > 0)" test, but the null reference exception still
> occurred.
OK, thanks again for testing. I've fixed the issue based on your analysis.
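
For anyone following the thread, here is a minimal sketch of the guard idea from your analysis, isolated from the parser. The class and method names are hypothetical and this is not the actual trunk change; it only assumes, as in your trace, that the verification step reports 0 when no xref stream is found and -1 when no correction is needed.

```java
// Hypothetical helper, not PDFBox API: decides which /XRefStm offset to parse.
class XrefStreamGuard {
    /**
     * @param declaredOffset the /XRefStm value read from the trailer
     * @param fixedOffset    the verified offset; assumed here to be -1 for
     *                       "no correction" and 0 for "not found"
     * @return the offset to seek to, or -1 if the xref stream should be skipped
     */
    static long resolveStreamOffset(long declaredOffset, long fixedOffset) {
        long offset = declaredOffset;
        if (fixedOffset > -1 && fixedOffset != declaredOffset) {
            offset = fixedOffset;
        }
        // An xref stream can never start at offset 0, so treat that as "skip".
        return offset > 0 ? offset : -1;
    }

    public static void main(String[] args) {
        System.out.println(resolveStreamOffset(112085940L, 0L));  // prints -1
        System.out.println(resolveStreamOffset(112085940L, -1L)); // prints 112085940
    }
}
```

The caller would only seek and parse when the result is positive, which is the same effect as the added "streamOffset > 0" test.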


> We are planning on running a large breadth test on approximately 108,000 pdfs
> starting tonight.  I will let you know how this test goes.  It will take about
> 4 days to complete.
Cool, I'm looking forward to seeing the results.

> With respect to the small change I made in my fork:
> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
> 
> The issue was a separate but fairly rare failure that we found in a small
> number (about 10) of our pdfs.
> Adobe and Pdfium (Chrome) were both able to open them but pdfBox was not due
> to disallowing nesting.  I figured that if Pdfium allows 64 levels of nesting,
> we might be able to relax this test from 0 levels to allowing 1 level and see
> if it worked.  Since it did, I wanted to run those changes by you for your
> comments.
Is there any chance of getting hold of a sample pdf? It would be good enough to
send it via private mail to me.
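
On the nesting change itself: the commit is not reproduced here, so the following is only a rough sketch of what "allow a limited number of nesting levels" can look like in general. Everything in it is hypothetical (the map of objects, the convention that an Integer value is a reference to another entry, the names); it is not the PDFBox code and not your patch.

```java
import java.util.Map;

// Hypothetical illustration of a depth-limited lookup, not PDFBox code.
class NestingLimit {
    /**
     * Follows references until a payload is reached.
     * An Integer value is treated as a reference to another entry,
     * a String value as the final payload.
     *
     * @param maxNesting number of reference hops allowed (0 = no nesting)
     */
    static String resolve(Map<Integer, Object> objects, int id, int maxNesting) {
        Object current = objects.get(id);
        int depth = 0;
        while (current instanceof Integer) {
            if (depth >= maxNesting) {
                throw new IllegalStateException(
                        "nesting deeper than " + maxNesting + " level(s)");
            }
            current = objects.get((Integer) current);
            depth++;
        }
        return (String) current;
    }
}
```

With maxNesting set to 0 any nested reference is rejected, with 1 a single hop is accepted, and a fixed upper bound still protects against reference loops.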

BR
Andreas Lehmkühler

> 
> Thanks-
> Steve
> 
> ________________________________________
> From: Andreas Lehmkühler <[email protected]>
> Sent: Tuesday, February 24, 2015 3:30 AM
> To: [email protected]
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
> 
> Hi Steve,
> 
> > Steve Antoch <[email protected]> wrote on 23 February 2015 at 19:42:
> >
> >
> > @Andreas-
> >
> > I have downloaded the latest trunk and came close (it got much further)
> > before failing.
> > However, I think I may have a fix for that failure:
> Thanks for the test
> 
> > The code is returning 0 when the xrefstm fixedOffset is not found.  However,
> > the code still tries to load and parse from xref 0, resulting in a null
> > reference exception later in parser.parse().
> Your analysis is correct, but I hope that my latest improvements eliminate
> such cases; see PDFBOX-2572 for details. Could you give the latest trunk
> (r1661747) a try?
> 
> > However, thinking about this, I came up with this:
> >
> >                 // check for a XRef stream, it may contain some object ids of compressed objects
> >                 if(trailer.containsKey(COSName.XREF_STM))
> >                 {
> >                     int streamOffset = trailer.getInt(COSName.XREF_STM);
> >                     // check the xref stream reference
> >                     fixedOffset = checkXRefStreamOffset(streamOffset, false);
> >                     //<== fixedOffset comes back as 0 => not found
> >                     if (fixedOffset > -1 && fixedOffset != streamOffset)
> >                     {
> >                         streamOffset = (int)fixedOffset;
> >                         // <== streamOffset gets set to 0 here
> >                         trailer.setInt(COSName.XREF_STM, streamOffset);
> >                     }
> >
> >                     //<==== I added this test because an xref stream starting at
> >                     //      offset 0 can never happen, so we should simply skip it
> >                     if (streamOffset > 0)
> >                     {
> >                         pdfSource.seek(streamOffset);
> >                         skipSpaces();
> >                         // <== this call ultimately throws a null ref exception
> >                         //     if streamOffset == 0 on entry
> >                         parseXrefObjStream(prev, false);
> >                     }
> >                 }
> >
> > Adding that, the file successfully parses.
> >
> > Also, there was this proposal that I put up on github in a repo that I
> > directly forked from pdfbox (it is the only change).
> > It relaxes the looping a bit to allow limited recursion.  I would appreciate
> > your thoughts on it.
> Is this change related to the discussed issue above?
> 
> > https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
> >
> > Thank you so much!  You have been tremendously helpful.  I wish I could have
> > given you the files, but unfortunately, they are proprietary and we cannot
> > release them.  :-(
> No need to worry, you are not the only one who is not allowed to share a
> specific pdf.
> 
> > Best regards-
> > Steve
> 
> BR
> Andreas Lehmkühler
> 
> >
> > ________________________________________
> > From: Andreas Lehmkühler <[email protected]>
> > Sent: Monday, February 23, 2015 3:43 AM
> > To: [email protected]
> > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> > (or variation of it still present)
> >
> > Hi,
> >
> > I've improved the self repair mechanism of the trunk based on Steve's report.
> >
> > @Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
> > still persist?
> >
> > BR
> > Andreas Lehmkühler
> >
> > > Steve Antoch <[email protected]> wrote on 17 February 2015 at 00:05:
> > >
> > >
> > >
> > > Andreas-
> > > Thanks for the response.
> > > Sorry for sending directly.
> > >
> > > > Yes, it tries to read from offset 112085940, but does not find the xrefstm
> > > > there, so that's when it goes searching.  It seems to be landing in the
> > > > middle of something else (perhaps an image?)
> > > >
> > > > I tried running the preflight command on the file, and this is what it
> > > > found there.
> > > > This is in the middle of a whole series of repetitive byte patterns like
> > > > these, which are interspersed with other sections of content that are also
> > > > binary only.
> > >
> > > <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> > > <preflight name="file.pdf">
> > >   <executionTimeMS>2646</executionTimeMS>
> > >   <isValid type="">false</isValid>
> > >   <errors count="1">
> > >     <error count="1">
> > >       <code>1.0</code>
> > > >       <details>Syntax error, Error: Expected a long type at offset 112085940, instead got
> > > '6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ±¯Ó"z·C&#156;3Í}&#14;y&#11;ó&#3;£g&#130;?1º·Ó&#158;-ó&#143;VÏ:ë½NsË&#142;¸&#31;6lÙ³fÅ#ë&#147;&#29;&#31;¨Î÷å.£=&#137;ù}ÕsÞÿ'</details>
> > >     </error>
> > >   </errors>
> > > </preflight>
> > >
> > > The patterns seem to be:
> > >
> > > lots of these: 6lÙ³fÍ&#155;
> > > interspersed between blocks that are similar to this:
> > > ±¯Ó"z·C&#156;3Í}&#14;y&#11;ó&#3;£g&#130;?1º·Ó&#158;-ó&#143;VÏ:ë½NsË&#142;¸&#31;6lÙ³fÅ#ë&#147;&#29;&#31;¨Î÷å.£=&#137;ù}ÕsÞÿ'
> > >
> > > > It just so happens that the offset 112085940 falls right in the middle of
> > > > a big block of those 6lÙ³fÍ&#155; repetitive blocks.
> > >
> > > Not sure if that's any help.
> > >
> > > Steve
> > >
> > > ________________________________________
> > > From: Andreas Lehmkühler <[email protected]>
> > > Sent: Monday, February 16, 2015 3:34 AM
> > > To: [email protected]
> > > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still
> > > present
> > > (or variation of it still present)
> > >
> > > Hi,
> > >
> > > > > Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
> > > >
> > > >
> > > >
> > > > Hi Tilman and Andreas--
> > > > Please don't contact developers directly, use our mailing lists instead.
> > > > I've put the users list back in the loop...
> > >
> > > > I am working with Krasimir on this issue.
> > > >
> > > > Although we asked, we were denied permission to send the document out.
> > > :-(
> > >
> > > > > The failure is being triggered when we attempt to use the Encrypt()
> > > > > class to password protect the pdf.
> > > > > We end up with the "Expected a long type at offset 113884174, instead
> > > > > got 'xref'" failure.
> > > >
> > > > I have debugged into the PDFBox code and found the offending parts.
> > > >
> > > > > PdfBox is trying to parse an xref table located at 113884174.
> > > > >
> > > > > The problem we are seeing is that inside the trailer it finds the
> > > > > /XRefStm label, and its offset value is returned as 112085940 (which is
> > > > > what is given in the file).
> > > > > However, the checkXRefOffset() call made to verify it doesn't find the
> > > > > xref stream there, so it goes searching and ends up returning the closest
> > > > > xref offset it can find, which happens to be its own offset at
> > > > > 113884174.
> > > >
> > > >
> > > > > I believe that there is an error in PdfBox with respect to this fixup
> > > > > logic, even if it had found the 'correct' xref stream.
> > > > > That is because the fixup offset can NEVER work.  Every time it fixes up
> > > > > the location, it lands on a section which begins with "xref".
> > > > > The next call is to skip the whitespace, but since there is never any
> > > > > there (it's already proven to be 'xref'), it does not advance the input
> > > > > stream.
> > > > > Then, the first call to parse that xrefstm always calls readObjectID(),
> > > > > which will always throw the exception because the bytes are always 'xref'.
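
The mismatch described above (an offset that lands on the keyword "xref" being parsed as if it were an "N G obj" stream object) can be checked up front. The following is a minimal sketch on a plain byte array, assuming a non-negative offset; the class and method names are hypothetical and this is not how PDFBox does it internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical probe, not PDFBox API: what starts at a given offset?
class XrefKindProbe {
    private static boolean isWhitespace(byte b) {
        // PDF whitespace: NUL, TAB, LF, FF, CR, SPACE
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static int skipSpaces(byte[] data, int pos) {
        while (pos < data.length && isWhitespace(data[pos])) {
            pos++;
        }
        return pos;
    }

    private static int skipDigits(byte[] data, int pos) {
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            pos++;
        }
        return pos;
    }

    private static boolean matches(byte[] data, int pos, String keyword) {
        byte[] kw = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos + kw.length > data.length) {
            return false;
        }
        for (int i = 0; i < kw.length; i++) {
            if (data[pos + i] != kw[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if a classic xref table (keyword "xref") starts at the offset. */
    static boolean isXrefTable(byte[] data, int offset) {
        return matches(data, skipSpaces(data, offset), "xref");
    }

    /** True if an object header "N G obj" (as for an xref stream) starts at the offset. */
    static boolean isObjectHeader(byte[] data, int offset) {
        int numStart = skipSpaces(data, offset);
        int numEnd = skipDigits(data, numStart);
        int genStart = skipSpaces(data, numEnd);
        int genEnd = skipDigits(data, genStart);
        int objStart = skipSpaces(data, genEnd);
        return numEnd > numStart && genStart > numEnd && genEnd > genStart
                && objStart > genEnd && matches(data, objStart, "obj");
    }
}
```

A stream parser would only be entered when isObjectHeader returns true; an offset for which isXrefTable is true is a table and cannot be a valid /XRefStm target.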
> > > >
> > > > So, my questions are:
> > > >
> > > > 1) Are these docs fixable or are they truly corrupt?
> > > > Without having the pdf itself at hand it's hard to give a 100% answer.
> > > > But I guess there has to be a fix, as Adobe is able to open that pdf. I'll
> > > > try to find one, following your description of the pdf.
> > >
> > > > > 2) Is this xref issue a known issue with PdfBox?  I would try to create
> > > > > a document that displays the error but I honestly don't know how to do
> > > > > so (beyond sending the ones that we have that DO display it).
> > > Not until now
> > >
> > > > > 3) Do you have any idea how these documents end up in this state if they
> > > > > are being edited by tools such as InDesign, Acrobat, etc? Is there
> > > > > something I can do to identify them?
> > > > There are a lot of more or less corrupt files in the wild. Those are
> > > > created using different tools.
> > >
> > > > > 4) If this is a truly corrupted document, why would Acrobat be able to
> > > > > open these files but pdfBox cannot?  Are these streams somehow
> > > > > ignorable?  I ask this because I saw this statement on a web page
> > > > > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > > > > when I was searching for answers on this:
> > > > Adobe implements a lot of self healing mechanisms to repair broken files
> > > > and we try to do so too, but it's complicated.
> > >
> > > > > – /XrefStm [integer]: specifies the offset from the beginning of the
> > > > > file to the cross-reference stream in the decoded stream. This is only
> > > > > present in hybrid-reference files, which is specified if we would also
> > > > > like to open documents even if the applications don’t support
> > > > > compressed reference streams.
> > > >
> > > > Any light you can shed on this is appreciated.
> > > >
> > > > Thanks-
> > > > Steve
> > > >
> > > >
> > > > > See below for the pertinent data and the code which is marked with the
> > > > > values as I traced through.
> > > > >
> > > > > I have confirmed that the byte offset of the word xref below is exactly
> > > > > at 113884174.
> > >
> > > Does the xref stream start at 112085940 (stream offset from the trailer
> > > dictionary) or what did you find at that offset?
> > >
> > >
> > > > xref
> > > > 0 53641
> > > > 0000000000 65535 f
> > > > 0000000017 00000 n
> > > >
> > > > <massive snip/>
> > > >
> > > >
> > > > > trailer
> > > > > <<
> > > > > /Size 53641
> > > > > /Root 1 0 R
> > > > > /XRefStm 112085940
> > > > > /Info 8 0 R
> > > > > /ID [<19790A83488211E283B50017F203355C>
> > > > > <E3DF7097A16969B08238787F19E7E219>]
> > > > > >>
> > > > > startxref
> > > > > 113884174
> > > > > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0 R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > > > > endobj
> > > >
> > > >
> > > > >     protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > > >     {
> > > >         pdfSource.seek(startXRefOffset);
> > > >         long startXrefOffset = parseStartXref();
> > > >         // check the startxref offset
> > > >         long fixedOffset = checkXRefOffset(startXrefOffset);
> > > >         if (fixedOffset > -1)
> > > >         {
> > > >             startXrefOffset = fixedOffset;
> > > >         }
> > > >         document.setStartXref(startXrefOffset);
> > > >         long prev = startXrefOffset;
> > > >         // ---- parse whole chain of xref tables/object streams using
> > > > PREV
> > > > reference
> > > >         while (prev > -1)  <== prev here is 113884174.
> > > >         {
> > > >             // seek to xref table
> > > >             pdfSource.seek(prev);
> > > >
> > > >             // skip white spaces
> > > >             skipSpaces();
> > > >             // -- parse xref
> > > >             if (pdfSource.peek() == X)
> > > >             {
> > > >                 // xref table and trailer
> > > >                 // use existing parser to parse xref table
> > > >                 parseXrefTable(prev);
> > > >                 // parse the last trailer.
> > > >                 trailerOffset = pdfSource.getOffset();
> > > > >                 // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> > > > >                 while (isLenient && pdfSource.peek() != 't')
> > > > >                 {
> > > > >                     if (pdfSource.getOffset() == trailerOffset)
> > > > >                     {
> > > > >                         // warn only the first time
> > > > >                         LOG.warn("Expected trailer object at position " + trailerOffset
> > > > >                                 + ", keep trying");
> > > >                     }
> > > >                     readLine();
> > > >                 }
> > > >                 if (!parseTrailer())
> > > >                 {
> > > >                     throw new IOException("Expected trailer object at
> > > > position: "
> > > >                             + pdfSource.getOffset());
> > > >                 }
> > > >                 COSDictionary trailer =
> > > > xrefTrailerResolver.getCurrentTrailer();
> > > >                 // check for a XRef stream, it may contain some object
> > > > ids
> > > > of
> > > > compressed objects
> > > >                 if(trailer.containsKey(COSName.XREF_STM))  <== YES - but
> > > > falue
> > > >                 {
> > > >                     int streamOffset = trailer.getInt(COSName.XREF_STM);
> > > >  <==
> > > > This returns 112085940, which is the value from the trailer
> > > >                     // check the xref stream reference
> > > >                     fixedOffset = checkXRefOffset(streamOffset);
> > > >          <==
> > > > checks it and returns 113884174 instead
> > > >                     if (fixedOffset > -1 && fixedOffset != streamOffset)
> > > >                     {
> > > >                         streamOffset = (int)fixedOffset;
> > > >                         trailer.setInt(COSName.XREF_STM, streamOffset);
> > > >                     }
> > > >                     pdfSource.seek(streamOffset);  <== Seeks to
> > > > 113884174
> > > >                     //readExpectedString(XREF_TABLE, false);
> > > >                     skipSpaces();    <===      It's ON "xref", so it
> > > > doesn't
> > > > skip anything
> > > >                     parseXrefObjStream(prev, false); <== goes in here,
> > > > first
> > > > thing it tries to do is readObjectNumber(), which can't work because
> > > > it's
> > > > 'xref' -- BOOM
> > > >                 }
> > > >                 prev = trailer.getInt(COSName.PREV);
> > > >                 if (prev > -1)
> > > >                 {
> > > >                     // check the xref table reference
> > > >                     fixedOffset = checkXRefOffset(prev);
> > > >                     if (fixedOffset > -1 && fixedOffset != prev)
> > > >                     {
> > > >                         prev = fixedOffset;
> > > >                         trailer.setLong(COSName.PREV, prev);
> > > >                     }
> > > >                 }
> > > >             }
> > > >             else
> > > >             {
> > > >                 // parse xref stream
> > > >                 prev = parseXrefObjStream(prev, true);
> > > >                 if (prev > -1)
> > > >                 {
> > > >                     // check the xref table reference
> > > >                     fixedOffset = checkXRefOffset(prev);
> > > >                     if (fixedOffset > -1 && fixedOffset != prev)
> > > >                     {
> > > >                         prev = fixedOffset;
> > > > >                         COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > > >                         trailer.setLong(COSName.PREV, prev);
> > > >                     }
> > > >                 }
> > > >             }
> > > >         }
> > > >         // ---- build valid xrefs out of the xref chain
> > > >         xrefTrailerResolver.setStartxref(startXrefOffset);
> > > >         COSDictionary trailer = xrefTrailerResolver.getTrailer();
> > > >         document.setTrailer(trailer);
> > > > >         document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> > > >         // check the offsets of all referenced objects
> > > >         checkXrefOffsets();
> > > >         // copy xref table
> > > >         document.addXRefTable(xrefTrailerResolver.getXrefTable());
> > > >         return trailer;
> > > >     }
> > >
> > >
> > > BR
> > > Andreas Lehmkühler
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >