PDFBOX-2523 still present (or variation of it still present)

Steve Antoch Mon, 23 Feb 2015 10:48:14 -0800

@Andreas-

I have downloaded the latest trunk. Parsing got much further than before, but 
it still failed.
However, I think I may have a fix for that failure:


The code returns 0 when the xrefstm fixedOffset is not found.  However, it 
still tries to load and parse an xref stream at offset 0, which results in a 
null reference exception later in parser.parse().

Thinking about this, I came up with the following:

                // check for a XRef stream, it may contain some object ids of
                // compressed objects
                if (trailer.containsKey(COSName.XREF_STM))
                {
                    int streamOffset = trailer.getInt(COSName.XREF_STM);
                    // check the xref stream reference
                    // <== fixedOffset comes back as 0 => not found
                    fixedOffset = checkXRefStreamOffset(streamOffset, false);
                    if (fixedOffset > -1 && fixedOffset != streamOffset)
                    {
                        // <== streamOffset gets set to 0 here
                        streamOffset = (int)fixedOffset;
                        trailer.setInt(COSName.XREF_STM, streamOffset);
                    }

                    // <== I added this test because an xref stream starting at
                    //     offset 0 can never happen, so we should simply skip it
                    if (streamOffset > 0)
                    {
                        pdfSource.seek(streamOffset);
                        skipSpaces();
                        // <== this call ultimately throws a null ref exception
                        //     if streamOffset == 0 on entry
                        parseXrefObjStream(prev, false);
                    }
                }

Adding that, the file successfully parses.
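
For anyone who wants to try the guard outside the parser, here is a minimal 
standalone sketch of the same decision. The class and method names 
(XRefStmOffsetGuard, usableOffset) are hypothetical and not part of PDFBox. 
It assumes, as observed above, that the offset check returns 0 when no xref 
stream is found and a negative value when there is nothing to fix.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; not PDFBox API. */
public final class XRefStmOffsetGuard
{
    private XRefStmOffsetGuard()
    {
    }

    /**
     * Decides which offset, if any, should be used to parse the xref stream.
     *
     * @param declaredOffset the /XRefStm value read from the trailer
     * @param fixedOffset    result of the offset check; assumed to be 0 when
     *                       nothing was found and negative when no fix applies
     * @return the offset to seek to, or empty if the xref stream should be skipped
     */
    public static OptionalLong usableOffset(long declaredOffset, long fixedOffset)
    {
        long offset = declaredOffset;
        if (fixedOffset > -1 && fixedOffset != declaredOffset)
        {
            // same replacement the parser performs; this may yield 0
            offset = fixedOffset;
        }
        // an xref stream can never start at offset 0, so treat that as "skip"
        return offset > 0 ? OptionalLong.of(offset) : OptionalLong.empty();
    }

    public static void main(String[] args)
    {
        // the case from this thread: declared 112085940, check comes back as 0
        System.out.println(usableOffset(112085940L, 0L).isPresent());
        // a declared offset that the check confirms unchanged
        System.out.println(usableOffset(112085940L, 112085940L).isPresent());
    }
}
```

Returning an empty OptionalLong keeps "no usable offset" distinct from any 
real file position, so the caller cannot seek to 0 by accident.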

Also, I put up a proposal on GitHub in a repo that I forked directly from 
pdfbox (it is the only change).
It relaxes the looping a bit to allow limited recursion.  I would appreciate 
your thoughts on it. 

https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd

Thank you so much!  You have been tremendously helpful.  I wish I could have 
given you the files, but unfortunately, they are proprietary and we cannot 
release them.  :-(

Best regards-
Steve

________________________________________
From: Andreas Lehmkühler <[email protected]>
Sent: Monday, February 23, 2015 3:43 AM
To: [email protected]
Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present 
(or variation of it still present)

Hi,

I've improved the self-repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue still
persist?

BR
Andreas Lehmkühler

> Steve Antoch <[email protected]> wrote on 17 February 2015 at 00:05:
>
>
>
> Andreas-
> Thanks for the response.
> Sorry for sending directly.
>
> Yes, it tries to read from offset 112085940, but does not find the xrefstm
> there, so
> that's when it goes searching.  It seems to be landing in the middle of
> something else (perhaps an image?)
>
> I tried running the preflight command on the file, and this is what it found
> there.
> This is in the middle of a whole series of repetitive byte patterns like
> these, interspersed with other sections of content that are also
> binary only.
>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <preflight name="file.pdf">
>   <executionTimeMS>2646</executionTimeMS>
>   <isValid type="">false</isValid>
>   <errors count="1">
>     <error count="1">
>       <code>1.0</code>
>       <details>Syntax error, Error: Expected a long type at offset 112085940,
> instead got
> '6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ³fÍ&#155;6lÙ±¯Ó"z·C&#156;3Í}&#14;y&#11;ó&#3;£g&#130;?1º·Ó&#158;-ó&#143;VÏ:ë½NsË&#142;¸&#31;6lÙ³fÅ#ë&#147;&#29;&#31;¨Î÷å.£=&#137;ù}ÕsÞÿ'</details>
>     </error>
>   </errors>
> </preflight>
>
> The patterns seem to be:
>
> lots of these: 6lÙ³fÍ&#155;
> interspersed between blocks that are similar to this:
> ±¯Ó"z·C&#156;3Í}&#14;y&#11;ó&#3;£g&#130;?1º·Ó&#158;-ó&#143;VÏ:ë½NsË&#142;¸&#31;6lÙ³fÅ#ë&#147;&#29;&#31;¨Î÷å.£=&#137;ù}ÕsÞÿ'
>
> It just so happens that the offset 112085940 falls right in the middle of a
> big block of those 6lÙ³fÍ&#155; repetitive blocks.
>
> Not sure if that's any help.
>
> Steve
>
> ________________________________________
> From: Andreas Lehmkühler <[email protected]>
> Sent: Monday, February 16, 2015 3:34 AM
> To: [email protected]
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi,
>
> > Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
> >
> >
> >
> > Hi Tilman and Andreas--
> Please don't contact developers directly, use our mailing lists instead. I've
> added the users list back to this thread...
>
> > I am working with Krasimir on this issue.
> >
> > Although we asked, we were denied permission to send the document out.
> :-(
>
> > The failure is being triggered when we attempt to use the Encrypt() class to
> > password protect the pdf.
> > We end up with the "Expected a long type at offset 113884174, instead got
> > 'xref'" failure.
> >
> > I have debugged into the PDFBox code and found the offending parts.
> >
> > PdfBox is  trying to parse an xref table located at 113884174.
> >
> > The problem we are seeing is that inside the trailer it finds the
> > /XRefStm
> > label, and its offset value is returned as 112085940 (which is what is given
> > in the file).
> > However, the checkXRefOffset() call made to verify it doesn't find the xref
> > stream there, so it goes searching and ends up returning the closest xref
> > offset it can find, which happens to be its own offset at
> > 113884174.
> >
> >
> > I believe that there is an error in PdfBox with respect to this fixup logic,
> > even if it had found the 'correct' xref stream.
> > That is because the fixup offset can NEVER work.  Every time it fixes up the
> > location, it lands on a section which begins with "xref".
> > The next call is to skip the whitespace, but since there is never any there
> > (it's already proven to be 'xref'),  it does not advance the input stream.
> > Then, the first call to parse that xrefstm always calls readObjectID(),
> > which
> > always will throw the exception because the bytes are always 'xref'.
> >
> > So, my questions are:
> >
> > 1) Are these docs fixable or are they truly corrupt?
> Without having the pdf itself at hand it's hard to give a 100% answer. But I
> guess there has to be a fix, as Adobe is able to open that pdf. I'll try to find
> one, following your description of the pdf
>
> > 2) Is this xref issue a known issue with PdfBox?  I would try to create a
> > document that displays the error but I honestly don't know how to do so
> > (beyond
> > sending the ones that we have that DO display it).
> Not until now
>
> > 3) Do you have any idea how these documents end up in this state if they are
> > being edited by tools such as InDesign, Acrobat, etc? Is there something I
> > can
> > do to identify them?
> There are a lot of more or less corrupt files in the wild. Those are created
> using different tools.
>
> > 4) If this is a truly corrupted document, why would Acrobat be able to open
> > these files but pdfBox cannot?  Are these streams somehow ignorable?  I ask
> > this because I saw this statement on a web page
> >  (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > when
> > I was searching for answers on this:
> Adobe implements a lot of self healing mechanisms to repair broken files and
> we
> try to do so too, but it's complicated.
>
> > – /XrefStm [integer]: specifies the offset from the beginning of the file to
> > the cross-reference stream in the decoded stream. This is only present in
> > hybrid-reference files, which is specified if we would also like to open
> > documents even if the applications  don’t support compressed reference
> > streams.
> >
> > Any light you can shed on this is appreciated.
> >
> > Thanks-
> > Steve
> >
> >
> > See below for the pertinent data and the code which is marked with the
> > values
> > as I traced through.
> >
> > I have confirmed that the byte offset of the word xref below is exactly at
> > 113884174.
>
> Does the xref stream start at 112085940 (stream offset from the trailer
> dictionary) or what did you find at that offset?
>
>
> > xref
> > 0 53641
> > 0000000000 65535 f
> > 0000000017 00000 n
> >
> > <massive snip/>
> >
> >
> > trailer
> > <<
> > /Size 53641
> > /Root 1 0 R
> > /XRefStm 112085940
> > /Info 8 0 R
> > /ID [<19790A83488211E283B50017F203355C>
> > <E3DF7097A16969B08238787F19E7E219>]
> > >>
> > startxref
> > 113884174
> > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> > R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > endobj
> >
> >
> >     protected COSDictionary parseXref(long startXRefOffset) throws IOException
> >     {
> >         pdfSource.seek(startXRefOffset);
> >         long startXrefOffset = parseStartXref();
> >         // check the startxref offset
> >         long fixedOffset = checkXRefOffset(startXrefOffset);
> >         if (fixedOffset > -1)
> >         {
> >             startXrefOffset = fixedOffset;
> >         }
> >         document.setStartXref(startXrefOffset);
> >         long prev = startXrefOffset;
> >         // ---- parse whole chain of xref tables/object streams using PREV reference
> >         while (prev > -1)  // <== prev here is 113884174.
> >         {
> >             // seek to xref table
> >             pdfSource.seek(prev);
> >
> >             // skip white spaces
> >             skipSpaces();
> >             // -- parse xref
> >             if (pdfSource.peek() == X)
> >             {
> >                 // xref table and trailer
> >                 // use existing parser to parse xref table
> >                 parseXrefTable(prev);
> >                 // parse the last trailer.
> >                 trailerOffset = pdfSource.getOffset();
> >                 // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> >                 while (isLenient && pdfSource.peek() != 't')
> >                 {
> >                     if (pdfSource.getOffset() == trailerOffset)
> >                     {
> >                         // warn only the first time
> >                         LOG.warn("Expected trailer object at position " + trailerOffset
> >                                 + ", keep trying");
> >                     }
> >                     readLine();
> >                 }
> >                 if (!parseTrailer())
> >                 {
> >                     throw new IOException("Expected trailer object at position: "
> >                             + pdfSource.getOffset());
> >                 }
> >                 COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> >                 // check for a XRef stream, it may contain some object ids of
> >                 // compressed objects
> >                 if(trailer.containsKey(COSName.XREF_STM))  // <== YES - but falue
> >                 {
> >                     // <== This returns 112085940, which is the value from the trailer
> >                     int streamOffset = trailer.getInt(COSName.XREF_STM);
> >                     // check the xref stream reference
> >                     // <== checks it and returns 113884174 instead
> >                     fixedOffset = checkXRefOffset(streamOffset);
> >                     if (fixedOffset > -1 && fixedOffset != streamOffset)
> >                     {
> >                         streamOffset = (int)fixedOffset;
> >                         trailer.setInt(COSName.XREF_STM, streamOffset);
> >                     }
> >                     pdfSource.seek(streamOffset);  // <== Seeks to 113884174
> >                     //readExpectedString(XREF_TABLE, false);
> >                     skipSpaces();    // <== It's ON "xref", so it doesn't skip anything
> >                     // <== goes in here, first thing it tries to do is readObjectNumber(),
> >                     //     which can't work because it's 'xref' -- BOOM
> >                     parseXrefObjStream(prev, false);
> >                 }
> >                 prev = trailer.getInt(COSName.PREV);
> >                 if (prev > -1)
> >                 {
> >                     // check the xref table reference
> >                     fixedOffset = checkXRefOffset(prev);
> >                     if (fixedOffset > -1 && fixedOffset != prev)
> >                     {
> >                         prev = fixedOffset;
> >                         trailer.setLong(COSName.PREV, prev);
> >                     }
> >                 }
> >             }
> >             else
> >             {
> >                 // parse xref stream
> >                 prev = parseXrefObjStream(prev, true);
> >                 if (prev > -1)
> >                 {
> >                     // check the xref table reference
> >                     fixedOffset = checkXRefOffset(prev);
> >                     if (fixedOffset > -1 && fixedOffset != prev)
> >                     {
> >                         prev = fixedOffset;
> >                         COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> >                         trailer.setLong(COSName.PREV, prev);
> >                     }
> >                 }
> >             }
> >         }
> >         // ---- build valid xrefs out of the xref chain
> >         xrefTrailerResolver.setStartxref(startXrefOffset);
> >         COSDictionary trailer = xrefTrailerResolver.getTrailer();
> >         document.setTrailer(trailer);
> >         document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> >         // check the offsets of all referenced objects
> >         checkXrefOffsets();
> >         // copy xref table
> >         document.addXRefTable(xrefTrailerResolver.getXrefTable());
> >         return trailer;
> >     }
>
>
> BR
> Andreas Lehmkühler
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
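
A note on the failure traced in the quoted 13 February message: the repaired 
/XRefStm offset landed on the keyword "xref", but a cross-reference stream is 
an indirect object and so has to begin with an object header of the form 
"N G obj". The sketch below shows one way to tell the two apart before 
parsing. It is standalone illustration code, not PDFBox code: the class 
XRefTargetProbe, the Kind enum, and the 32-byte look-ahead window are all 
assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical probe for illustration only; not PDFBox API. */
public final class XRefTargetProbe
{
    /** What was found at the probed offset. */
    public enum Kind { XREF_TABLE, OBJECT_HEADER, UNKNOWN }

    // object number, generation number, then the keyword "obj"
    private static final Pattern OBJECT_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    private XRefTargetProbe()
    {
    }

    /**
     * Classifies the bytes starting at the given offset.
     *
     * @param data   the raw file content
     * @param offset the position a trailer entry such as /XRefStm points to
     */
    public static Kind classify(byte[] data, int offset)
    {
        if (offset < 0 || offset >= data.length)
        {
            return Kind.UNKNOWN;
        }
        // look at a small window only; ISO-8859-1 maps every byte to one char
        int end = Math.min(data.length, offset + 32);
        String window = new String(data, offset, end - offset, StandardCharsets.ISO_8859_1);
        if (window.startsWith("xref"))
        {
            // classic cross-reference table, not an xref stream
            return Kind.XREF_TABLE;
        }
        // lookingAt() anchors the match at the start of the window
        return OBJECT_HEADER.matcher(window).lookingAt() ? Kind.OBJECT_HEADER : Kind.UNKNOWN;
    }
}
```

With a check like this, a repaired /XRefStm offset that classifies as 
XREF_TABLE or UNKNOWN could be skipped instead of being handed to the xref 
stream parser.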

