Hi Steve,
> Steve Antoch <[email protected]> wrote on 23 February 2015 at 19:42:
>
>
> @Andreas-
>
> I have downloaded the latest trunk and came close (it got much further) before
> failing.
> However, I think I may have a fix for that failure:
Thanks for testing.
> The code is returning 0 when the xrefstm fixedOffset is not found. However,
> the code still tries to load and parse from xref 0, resulting in a null
> reference exception later in parser.parse().
Your analysis is correct, but my latest improvements should eliminate such
cases; see PDFBOX-2572 for details. Could you give the latest trunk
(r1661747) a try?
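
For readers following the thread, the failure mode described above (an offset of 0 meaning "not found" but still being used as a seek target) can be sketched as follows. This is only a minimal illustration, not the PDFBox code: the class `XrefOffsetGuard` and the method `usableXrefStmOffset` are hypothetical names, and the sketch assumes the offset check returns a value <= 0 when no xref stream could be located.

```java
import java.util.OptionalLong;

public class XrefOffsetGuard {
    /**
     * Hypothetical helper, not part of PDFBox.
     * declaredOffset: the /XRefStm value read from the trailer.
     * repairedOffset: the result of an offset check (assumption: <= 0 means "not found").
     * Returns the offset that is safe to seek to, or empty if the stream should be skipped.
     */
    public static OptionalLong usableXrefStmOffset(long declaredOffset, long repairedOffset) {
        long candidate = declaredOffset;
        // Same replacement rule as in the quoted code: take the repaired value if it differs.
        if (repairedOffset > -1 && repairedOffset != declaredOffset) {
            candidate = repairedOffset;
        }
        // A PDF normally begins with its "%PDF-" header, so no xref stream starts at offset 0.
        return candidate > 0 ? OptionalLong.of(candidate) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Repair reported "not found" (0): the caller should skip the xref stream.
        System.out.println(usableXrefStmOffset(112085940L, 0L).isPresent());
        // Repair found a different positive offset: that one is used.
        System.out.println(usableXrefStmOffset(112085940L, 112085950L).isPresent());
    }
}
```

Returning an empty value instead of 0 forces the caller to handle the "not found" case explicitly rather than seeking to the start of the file.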
> However, thinking about this, I came up with this:
>
> // check for a XRef stream, it may contain some object ids of compressed objects
> if (trailer.containsKey(COSName.XREF_STM))
> {
>     int streamOffset = trailer.getInt(COSName.XREF_STM);
>     // check the xref stream reference
>     fixedOffset = checkXRefStreamOffset(streamOffset, false); // <== fixedOffset comes back as 0 => not found
>     if (fixedOffset > -1 && fixedOffset != streamOffset)
>     {
>         streamOffset = (int)fixedOffset; // <== streamOffset gets set to 0 here
>         trailer.setInt(COSName.XREF_STM, streamOffset);
>     }
>
>     // <== I added this test because an xref stream starting at offset 0
>     // can never happen, so we should simply skip it
>     if (streamOffset > 0)
>     {
>         pdfSource.seek(streamOffset);
>         skipSpaces();
>         // <== this call ultimately throws a null ref exception if streamOffset == 0 on entry
>         parseXrefObjStream(prev, false);
>     }
> }
>
> Adding that, the file successfully parses.
>
> Also, there was this proposal that I put up on github in a repo that I
> directly forked from pdfbox (it is the only change)
> It relaxes the looping a bit to allow limited recursion. I would appreciate
> your thoughts on it.
Is this change related to the discussed issue above?
> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
>
> Thank you so much! You have been tremendously helpful. I wish I could have
> given you the files, but unfortunately, they are proprietary and we cannot
> release them. :-(
No need to worry, you are not the only one who is not allowed to share a
specific pdf.
> Best regards-
> Steve
BR
Andreas Lehmkühler
>
> ________________________________________
> From: Andreas Lehmkühler <[email protected]>
> Sent: Monday, February 23, 2015 3:43 AM
> To: [email protected]
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi,
>
> I've improved the self-repair mechanism of the trunk based on Steve's report.
>
> @Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
> still persist?
>
> BR
> Andreas Lehmkühler
>
> > Steve Antoch <[email protected]> wrote on 17 February 2015 at 00:05:
> >
> >
> >
> > Andreas-
> > Thanks for the response.
> > Sorry for sending directly.
> >
> > Yes, it tries to read from offset 112085940, but does not find the xrefstm
> > there, so
> > that's when it goes searching. It seems to be landing in the middle of
> > something else (perhaps an image?)
> >
> > I tried running the preflight command on the file, and this is what it found
> > there.
> > > This is in the middle of a whole series of repetitive byte patterns like
> > > these, which are interspersed with other sections of content that are also
> > > binary only.
> >
> > <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> > <preflight name="file.pdf">
> > <executionTimeMS>2646</executionTimeMS>
> > <isValid type="">false</isValid>
> > <errors count="1">
> > <error count="1">
> > <code>1.0</code>
> > <details>Syntax error, Error: Expected a long type at offset
> > 112085940,
> > instead got
> > '6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'</details>
> > </error>
> > </errors>
> > </preflight>
> >
> > The patterns seem to be:
> >
> > lots of these: 6lÙ³fÍ›
> > interspersed between blocks that are similar to this:
> > ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'
> >
> > It just so happens that the offset 112085940 falls right in the middle of a
> > big block of those 6lÙ³fÍ› repetitive blocks.
> >
> > Not sure if that's any help.
> >
> > Steve
> >
> > ________________________________________
> > From: Andreas Lehmkühler <[email protected]>
> > Sent: Monday, February 16, 2015 3:34 AM
> > To: [email protected]
> > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> > (or variation of it still present)
> >
> > Hi,
> >
> > > Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
> > >
> > >
> > >
> > > Hi Tilman and Andreas--
> > Please don't contact developers directly; use our mailing lists instead.
> > I've put the users list back in the loop...
> >
> > > I am working with Krasimir on this issue.
> > >
> > > Although we asked, we were denied permission to send the document out.
> > :-(
> >
> > > The failure is being triggered when we attempt to use the Encrypt() class
> > > to
> > > password protect the pdf.
> > > We end up with the "Expected a long type at offset 113884174, instead got
> > > 'xref'" failure.
> > >
> > > I have debugged into the PDFBox code and found the offending parts.
> > >
> > > PdfBox is trying to parse an xref table located at 113884174.
> > >
> > > The problem we are seeing is that inside the trailer it finds the /XRefStm
> > > label, and its offset value is returned as 112085940 (which is what is given
> > > in the file).
> > > However, the checkXRefOffset() call made to verify it doesn't find the xref
> > > stream there, so it goes searching and ends up returning the closest xref
> > > offset it can find, which happens to be its own offset at 113884174.
> > >
> > >
> > > I believe that there is an error in PdfBox with respect to this fixup
> > > logic,
> > > even if it had found the 'correct' xref stream.
> > > That is because the fixup offset can NEVER work. Every time it fixes up
> > > the
> > > location, it lands on a section which begins with "xref".
> > > The next call is to skip the whitespace, but since there is never any
> > > there
> > > (it's already proven to be 'xref'), it does not advance the input stream.
> > > Then, the first call to parse that xrefstm always calls readObjectID(),
> > > which
> > > always will throw the exception because the bytes are always 'xref'.
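
The mismatch described above can be made concrete: a classic cross-reference table begins with the keyword `xref`, whereas a cross-reference stream is an ordinary indirect object and begins with an object header such as `53641 0 obj`. The following is a minimal sketch of telling the two apart before choosing a parser. `XrefTargetKind` and `classify` are hypothetical names for illustration, not PDFBox API, and the sketch assumes the data at the offset starts directly with the token (no leading whitespace).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class XrefTargetKind {
    public enum Kind { XREF_TABLE, OBJECT_HEADER, UNKNOWN }

    // Matches "<object number> <generation> obj", e.g. "53641 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("^\\d+\\s+\\d+\\s+obj");

    /** Hypothetical helper: inspects up to 32 bytes at the given offset. */
    public static Kind classify(byte[] data, int offset) {
        if (offset < 0 || offset >= data.length) {
            return Kind.UNKNOWN;
        }
        int end = Math.min(data.length, offset + 32);
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(data, offset, end - offset, StandardCharsets.ISO_8859_1);
        if (head.startsWith("xref")) {
            return Kind.XREF_TABLE;      // a table: needs the table parser
        }
        if (OBJ_HEADER.matcher(head).find()) {
            return Kind.OBJECT_HEADER;   // possibly an xref stream object
        }
        return Kind.UNKNOWN;             // e.g. the middle of some binary data
    }

    public static void main(String[] args) {
        byte[] table = "xref\n0 53641\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(classify(table, 0));
    }
}
```

With a check like this, a repaired offset that points at `xref` would never be handed to a parser that expects an object number first.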
> > >
> > > So, my questions are:
> > >
> > > 1) Are these docs fixable or are they truly corrupt?
> > Without having the PDF itself at hand it's hard to give a definite answer. But I
> > guess there has to be a fix, as Adobe is able to open that PDF. I'll try to find
> > one, following your description of the PDF.
> >
> > > 2) Is this xref issue a known issue with PdfBox? I would try to create a
> > > document that displays the error but I honestly don't know how to do so
> > > (beyond sending the ones that we have that DO display it).
> > Not until now
> >
> > > 3) Do you have any idea how these documents end up in this state if they
> > > are
> > > being edited by tools such as InDesign, Acrobat, etc? Is there something I
> > > can
> > > do to identify them?
> > There are a lot of more or less corrupt files in the wild. Those are created
> > using different tools.
> >
> > > 4) If this is a truly corrupted document, why would Acrobat be able to
> > > open
> > > these files but pdfBox cannot? Are these streams somehow ignorable? I
> > > ask
> > > this because I saw this statement on a web page
> > > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > > when
> > > I was searching for answers on this:
> > Adobe implements a lot of self-healing mechanisms to repair broken files, and we
> > try to do so too, but it's complicated.
> >
> > > – /XrefStm [integer]: specifies the offset from the beginning of the file
> > > to
> > > the cross-reference stream in the decoded stream. This is only present in
> > > hybrid-reference files, which is specified if we would also like to open
> > > documents even if the applications don’t support compressed reference
> > > streams.
> > >
> > > Any light you can shed on this is appreciated.
> > >
> > > Thanks-
> > > Steve
> > >
> > >
> > > See below for the pertinent data and the code which is marked with the
> > > values
> > > as I traced through.
> > >
> > > I have confirmed that the byte offset of the word xref below is exactly at
> > > 113884174.
> >
> > Does the xref stream start at 112085940 (stream offset from the trailer
> > dictionary) or what did you find at that offset?
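
One way to answer this kind of question without sharing the file is to dump a few bytes at the offset in question. The following is a minimal sketch using only the JDK; `PeekAtOffset` and `peek` are hypothetical names, and the file path and offset are supplied by the caller.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PeekAtOffset {
    /** Returns up to 'length' bytes starting at 'offset', decoded one byte per char. */
    public static String peek(String path, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (length <= 0 || offset < 0 || offset >= file.length()) {
                return "";
            }
            int toRead = (int) Math.min(length, file.length() - offset);
            byte[] buffer = new byte[toRead];
            file.seek(offset);
            file.readFully(buffer);
            // ISO-8859-1 keeps binary bytes intact instead of failing on invalid sequences.
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PeekAtOffset <file.pdf> <offset>
        System.out.println(peek(args[0], Long.parseLong(args[1]), 64));
    }
}
```

Running it against the /XRefStm value from the trailer shows immediately whether an object header, the keyword `xref`, or unrelated binary data sits at that position.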
> >
> >
> > > xref
> > > 0 53641
> > > 0000000000 65535 f
> > > 0000000017 00000 n
> > >
> > > <massive snip/>
> > >
> > >
> > > trailer
> > > <<
> > > /Size 53641
> > > /Root 1 0 R
> > > /XRefStm 112085940
> > > /Info 8 0 R
> > > /ID [<19790A83488211E283B50017F203355C> <E3DF7097A16969B08238787F19E7E219>]
> > > >>
> > > startxref
> > > 113884174
> > > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0 R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > > endobj
> > >
> > >
> > > protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > > {
> > >     pdfSource.seek(startXRefOffset);
> > >     long startXrefOffset = parseStartXref();
> > >     // check the startxref offset
> > >     long fixedOffset = checkXRefOffset(startXrefOffset);
> > >     if (fixedOffset > -1)
> > >     {
> > >         startXrefOffset = fixedOffset;
> > >     }
> > >     document.setStartXref(startXrefOffset);
> > >     long prev = startXrefOffset;
> > >     // ---- parse whole chain of xref tables/object streams using PREV reference
> > >     while (prev > -1) // <== prev here is 113884174.
> > >     {
> > >         // seek to xref table
> > >         pdfSource.seek(prev);
> > >
> > >         // skip white spaces
> > >         skipSpaces();
> > >         // -- parse xref
> > >         if (pdfSource.peek() == X)
> > >         {
> > >             // xref table and trailer
> > >             // use existing parser to parse xref table
> > >             parseXrefTable(prev);
> > >             // parse the last trailer.
> > >             trailerOffset = pdfSource.getOffset();
> > >             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> > >             while (isLenient && pdfSource.peek() != 't')
> > >             {
> > >                 if (pdfSource.getOffset() == trailerOffset)
> > >                 {
> > >                     // warn only the first time
> > >                     LOG.warn("Expected trailer object at position " + trailerOffset
> > >                             + ", keep trying");
> > >                 }
> > >                 readLine();
> > >             }
> > >             if (!parseTrailer())
> > >             {
> > >                 throw new IOException("Expected trailer object at position: "
> > >                         + pdfSource.getOffset());
> > >             }
> > >             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > >             // check for a XRef stream, it may contain some object ids of compressed objects
> > >             if (trailer.containsKey(COSName.XREF_STM)) // <== YES - but falue
> > >             {
> > >                 // <== This returns 112085940, which is the value from the trailer
> > >                 int streamOffset = trailer.getInt(COSName.XREF_STM);
> > >                 // check the xref stream reference
> > >                 // <== checks it and returns 113884174 instead
> > >                 fixedOffset = checkXRefOffset(streamOffset);
> > >                 if (fixedOffset > -1 && fixedOffset != streamOffset)
> > >                 {
> > >                     streamOffset = (int)fixedOffset;
> > >                     trailer.setInt(COSName.XREF_STM, streamOffset);
> > >                 }
> > >                 pdfSource.seek(streamOffset); // <== Seeks to 113884174
> > >                 //readExpectedString(XREF_TABLE, false);
> > >                 skipSpaces(); // <== It's ON "xref", so it doesn't skip anything
> > >                 // <== goes in here, first thing it tries to do is readObjectNumber(),
> > >                 // which can't work because it's 'xref' -- BOOM
> > >                 parseXrefObjStream(prev, false);
> > >             }
> > >             prev = trailer.getInt(COSName.PREV);
> > >             if (prev > -1)
> > >             {
> > >                 // check the xref table reference
> > >                 fixedOffset = checkXRefOffset(prev);
> > >                 if (fixedOffset > -1 && fixedOffset != prev)
> > >                 {
> > >                     prev = fixedOffset;
> > >                     trailer.setLong(COSName.PREV, prev);
> > >                 }
> > >             }
> > >         }
> > >         else
> > >         {
> > >             // parse xref stream
> > >             prev = parseXrefObjStream(prev, true);
> > >             if (prev > -1)
> > >             {
> > >                 // check the xref table reference
> > >                 fixedOffset = checkXRefOffset(prev);
> > >                 if (fixedOffset > -1 && fixedOffset != prev)
> > >                 {
> > >                     prev = fixedOffset;
> > >                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > >                     trailer.setLong(COSName.PREV, prev);
> > >                 }
> > >             }
> > >         }
> > >     }
> > >     // ---- build valid xrefs out of the xref chain
> > >     xrefTrailerResolver.setStartxref(startXrefOffset);
> > >     COSDictionary trailer = xrefTrailerResolver.getTrailer();
> > >     document.setTrailer(trailer);
> > >     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> > >     // check the offsets of all referenced objects
> > >     checkXrefOffsets();
> > >     // copy xref table
> > >     document.addXRefTable(xrefTrailerResolver.getXrefTable());
> > >     return trailer;
> > > }
> >
> >
> > BR
> > Andreas Lehmkühler
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >