@Andreas-
I downloaded the latest trunk. Parsing got much further than before, but it
still failed.
I think I may have a fix for that failure, though:
The code returns 0 as the fixedOffset when the xrefstm is not found. However,
it still tries to load and parse an xref stream at offset 0, which results in
a null reference exception later in parser.parse().
Thinking about this, I came up with the following:
// check for a XRef stream, it may contain some object ids of compressed objects
if (trailer.containsKey(COSName.XREF_STM))
{
    int streamOffset = trailer.getInt(COSName.XREF_STM);
    // check the xref stream reference
    fixedOffset = checkXRefStreamOffset(streamOffset, false);
    // <== fixedOffset comes back as 0 => not found
    if (fixedOffset > -1 && fixedOffset != streamOffset)
    {
        streamOffset = (int)fixedOffset;
        // <== streamOffset gets set to 0 here
        trailer.setInt(COSName.XREF_STM, streamOffset);
    }
    // <==== I added this test because an xref stream starting at
    // offset 0 can never happen, so we should simply skip it
    if (streamOffset > 0)
    {
        pdfSource.seek(streamOffset);
        skipSpaces();
        // <== this call ultimately throws a null ref exception
        // if streamOffset == 0 on entry
        parseXrefObjStream(prev, false);
    }
}
Adding that, the file successfully parses.
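To make the decision easier to reason about in isolation, here is a minimal,
self-contained sketch of just the offset logic. It is not the PDFBox code
itself: the class name XRefStmOffsetGuard, the method resolveStreamOffset and
the SKIP sentinel are made up for illustration, and it assumes (as observed
above) that the offset check reports 0 when it finds no xref stream.

```java
// Hypothetical helper, not part of PDFBox: isolates the "which offset do we
// parse, if any" decision from the snippet above.
class XRefStmOffsetGuard
{
    // Sentinel meaning "do not try to parse an xref stream at all".
    static final long SKIP = -1L;

    /**
     * @param declaredOffset the /XRefStm value read from the trailer
     * @param fixedOffset    the result of the offset check; assumed to be 0
     *                       when no xref stream was found, and below 0 when
     *                       no correction is available
     * @return the offset to seek to, or SKIP if the stream should be ignored
     */
    static long resolveStreamOffset(long declaredOffset, long fixedOffset)
    {
        long streamOffset = declaredOffset;
        if (fixedOffset > -1 && fixedOffset != streamOffset)
        {
            // take over the corrected offset (may be 0 = not found)
            streamOffset = fixedOffset;
        }
        // an xref stream can never start at offset 0, so skip it
        return streamOffset > 0 ? streamOffset : SKIP;
    }

    public static void main(String[] args)
    {
        // the failing case from this thread: declared 112085940, check says 0
        System.out.println(resolveStreamOffset(112085940L, 0L));
        // declared offset confirmed by the check
        System.out.println(resolveStreamOffset(112085940L, 112085940L));
    }
}
```

With the values from my file (declared 112085940, check result 0) this yields
SKIP, so the parser would neither seek to offset 0 nor call
parseXrefObjStream().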
Also, I put a proposal up on GitHub, in a repo that I forked directly from
pdfbox (it is the only change there).
It relaxes the looping a bit to allow limited recursion. I would appreciate
your thoughts on it.
https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
Thank you so much! You have been tremendously helpful. I wish I could have
given you the files, but unfortunately, they are proprietary and we cannot
release them. :-(
Best regards-
Steve
________________________________________
From: Andreas Lehmkühler <[email protected]>
Sent: Monday, February 23, 2015 3:43 AM
To: [email protected]
Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
(or variation of it still present)
Hi,
I've improved the self-repair mechanism of the trunk based on Steve's report.
@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue still
persist?
BR
Andreas Lehmkühler
> On February 17, 2015 at 00:05, Steve Antoch <[email protected]> wrote:
>
>
>
> Andreas-
> Thanks for the response.
> Sorry for sending directly.
>
> Yes, it tries to read from offset 112085940, but does not find the xrefstm
> there, so that's when it goes searching. It seems to be landing in the
> middle of something else (perhaps an image?)
>
> I tried running the preflight command on the file, and this is what it found
> there.
> This is in the middle of a whole series of repetitive byte patterns like
> these, interspersed with other sections of content that are also
> binary only.
>
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <preflight name="file.pdf">
> <executionTimeMS>2646</executionTimeMS>
> <isValid type="">false</isValid>
> <errors count="1">
> <error count="1">
> <code>1.0</code>
> <details>Syntax error, Error: Expected a long type at offset 112085940,
> instead got
> '6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'</details>
> </error>
> </errors>
> </preflight>
>
> The patterns seem to be:
>
> lots of these: 6lÙ³fÍ›
> interspersed between blocks that are similar to this:
> ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'
>
> It just so happens that the offset 112085940 falls right in the middle of a
> big block of those 6lÙ³fÍ› repetitive blocks.
>
> Not sure if that's any help.
>
> Steve
>
> ________________________________________
> From: Andreas Lehmkühler <[email protected]>
> Sent: Monday, February 16, 2015 3:34 AM
> To: [email protected]
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi,
>
> > On February 13, 2015 at 23:34, Steve Antoch <[email protected]> wrote:
> >
> >
> >
> > Hi Tilman and Andreas--
> Please don't contact developers directly, use our mailing lists instead. I've
> put the users list back into the boat...
>
> > I am working with Krasimir on this issue.
> >
> > Although we asked, we were denied permission to send the document out.
> :-(
>
> > The failure is being triggered when we attempt to use the Encrypt() class to
> > password protect the pdf.
> > We end up with the "Expected a long type at offset 113884174, instead got
> > 'xref'" failure.
> >
> > I have debugged into the PDFBox code and found the offending parts.
> >
> > PdfBox is trying to parse an xref table located at 113884174.
> >
> > The problem we are seeing is that inside the trailer it finds the /XRefStm
> > label, and its offset value is returned as 112085940 (which is what is given
> > in the file).
> > However, the checkXRefOffset() call made to verify it doesn't find the xref
> > stream there, so it goes searching and ends up returning the closest xref
> > offset it can find, which happens to be the xref table's own offset,
> > 113884174.
> >
> >
> > I believe that there is an error in PdfBox with respect to this fixup logic,
> > even if it had found the 'correct' xref stream.
> > That is because the fixup offset can NEVER work. Every time it fixes up the
> > location, it lands on a section which begins with "xref".
> > The next call is to skip the whitespace, but since there is never any there
> > (it's already proven to be 'xref'), it does not advance the input stream.
> > Then, the first call to parse that xrefstm always calls readObjectID(),
> > which
> > always will throw the exception because the bytes are always 'xref'.
> >
> > So, my questions are:
> >
> > 1) Are these docs fixable or are they truly corrupt?
> Without having the PDF itself at hand it's hard to give a 100% answer. But I
> guess there has to be a fix, as Adobe is able to open that PDF. I'll try to
> find one, following your description of the PDF.
>
> > 2) Is this xref issue a known issue with PdfBox? I would try to create a
> > document that displays the error but I honestly don't know how to do so
> > (beyond sending the ones that we have that DO display it).
> Not until now
>
> > 3) Do you have any idea how these documents end up in this state if they are
> > being edited by tools such as InDesign, Acrobat, etc? Is there something I
> > can
> > do to identify them?
> There are a lot of more or less corrupt files in the wild. Those are created
> using different tools.
>
> > 4) If this is a truly corrupted document, why would Acrobat be able to open
> > these files but pdfBox cannot? Are these streams somehow ignorable? I ask
> > this because I saw this statement on a web page
> > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > when
> > I was searching for answers on this:
> Adobe implements a lot of self-healing mechanisms to repair broken files and
> we try to do so too, but it's complicated.
>
> > – /XrefStm [integer]: specifies the offset from the beginning of the file to
> > the cross-reference stream in the decoded stream. This is only present in
> > hybrid-reference files, which is specified if we would also like to open
> > documents even if the applications don’t support compressed reference
> > streams.
> >
> > Any light you can shed on this is appreciated.
> >
> > Thanks-
> > Steve
> >
> >
> > See below for the pertinent data and the code which is marked with the
> > values
> > as I traced through.
> >
> > I have confirmed that the byte offset of the word xref below is exactly at
> > 113884174.
>
> Does the xref stream start at 112085940 (stream offset from the trailer
> dictionary) or what did you find at that offset?
>
>
> > xref
> > 0 53641
> > 0000000000 65535 f
> > 0000000017 00000 n
> >
> > <massive snip/>
> >
> >
> > trailer
> > <<
> > /Size 53641
> > /Root 1 0 R
> > /XRefStm 112085940
> > /Info 8 0 R
> > /ID [<19790A83488211E283B50017F203355C>
> > <E3DF7097A16969B08238787F19E7E219>]
> > >>
> > startxref
> > 113884174
> > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> > R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > endobj
> >
> >
> > protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > {
> > pdfSource.seek(startXRefOffset);
> > long startXrefOffset = parseStartXref();
> > // check the startxref offset
> > long fixedOffset = checkXRefOffset(startXrefOffset);
> > if (fixedOffset > -1)
> > {
> > startXrefOffset = fixedOffset;
> > }
> > document.setStartXref(startXrefOffset);
> > long prev = startXrefOffset;
> > // ---- parse whole chain of xref tables/object streams using PREV reference
> > while (prev > -1) // <== prev here is 113884174.
> > {
> > // seek to xref table
> > pdfSource.seek(prev);
> >
> > // skip white spaces
> > skipSpaces();
> > // -- parse xref
> > if (pdfSource.peek() == X)
> > {
> > // xref table and trailer
> > // use existing parser to parse xref table
> > parseXrefTable(prev);
> > // parse the last trailer.
> > trailerOffset = pdfSource.getOffset();
> > // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> > while (isLenient && pdfSource.peek() != 't')
> > {
> > if (pdfSource.getOffset() == trailerOffset)
> > {
> > // warn only the first time
> > LOG.warn("Expected trailer object at position " + trailerOffset
> > + ", keep trying");
> > }
> > readLine();
> > }
> > if (!parseTrailer())
> > {
> > throw new IOException("Expected trailer object at position: "
> > + pdfSource.getOffset());
> > }
> > COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > // check for a XRef stream, it may contain some object ids of
> > // compressed objects
> > if(trailer.containsKey(COSName.XREF_STM)) <== YES - but
> > falue
> > {
> > int streamOffset = trailer.getInt(COSName.XREF_STM);
> > // <== This returns 112085940, which is the value from the trailer
> > // check the xref stream reference
> > fixedOffset = checkXRefOffset(streamOffset);
> > // <== checks it and returns 113884174 instead
> > if (fixedOffset > -1 && fixedOffset != streamOffset)
> > {
> > streamOffset = (int)fixedOffset;
> > trailer.setInt(COSName.XREF_STM, streamOffset);
> > }
> > pdfSource.seek(streamOffset); // <== Seeks to 113884174
> > //readExpectedString(XREF_TABLE, false);
> > skipSpaces(); // <=== It's ON "xref", so it doesn't skip anything
> > // <== goes in here, first thing it tries to do is readObjectNumber(),
> > // which can't work because it's 'xref' -- BOOM
> > parseXrefObjStream(prev, false);
> > }
> > prev = trailer.getInt(COSName.PREV);
> > if (prev > -1)
> > {
> > // check the xref table reference
> > fixedOffset = checkXRefOffset(prev);
> > if (fixedOffset > -1 && fixedOffset != prev)
> > {
> > prev = fixedOffset;
> > trailer.setLong(COSName.PREV, prev);
> > }
> > }
> > }
> > else
> > {
> > // parse xref stream
> > prev = parseXrefObjStream(prev, true);
> > if (prev > -1)
> > {
> > // check the xref table reference
> > fixedOffset = checkXRefOffset(prev);
> > if (fixedOffset > -1 && fixedOffset != prev)
> > {
> > prev = fixedOffset;
> > COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > trailer.setLong(COSName.PREV, prev);
> > }
> > }
> > }
> > }
> > // ---- build valid xrefs out of the xref chain
> > xrefTrailerResolver.setStartxref(startXrefOffset);
> > COSDictionary trailer = xrefTrailerResolver.getTrailer();
> > document.setTrailer(trailer);
> > document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> > // check the offsets of all referenced objects
> > checkXrefOffsets();
> > // copy xref table
> > document.addXRefTable(xrefTrailerResolver.getXrefTable());
> > return trailer;
> > }
>
>
> BR
> Andreas Lehmkühler
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>