Andreas-
Thanks for the response.
Sorry for sending directly.
Yes, it tries to read from offset 112085940 but does not find the xrefstm
there, so that's when it goes searching. The offset seems to land in the
middle of something else (perhaps an image?).
I ran the preflight command on the file, and this is what it found there.
The offset is in the middle of a long series of repetitive byte patterns like
the ones below, interspersed with other sections of content that are also
binary only.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<preflight name="file.pdf">
<executionTimeMS>2646</executionTimeMS>
<isValid type="">false</isValid>
<errors count="1">
<error count="1">
<code>1.0</code>
<details>Syntax error, Error: Expected a long type at offset 112085940,
instead got
'6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'</details>
</error>
</errors>
</preflight>
The patterns seem to be:
lots of these: 6lÙ³fÍ›
interspersed between blocks that are similar to this:
±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'
It just so happens that the offset 112085940 falls right in the middle of a big
block of those 6lÙ³fÍ› repetitive blocks.
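In case it helps anyone reproduce this kind of check without preflight, here is a minimal sketch for looking at the raw bytes at a given offset. The class name PeekAtOffset and the helper peek() are made up for illustration and are not part of PDFBox; it only uses java.io.RandomAccessFile.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class PeekAtOffset {
    /**
     * Reads up to {@code length} bytes starting at {@code offset}.
     * Returns fewer bytes near the end of the file, and an empty
     * array if the offset lies outside the file.
     */
    static byte[] peek(String path, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (offset < 0 || offset >= file.length()) {
                return new byte[0];
            }
            int n = (int) Math.min(length, file.length() - offset);
            byte[] buffer = new byte[n];
            file.seek(offset);
            file.readFully(buffer);
            return buffer;
        }
    }
}
```

Decoding the result with StandardCharsets.ISO_8859_1 keeps one character per byte, so something like new String(PeekAtOffset.peek("file.pdf", 112085940L, 64), StandardCharsets.ISO_8859_1) shows whether the offset starts with an object header, the keyword xref, or binary data.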
Not sure if that's any help.
Steve
________________________________________
From: Andreas Lehmkühler <[email protected]>
Sent: Monday, February 16, 2015 3:34 AM
To: [email protected]
Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
(or variation of it still present)
Hi,
> Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
>
>
>
> Hi Tilman and Andreas--
Please don't contact developers directly; use our mailing lists instead. I've
put the users list back in the loop...
> I am working with Krasimir on this issue.
>
> Although we asked, we were denied permission to send the document out.
:-(
> The failure is being triggered when we attempt to use the Encrypt() class to
> password protect the pdf.
> We end up with the "Expected a long type at offset 113884174, instead got
> 'xref'" failure.
>
> I have debugged into the PDFBox code and found the offending parts.
>
> PdfBox is trying to parse an xref table located at 113884174.
>
> The problem we are seeing is that inside the trailer it finds the /XRefStm
> label, and its offset value is returned as 112085940 (which is what is given
> in the file).
> However, the checkXRefOffset() call made to verify it doesn't find the xref
> stream there, so it goes searching and ends up returning the closest xref
> offset it can find, which happens to be its own offset,
> 113884174.
>
>
> I believe that there is an error in PdfBox with respect to this fixup logic,
> even if it had found the 'correct' xref stream.
> That is because the fixup offset can NEVER work. Every time it fixes up the
> location, it lands on a section which begins with "xref".
> The next call is to skip the whitespace, but since there is never any there
> (it's already proven to be 'xref'), it does not advance the input stream.
> Then, the first call to parse that xrefstm always calls readObjectID(), which
> always will throw the exception because the bytes are always 'xref'.
>
> So, my questions are:
>
> 1) Are these docs fixable or are they truly corrupt?
Without having the PDF itself in hand, it's hard to give a 100% answer. But I
guess there has to be a fix, as Adobe is able to open that PDF. I'll try to find
one, following your description of the PDF.
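As a starting point, here is a minimal sketch of the kind of guard that might avoid the failure described above. It assumes the parser can look at the bytes at the corrected offset before handing it to the xref stream parser. The class XrefTargetCheck and both of its methods are hypothetical and not part of PDFBox; the sketch works on a plain byte array to stay self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class XrefTargetCheck {
    enum Target { XREF_TABLE, OBJECT_HEADER, OTHER }

    // "<objNum> <genNum> obj": an xref stream is an indirect object and starts like this
    private static final Pattern OBJECT_HEADER =
            Pattern.compile("^\\d+\\s+\\d+\\s+obj\\b");

    /** Classifies what the bytes at {@code offset} look like. */
    static Target classify(byte[] data, int offset) {
        if (offset < 0 || offset >= data.length) {
            return Target.OTHER;
        }
        int end = Math.min(data.length, offset + 32);
        // ISO-8859-1 maps each byte to exactly one char
        String head = new String(data, offset, end - offset, StandardCharsets.ISO_8859_1);
        if (head.startsWith("xref")) {
            return Target.XREF_TABLE;
        }
        if (OBJECT_HEADER.matcher(head).find()) {
            return Target.OBJECT_HEADER;
        }
        return Target.OTHER;
    }

    /**
     * True only if the corrected /XRefStm offset differs from the table
     * currently being parsed and points at something that starts like an object.
     */
    static boolean shouldParseXrefStream(byte[] data, int fixedOffset, int currentTableOffset) {
        return fixedOffset != currentTableOffset
                && classify(data, fixedOffset) == Target.OBJECT_HEADER;
    }
}
```

With a check like this, a corrected offset that lands back on the keyword xref (as in the trace below, where the search returns 113884174) would cause the /XRefStm entry to be skipped rather than parsed as a stream. Whether skipping is the right recovery for such a file is an open question.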
> 2) Is this xref issue a known issue with PdfBox? I would try to create a
> document that displays the error but I honestly don't know how to do so (beyond
> sending the ones that we have that DO display it).
Not until now
> 3) Do you have any idea how these documents end up in this state if they are
> being edited by tools such as InDesign, Acrobat, etc? Is there something I can
> do to identify them?
There are a lot of more or less corrupt files in the wild. Those are created
using different tools.
> 4) If this is a truly corrupted document, why would Acrobat be able to open
> these files but pdfBox cannot? Are these streams somehow ignorable? I ask
> this because I saw this statement on a web page
> (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/) when
> I was searching for answers on this:
Adobe implements a lot of self healing mechanisms to repair broken files and we
try to do so too, but it's complicated.
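To illustrate one reason why it is complicated, here is a simplified stand-in for a "closest xref" fallback. It is not PDFBox's actual search code, and the class NearestXrefSearch is invented for this example. It scans for the keyword xref (ignoring the tail of startxref) and returns the hit nearest to the wanted offset.

```java
import java.nio.charset.StandardCharsets;

final class NearestXrefSearch {
    /** Returns the offset of the "xref" keyword closest to {@code wanted}, or -1 if none exists. */
    static long nearestXref(byte[] data, long wanted) {
        // ISO-8859-1 keeps char indices equal to byte offsets
        String text = new String(data, StandardCharsets.ISO_8859_1);
        long best = -1;
        int from = 0;
        while (true) {
            int hit = text.indexOf("xref", from);
            if (hit < 0) {
                break;
            }
            from = hit + 1;
            // skip the "xref" that is the tail of "startxref"
            if (hit >= 5 && text.startsWith("start", hit - 5)) {
                continue;
            }
            if (best < 0 || Math.abs(hit - wanted) < Math.abs(best - wanted)) {
                best = hit;
            }
        }
        return best;
    }
}
```

If the file contains only one xref table, a search like this returns that table's offset no matter which broken offset it was asked to repair, which matches the behaviour described in the report.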
> – /XrefStm [integer]: specifies the offset from the beginning of the file to
> the cross-reference stream in the decoded stream. This is only present in
> hybrid-reference files, which is specified if we would also like to open
> documents even if the applications don’t support compressed reference
> streams.
>
> Any light you can shed on this is appreciated.
>
> Thanks-
> Steve
>
>
> See below for the pertinent data and the code which is marked with the values
> as I traced through.
>
> I have confirmed that the byte offset of the word xref below is exactly at
> 113884174.
Does the xref stream start at 112085940 (stream offset from the trailer
dictionary) or what did you find at that offset?
> xref
> 0 53641
> 0000000000 65535 f
> 0000000017 00000 n
>
> <massive snip/>
>
>
> trailer
> <<
> /Size 53641
> /Root 1 0 R
> /XRefStm 112085940
> /Info 8 0 R
> /ID [<19790A83488211E283B50017F203355C> <E3DF7097A16969B08238787F19E7E219>]
> >>
> startxref
> 113884174
> %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> endobj
>
>
> protected COSDictionary parseXref(long startXRefOffset) throws IOException
> {
> pdfSource.seek(startXRefOffset);
> long startXrefOffset = parseStartXref();
> // check the startxref offset
> long fixedOffset = checkXRefOffset(startXrefOffset);
> if (fixedOffset > -1)
> {
> startXrefOffset = fixedOffset;
> }
> document.setStartXref(startXrefOffset);
> long prev = startXrefOffset;
> // ---- parse whole chain of xref tables/object streams using PREV reference
> while (prev > -1) // <== prev here is 113884174
> {
> // seek to xref table
> pdfSource.seek(prev);
>
> // skip white spaces
> skipSpaces();
> // -- parse xref
> if (pdfSource.peek() == X)
> {
> // xref table and trailer
> // use existing parser to parse xref table
> parseXrefTable(prev);
> // parse the last trailer.
> trailerOffset = pdfSource.getOffset();
> // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> while (isLenient && pdfSource.peek() != 't')
> {
> if (pdfSource.getOffset() == trailerOffset)
> {
> // warn only the first time
> LOG.warn("Expected trailer object at position " +
> trailerOffset
> + ", keep trying");
> }
> readLine();
> }
> if (!parseTrailer())
> {
> throw new IOException("Expected trailer object at position: "
> + pdfSource.getOffset());
> }
> COSDictionary trailer =
> xrefTrailerResolver.getCurrentTrailer();
> // check for a XRef stream, it may contain some object ids of compressed objects
> if(trailer.containsKey(COSName.XREF_STM)) // <== YES - but value
> {
> // <== the next call returns 112085940, which is the value from the trailer
> int streamOffset = trailer.getInt(COSName.XREF_STM);
> // check the xref stream reference
> // <== the check below returns 113884174 instead
> fixedOffset = checkXRefOffset(streamOffset);
> if (fixedOffset > -1 && fixedOffset != streamOffset)
> {
> streamOffset = (int)fixedOffset;
> trailer.setInt(COSName.XREF_STM, streamOffset);
> }
> pdfSource.seek(streamOffset); // <== seeks to 113884174
> //readExpectedString(XREF_TABLE, false);
> skipSpaces(); // <== it's ON "xref", so it doesn't skip anything
> // <== goes in here; the first thing it tries to do is readObjectNumber(),
> // which can't work because it's 'xref' -- BOOM
> parseXrefObjStream(prev, false);
> }
> prev = trailer.getInt(COSName.PREV);
> if (prev > -1)
> {
> // check the xref table reference
> fixedOffset = checkXRefOffset(prev);
> if (fixedOffset > -1 && fixedOffset != prev)
> {
> prev = fixedOffset;
> trailer.setLong(COSName.PREV, prev);
> }
> }
> }
> else
> {
> // parse xref stream
> prev = parseXrefObjStream(prev, true);
> if (prev > -1)
> {
> // check the xref table reference
> fixedOffset = checkXRefOffset(prev);
> if (fixedOffset > -1 && fixedOffset != prev)
> {
> prev = fixedOffset;
> COSDictionary trailer =
> xrefTrailerResolver.getCurrentTrailer();
> trailer.setLong(COSName.PREV, prev);
> }
> }
> }
> }
> // ---- build valid xrefs out of the xref chain
> xrefTrailerResolver.setStartxref(startXrefOffset);
> COSDictionary trailer = xrefTrailerResolver.getTrailer();
> document.setTrailer(trailer);
> document.setIsXRefStream(XRefType.STREAM ==
> xrefTrailerResolver.getXrefType());
> // check the offsets of all referenced objects
> checkXrefOffsets();
> // copy xref table
> document.addXRefTable(xrefTrailerResolver.getXrefTable());
> return trailer;
> }
BR
Andreas Lehmkühler