Hi,

> Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
>
> Hi Tilman and Andreas--

Please don't contact developers directly; use our mailing lists instead. I've
put the users list back in the loop.

> I am working with Krasimir on this issue.
>
> Although we asked, we were denied permission to send the document out. :-(
> The failure is being triggered when we attempt to use the Encrypt() class to
> password protect the pdf.
> We end up with the "Expected a long type at offset 113884174, instead got
> 'xref'" failure.
>
> I have debugged into the PDFBox code and found the offending parts.
>
> PdfBox is trying to parse an xref table located at 113884174.
>
> The problem we are seeing is that inside the trailer it finds the /XRefStm
> label, and its offset value is returned as 112085940 (which is what is given
> in the file).
> However, the checkXRefOffset() call made to verify it doesn't find the xref
> stream there, so it goes searching and ends up returning the closest xref
> offset it can find, which happens to be its own offset at 113884174.
>
> I believe that there is an error in PdfBox with respect to this fixup logic,
> even if it had found the 'correct' xref stream.
> That is because the fixup offset can NEVER work. Every time it fixes up the
> location, it lands on a section which begins with "xref".
> The next call is to skip the whitespace, but since there is never any there
> (it's already proven to be 'xref'), it does not advance the input stream.
> Then, the first call to parse that xrefstm always calls readObjectID(), which
> always will throw the exception because the bytes are always 'xref'.
>
> So, my questions are:
>
> 1) Are these docs fixable or are they truly corrupt?

Without having the pdf itself at hand it's hard to give a 100% answer. But I
guess there has to be a fix, as Adobe is able to open that pdf. I'll try to find
one, following your description of the pdf.

> 2) Is this xref issue a known issue with PdfBox? I would try to create a
> document that displays the error but I honestly don't know how to do so
> (beyond sending the ones that we have that DO display it).

Not until now.

> 3) Do you have any idea how these documents end up in this state if they are
> being edited by tools such as InDesign, Acrobat, etc? Is there something I can
> do to identify them?

There are a lot of more or less corrupt files in the wild. Those are created
using different tools.

> 4) If this is a truly corrupted document, why would Acrobat be able to open
> these files but pdfBox cannot? Are these streams somehow ignorable? I ask
> this because I saw this statement on a web page
> (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/) when
> I was searching for answers on this:

Adobe implements a lot of self-healing mechanisms to repair broken files and we
try to do so too, but it's complicated.

> – /XrefStm [integer]: specifies the offset from the beginning of the file to
> the cross-reference stream in the decoded stream. This is only present in
> hybrid-reference files, which is specified if we would also like to open
> documents even if the applications don’t support compressed reference
> streams.
>
> Any light you can shed on this is appreciated.
>
> Thanks-
> Steve
>
>
> See below for the pertinent data and the code which is marked with the values
> as I traced through.
>
> I have confirmed that the byte offset of the word xref below is exactly at
> 113884174.

Does the xref stream start at 112085940 (the stream offset from the trailer
dictionary), or what did you find at that offset?

> xref
> 0 53641
> 0000000000 65535 f
> 0000000017 00000 n
>
> <massive snip/>
>
>
> trailer
> <<
> /Size 53641
> /Root 1 0 R
> /XRefStm 112085940
> /Info 8 0 R
> /ID [<19790A83488211E283B50017F203355C> <E3DF7097A16969B08238787F19E7E219>]
> >>
> startxref
> 113884174
> %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> endobj
>
>
> protected COSDictionary parseXref(long startXRefOffset) throws IOException
> {
>     pdfSource.seek(startXRefOffset);
>     long startXrefOffset = parseStartXref();
>     // check the startxref offset
>     long fixedOffset = checkXRefOffset(startXrefOffset);
>     if (fixedOffset > -1)
>     {
>         startXrefOffset = fixedOffset;
>     }
>     document.setStartXref(startXrefOffset);
>     long prev = startXrefOffset;
>     // ---- parse whole chain of xref tables/object streams using PREV reference
>     while (prev > -1) // <== prev here is 113884174.
>     {
>         // seek to xref table
>         pdfSource.seek(prev);
>
>         // skip white spaces
>         skipSpaces();
>         // -- parse xref
>         if (pdfSource.peek() == X)
>         {
>             // xref table and trailer
>             // use existing parser to parse xref table
>             parseXrefTable(prev);
>             // parse the last trailer.
>             trailerOffset = pdfSource.getOffset();
>             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
>             while (isLenient && pdfSource.peek() != 't')
>             {
>                 if (pdfSource.getOffset() == trailerOffset)
>                 {
>                     // warn only the first time
>                     LOG.warn("Expected trailer object at position " + trailerOffset
>                             + ", keep trying");
>                 }
>                 readLine();
>             }
>             if (!parseTrailer())
>             {
>                 throw new IOException("Expected trailer object at position: "
>                         + pdfSource.getOffset());
>             }
>             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
>             // check for a XRef stream, it may contain some object ids of compressed objects
>             if(trailer.containsKey(COSName.XREF_STM)) // <== YES - but falue
>             {
>                 // <== This returns 112085940, which is the value from the trailer
>                 int streamOffset = trailer.getInt(COSName.XREF_STM);
>                 // check the xref stream reference
>                 // <== checks it and returns 113884174 instead
>                 fixedOffset = checkXRefOffset(streamOffset);
>                 if (fixedOffset > -1 && fixedOffset != streamOffset)
>                 {
>                     streamOffset = (int)fixedOffset;
>                     trailer.setInt(COSName.XREF_STM, streamOffset);
>                 }
>                 pdfSource.seek(streamOffset); // <== Seeks to 113884174
>                 //readExpectedString(XREF_TABLE, false);
>                 skipSpaces(); // <=== It's ON "xref", so it doesn't skip anything
>                 // <== goes in here, first thing it tries to do is readObjectNumber(),
>                 //     which can't work because it's 'xref' -- BOOM
>                 parseXrefObjStream(prev, false);
>             }
>             prev = trailer.getInt(COSName.PREV);
>             if (prev > -1)
>             {
>                 // check the xref table reference
>                 fixedOffset = checkXRefOffset(prev);
>                 if (fixedOffset > -1 && fixedOffset != prev)
>                 {
>                     prev = fixedOffset;
>                     trailer.setLong(COSName.PREV, prev);
>                 }
>             }
>         }
>         else
>         {
>             // parse xref stream
>             prev = parseXrefObjStream(prev, true);
>             if (prev > -1)
>             {
>                 // check the xref table reference
>                 fixedOffset = checkXRefOffset(prev);
>                 if (fixedOffset > -1 && fixedOffset != prev)
>                 {
>                     prev = fixedOffset;
>                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
>                     trailer.setLong(COSName.PREV, prev);
>                 }
>             }
>         }
>     }
>     // ---- build valid xrefs out of the xref chain
>     xrefTrailerResolver.setStartxref(startXrefOffset);
>     COSDictionary trailer = xrefTrailerResolver.getTrailer();
>     document.setTrailer(trailer);
>     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
>     // check the offsets of all referenced objects
>     checkXrefOffsets();
>     // copy xref table
>     document.addXRefTable(xrefTrailerResolver.getXrefTable());
>     return trailer;
> }

BR
Andreas Lehmkühler
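
The failure traced in the quoted code comes down to one missing check: an offset meant for /XRefStm is only usable if an indirect object header ("N G obj") starts there, not the keyword "xref". The following is a minimal sketch of such a check on an in-memory byte array. It is not PDFBox code: the class `XrefStmGuard`, its methods, and the convention of returning -1 to mean "ignore the /XRefStm entry" are all hypothetical names and assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; none of these names exist in PDFBox. */
final class XrefStmGuard {
    enum Kind { XREF_TABLE, OBJECT_HEADER, OTHER }

    private XrefStmGuard() {
    }

    /** Classifies the bytes at {@code offset}, ignoring leading PDF whitespace. */
    static Kind classify(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return Kind.OTHER;
        }
        int pos = skipWhitespace(pdf, (int) offset);
        if (startsWith(pdf, pos, "xref")) {
            return Kind.XREF_TABLE; // classic cross-reference table
        }
        // A cross-reference stream is an indirect object: "<number> <generation> obj".
        int afterNumber = skipDigits(pdf, pos);
        int genStart = skipWhitespace(pdf, afterNumber);
        int afterGen = skipDigits(pdf, genStart);
        int objStart = skipWhitespace(pdf, afterGen);
        boolean wellFormed = afterNumber > pos && genStart > afterNumber
                && afterGen > genStart && objStart > afterGen;
        return (wellFormed && startsWith(pdf, objStart, "obj")) ? Kind.OBJECT_HEADER : Kind.OTHER;
    }

    /**
     * Assumption for this sketch: prefer the corrected offset when one was found,
     * but accept it only if an object header starts there. Returns -1 otherwise,
     * meaning the /XRefStm entry should be skipped rather than parsed.
     */
    static long usableXrefStmOffset(byte[] pdf, long declaredOffset, long fixedOffset) {
        long candidate = fixedOffset > -1 ? fixedOffset : declaredOffset;
        return classify(pdf, candidate) == Kind.OBJECT_HEADER ? candidate : -1;
    }

    private static int skipWhitespace(byte[] pdf, int pos) {
        while (pos < pdf.length && isWhitespace(pdf[pos])) {
            pos++;
        }
        return pos;
    }

    private static int skipDigits(byte[] pdf, int pos) {
        while (pos < pdf.length && pdf[pos] >= '0' && pdf[pos] <= '9') {
            pos++;
        }
        return pos;
    }

    private static boolean isWhitespace(byte b) {
        // NUL, TAB, LF, FF, CR and SPACE are the PDF whitespace characters.
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    private static boolean startsWith(byte[] pdf, int pos, String keyword) {
        byte[] expected = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos + expected.length > pdf.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (pdf[pos + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the values from the trace, `usableXrefStmOffset(pdf, 112085940, 113884174)` would return -1 in this sketch, because the corrected offset points at "xref"; the caller could then skip the xref stream instead of running into the "Expected a long type" exception. Whether skipping is the right recovery for a given file is a separate question that this sketch does not answer.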

