Hi,

> Steve Antoch <[email protected]> wrote on 25 February 2015 at 00:04:
>
>
> Hi Andreas-
>
> Thanks again.
>
> I downloaded and built the latest from trunk.
> There was no change for the book I was testing. I first tried it after taking
> out my if (streamOffset > 0) test, but the null reference exception still
> occurred.
OK, thanks again for testing. I've fixed the issue based on your analysis.

> We are planning on running a large breadth test on approximately 108,000 pdfs
> starting tonight. I will let you know how this test goes. It will take about
> 4 days to complete.
Cool, I'm looking forward to seeing the results.

> With respect to the small change I made in my fork:
> https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
>
> The issue was a separate but fairly rare failure that we found in a small
> number (about 10) of our pdfs.
> Adobe and Pdfium (Chrome) were both able to open them but pdfBox was not due
> to disallowing nesting. I figured that if Pdfium allows 64 levels of nesting,
> we might be able to relax this test from 0 levels to allowing 1 level and see
> if it worked. Since it did, I wanted to run those changes by you for your
> comments.
Is there any chance of getting hold of a sample pdf? It would be good enough to
send it via private mail to me.

BR
Andreas Lehmkühler

>
> Thanks-
> Steve
>
> ________________________________________
> From: Andreas Lehmkühler <[email protected]>
> Sent: Tuesday, February 24, 2015 3:30 AM
> To: [email protected]
> Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> (or variation of it still present)
>
> Hi Steve,
>
> > Steve Antoch <[email protected]> wrote on 23 February 2015 at 19:42:
> >
> >
> > @Andreas-
> >
> > I have downloaded the latest trunk and came close (it got much further)
> > before failing.
> > However, I think I may have a fix for that failure:
> Thanks for the test
>
> > The code is returning 0 when the xrefstm fixedOffset is not found. However,
> > the code still tries to load and parse from xref 0, resulting in a null
> > reference exception later in parser.parse().
> Your analysis is correct, but I hope that my last improvements should
> eliminate such cases, see PDFBOX-2572 for details. Could you give the latest
> trunk (r1661747) a try?
>
> > However, thinking about this, I came up with this:
> >
> >     // check for a XRef stream, it may contain some object ids of
> >     // compressed objects
> >     if (trailer.containsKey(COSName.XREF_STM))
> >     {
> >         int streamOffset = trailer.getInt(COSName.XREF_STM);
> >         // check the xref stream reference
> >         fixedOffset = checkXRefStreamOffset(streamOffset, false);
> >         // <== fixedOffset comes back as 0 => not found
> >         if (fixedOffset > -1 && fixedOffset != streamOffset)
> >         {
> >             streamOffset = (int)fixedOffset;
> >             // <== streamOffset gets set to 0 here
> >             trailer.setInt(COSName.XREF_STM, streamOffset);
> >         }
> >
> >         // <==== I added this test because an xref stream starting at
> >         // offset 0 can never happen, so we should simply skip it
> >         if (streamOffset > 0)
> >         {
> >             pdfSource.seek(streamOffset);
> >             skipSpaces();
> >             // <== this call ultimately throws a null ref exception if
> >             // streamOffset == 0 on entry
> >             parseXrefObjStream(prev, false);
> >         }
> >     }
> >
> > Adding that, the file successfully parses.
> >
> > Also, there was this proposal that I put up on github in a repo that I
> > directly forked from pdfbox (it is the only change).
> > It relaxes the looping a bit to allow limited recursion. I would appreciate
> > your thoughts on it.
> Is this change related to the discussed issue above?
>
> > https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
> >
> > Thank you so much! You have been tremendously helpful. I wish I could have
> > given you the files, but unfortunately, they are proprietary and we cannot
> > release them. :-(
> No need to worry, you are not the only one who is not allowed to share a
> specific pdf.
>
> > Best regards-
> > Steve
>
> BR
> Andreas Lehmkühler
>
> >
> > ________________________________________
> > From: Andreas Lehmkühler <[email protected]>
> > Sent: Monday, February 23, 2015 3:43 AM
> > To: [email protected]
> > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
> > (or variation of it still present)
> >
> > Hi,
> >
> > I've improved the self-repair mechanism of the trunk based on Steve's report.
> >
> > @Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
> > still persist?
> >
> > BR
> > Andreas Lehmkühler
> >
> > > Steve Antoch <[email protected]> wrote on 17 February 2015 at 00:05:
> > >
> > >
> > > Andreas-
> > > Thanks for the response.
> > > Sorry for sending directly.
> > >
> > > Yes, it tries to read from offset 112085940, but does not find the xrefstm
> > > there, so that's when it goes searching. It seems to be landing in the
> > > middle of something else (perhaps an image?)
> > >
> > > I tried running the preflight command on the file, and this is what it
> > > found there.
> > > This is in the middle of a whole series of repetitive byte patterns like
> > > these, which are interspersed with other sections of content that are also
> > > binary only.
> > >
> > > <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> > > <preflight name="file.pdf">
> > >   <executionTimeMS>2646</executionTimeMS>
> > >   <isValid type="">false</isValid>
> > >   <errors count="1">
> > >     <error count="1">
> > >       <code>1.0</code>
> > >       <details>Syntax error, Error: Expected a long type at offset 112085940, instead got
> > > '6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ³fÍ›6lÙ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'</details>
> > >     </error>
> > >   </errors>
> > > </preflight>
> > >
> > > The patterns seem to be:
> > >
> > > lots of these: 6lÙ³fÍ›
> > > interspersed between blocks that are similar to this:
> > > ±¯Ó"z·Cœ3Í}yó£g‚?1º·Óž-óVÏ:ë½NsËŽ¸6lÙ³fÅ#듨Î÷å.£=‰ù}ÕsÞÿ'
> > >
> > > It just so happens that the offset 112085940 falls right in the middle of
> > > a big block of those 6lÙ³fÍ› repetitive blocks.
> > >
> > > Not sure if that's any help.
> > >
> > > Steve
> > >
> > > ________________________________________
> > > From: Andreas Lehmkühler <[email protected]>
> > > Sent: Monday, February 16, 2015 3:34 AM
> > > To: [email protected]
> > > Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still
> > > present (or variation of it still present)
> > >
> > > Hi,
> > >
> > > > Steve Antoch <[email protected]> wrote on 13 February 2015 at 23:34:
> > > >
> > > >
> > > > Hi Tilman and Andreas--
> > > Please don't contact developers directly, use our mailing lists instead.
> > > I've put the users list back into the boat...
> > >
> > > > I am working with Krasimir on this issue.
> > > >
> > > > Although we asked, we were denied permission to send the document out.
> > > :-(
> > >
> > > > The failure is being triggered when we attempt to use the Encrypt()
> > > > class to password protect the pdf.
> > > > We end up with the "Expected a long type at offset 113884174, instead
> > > > got 'xref'" failure.
> > > >
> > > > I have debugged into the PDFBox code and found the offending parts.
> > > >
> > > > PdfBox is trying to parse an xref table located at 113884174.
> > > >
> > > > The problem we are seeing is that inside the trailer it finds the
> > > > /XRefStm label, and its offset value is returned as 112085940 (which is
> > > > what is given in the file).
> > > > However, the checkXRefOffset() call made to verify it doesn't find the
> > > > xref stream there, so it goes searching and ends up returning the closest
> > > > xref offset it can find, which happens to be its own offset at
> > > > 113884174.
> > > >
> > > >
> > > > I believe that there is an error in PdfBox with respect to this fixup
> > > > logic, even if it had found the 'correct' xref stream.
> > > > That is because the fixup offset can NEVER work. Every time it fixes up
> > > > the location, it lands on a section which begins with "xref".
> > > > The next call is to skip the whitespace, but since there is never any
> > > > there (it's already proven to be 'xref'), it does not advance the input
> > > > stream.
> > > > Then, the first call to parse that xrefstm always calls readObjectID(),
> > > > which will always throw the exception because the bytes are always 'xref'.
> > > >
> > > > So, my questions are:
> > > >
> > > > 1) Are these docs fixable or are they truly corrupt?
> > > Without having the pdf itself at hand it's hard to give a 100% answer.
> > > But I guess there has to be a fix, as adobe is able to open that pdf. I'll
> > > try to find one, following your description of the pdf
> > >
> > > > 2) Is this xref issue a known issue with PdfBox?
> > > > I would try to create a document that displays the error but I
> > > > honestly don't know how to do so (beyond sending the ones that we have
> > > > that DO display it).
> > > Not until now
> > >
> > > > 3) Do you have any idea how these documents end up in this state if they
> > > > are being edited by tools such as InDesign, Acrobat, etc? Is there
> > > > something I can do to identify them?
> > > There are a lot of more or less corrupt files in the wild. Those are
> > > created using different tools.
> > >
> > > > 4) If this is a truly corrupted document, why would Acrobat be able to
> > > > open these files but pdfBox cannot? Are these streams somehow ignorable?
> > > > I ask this because I saw this statement on a web page
> > > > (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
> > > > when I was searching for answers on this:
> > > Adobe implements a lot of self healing mechanisms to repair broken files
> > > and we try to do so too, but it's complicated.
> > >
> > > > – /XrefStm [integer]: specifies the offset from the beginning of the
> > > > file to the cross-reference stream in the decoded stream. This is only
> > > > present in hybrid-reference files, which is specified if we would also
> > > > like to open documents even if the applications don’t support compressed
> > > > reference streams.
> > > >
> > > > Any light you can shed on this is appreciated.
> > > >
> > > > Thanks-
> > > > Steve
> > > >
> > > >
> > > > See below for the pertinent data and the code which is marked with the
> > > > values as I traced through.
> > > >
> > > > I have confirmed that the byte offset of the word xref below is exactly
> > > > at 113884174.
> > >
> > > Does the xref stream start at 112085940 (stream offset from the trailer
> > > dictionary) or what did you find at that offset?
> > >
> > > >
> > > > xref
> > > > 0 53641
> > > > 0000000000 65535 f
> > > > 0000000017 00000 n
> > > >
> > > > <massive snip/>
> > > >
> > > >
> > > > trailer
> > > > <<
> > > > /Size 53641
> > > > /Root 1 0 R
> > > > /XRefStm 112085940
> > > > /Info 8 0 R
> > > > /ID [<19790A83488211E283B50017F203355C>
> > > > <E3DF7097A16969B08238787F19E7E219>]
> > > > >>
> > > > startxref
> > > > 113884174
> > > > %%EOF1 0 obj<</Outlines 2 0 R/Metadata 53641 0 R/AcroForm 4 0 R/Pages 5 0
> > > > R/StructTreeRoot 6 0 R/Type/Catalog/PageLabels 7 0 R>>
> > > > endobj
> > > >
> > > >
> > > > protected COSDictionary parseXref(long startXRefOffset) throws IOException
> > > > {
> > > >     pdfSource.seek(startXRefOffset);
> > > >     long startXrefOffset = parseStartXref();
> > > >     // check the startxref offset
> > > >     long fixedOffset = checkXRefOffset(startXrefOffset);
> > > >     if (fixedOffset > -1)
> > > >     {
> > > >         startXrefOffset = fixedOffset;
> > > >     }
> > > >     document.setStartXref(startXrefOffset);
> > > >     long prev = startXrefOffset;
> > > >     // ---- parse whole chain of xref tables/object streams using PREV reference
> > > >     while (prev > -1)    // <== prev here is 113884174.
> > > >     {
> > > >         // seek to xref table
> > > >         pdfSource.seek(prev);
> > > >
> > > >         // skip white spaces
> > > >         skipSpaces();
> > > >         // -- parse xref
> > > >         if (pdfSource.peek() == X)
> > > >         {
> > > >             // xref table and trailer
> > > >             // use existing parser to parse xref table
> > > >             parseXrefTable(prev);
> > > >             // parse the last trailer.
> > > >             trailerOffset = pdfSource.getOffset();
> > > >             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
> > > >             while (isLenient && pdfSource.peek() != 't')
> > > >             {
> > > >                 if (pdfSource.getOffset() == trailerOffset)
> > > >                 {
> > > >                     // warn only the first time
> > > >                     LOG.warn("Expected trailer object at position " + trailerOffset
> > > >                             + ", keep trying");
> > > >                 }
> > > >                 readLine();
> > > >             }
> > > >             if (!parseTrailer())
> > > >             {
> > > >                 throw new IOException("Expected trailer object at position: "
> > > >                         + pdfSource.getOffset());
> > > >             }
> > > >             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > > >             // check for a XRef stream, it may contain some object ids of
> > > >             // compressed objects
> > > >             if (trailer.containsKey(COSName.XREF_STM))    // <== YES - but falue
> > > >             {
> > > >                 // <== This returns 112085940, which is the value from the trailer
> > > >                 int streamOffset = trailer.getInt(COSName.XREF_STM);
> > > >                 // check the xref stream reference
> > > >                 // <== checks it and returns 113884174 instead
> > > >                 fixedOffset = checkXRefOffset(streamOffset);
> > > >                 if (fixedOffset > -1 && fixedOffset != streamOffset)
> > > >                 {
> > > >                     streamOffset = (int)fixedOffset;
> > > >                     trailer.setInt(COSName.XREF_STM, streamOffset);
> > > >                 }
> > > >                 pdfSource.seek(streamOffset);    // <== Seeks to 113884174
> > > >                 //readExpectedString(XREF_TABLE, false);
> > > >                 skipSpaces();    // <=== It's ON "xref", so it doesn't skip anything
> > > >                 // <== goes in here, first thing it tries to do is
> > > >                 // readObjectNumber(), which can't work because it's 'xref' -- BOOM
> > > >                 parseXrefObjStream(prev, false);
> > > >             }
> > > >             prev = trailer.getInt(COSName.PREV);
> > > >             if (prev > -1)
> > > >             {
> > > >                 // check the xref table reference
> > > >                 fixedOffset = checkXRefOffset(prev);
> > > >                 if (fixedOffset > -1 && fixedOffset != prev)
> > > >                 {
> > > >                     prev = fixedOffset;
> > > >                     trailer.setLong(COSName.PREV, prev);
> > > >                 }
> > > >             }
> > > >         }
> > > >         else
> > > >         {
> > > >             // parse xref stream
> > > >             prev = parseXrefObjStream(prev, true);
> > > >             if (prev > -1)
> > > >             {
> > > >                 // check the xref table reference
> > > >                 fixedOffset = checkXRefOffset(prev);
> > > >                 if (fixedOffset > -1 && fixedOffset != prev)
> > > >                 {
> > > >                     prev = fixedOffset;
> > > >                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
> > > >                     trailer.setLong(COSName.PREV, prev);
> > > >                 }
> > > >             }
> > > >         }
> > > >     }
> > > >     // ---- build valid xrefs out of the xref chain
> > > >     xrefTrailerResolver.setStartxref(startXrefOffset);
> > > >     COSDictionary trailer = xrefTrailerResolver.getTrailer();
> > > >     document.setTrailer(trailer);
> > > >     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
> > > >     // check the offsets of all referenced objects
> > > >     checkXrefOffsets();
> > > >     // copy xref table
> > > >     document.addXRefTable(xrefTrailerResolver.getXrefTable());
> > > >     return trailer;
> > > > }
> > > >
> > >
> > > BR
> > > Andreas Lehmkühler
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
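
The failure traced in this thread comes down to parsing an /XRefStm offset that cannot point at a cross-reference stream: either the repaired offset is 0, or it lands on the keyword xref of the classic table. A cross-reference stream is an indirect object, so the bytes at its offset should start with an object header such as "12 0 obj". The following is only a minimal sketch of that sanity check, not PDFBox code and not the fix that went into trunk. The class XRefStmOffsetCheck and all of its method names are hypothetical, and it assumes the whole file is available as a byte array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of PDFBox): sanity-checks an /XRefStm offset. */
final class XRefStmOffsetCheck {

    private XRefStmOffsetCheck() { }

    /**
     * Returns true only if the offset is positive and the bytes there look like
     * the start of an indirect object. Offset 0 is rejected because a file
     * starts with its header, so no xref stream can begin there.
     */
    static boolean shouldParseXRefStm(byte[] data, long streamOffset) {
        return streamOffset > 0 && looksLikeObjectHeader(data, streamOffset);
    }

    /** True if the offset lands on the keyword of a classic xref table. */
    static boolean startsWithXrefKeyword(byte[] data, long offset) {
        return offset >= 0 && offset < data.length
                && startsWith(data, (int) offset, "xref");
    }

    /** True if the bytes at offset match "<digits> <digits> obj". */
    static boolean looksLikeObjectHeader(byte[] data, long offset) {
        if (offset < 0 || offset >= data.length) {
            return false;
        }
        int start = (int) offset;
        int afterNumber = skipDigits(data, start);
        int generationStart = skipWhitespace(data, afterNumber);
        int afterGeneration = skipDigits(data, generationStart);
        int keywordStart = skipWhitespace(data, afterGeneration);
        // every part must be non-empty: number, gap, generation, gap, keyword
        return afterNumber > start
                && generationStart > afterNumber
                && afterGeneration > generationStart
                && keywordStart > afterGeneration
                && startsWith(data, keywordStart, "obj");
    }

    private static int skipDigits(byte[] data, int pos) {
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            pos++;
        }
        return pos;
    }

    private static int skipWhitespace(byte[] data, int pos) {
        while (pos < data.length && isPdfWhitespace(data[pos])) {
            pos++;
        }
        return pos;
    }

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE
    private static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] expected = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos < 0 || pos + expected.length > data.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[pos + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For the file described in the original report, a check of this kind would be expected to reject the repaired offset 113884174, because the bytes there are the keyword xref, so the /XRefStm entry would be skipped instead of being handed to the object-stream parser.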

