[ https://issues.apache.org/jira/browse/PDFBOX-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257197#comment-17257197 ]
Andreas Lehmkühler commented on PDFBOX-4973:
--------------------------------------------

Turn off the leniency and you will get the expected exception back. But be aware that a lot of other PDFs won't be parseable anymore.

Use {{false}} as the parameter for {{org.apache.pdfbox.pdfparser.PDFParser.parse(boolean)}} and the parser should follow the specs. I don't know whether Tika supports that option, so you might have to adjust the wrapping code yourself.

> Parsing truncated files no longer throws IOException: Error reading stream,
> expected='endstream' actual='' at offset ...
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4973
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4973
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.7, 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13,
> 2.0.14, 2.0.15, 2.0.16, 2.0.17, 2.0.18, 2.0.19, 2.0.20, 2.0.21
>         Environment: Ubuntu 16.04
>            Reporter: Plamen Penchev
>            Priority: Major
>         Attachments: truncated-with-eof.pdf, truncated.pdf
>
>
> h3. Issue:
> An exception is no longer thrown post-2.0.6 when a stream of a truncated PDF
> file is parsed.
> In 2.0.6 *COSParser's parseCOSStream* throws *"java.io.IOException: Error
> reading stream, expected='endstream' actual='' at offset ..."*, whereas in
> >= 2.0.7 the parsing is successful. Should an EOF marker be added to the
> truncated file, however, the expected exception is thrown once again.
> The code below is the minimum setup for reproducing the behavior (_in
> conjunction with the respective file attached_):
> {code:java}
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.pdf.PDFParserConfig;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
> import java.io.File;
> import java.io.IOException;
>
> public class Main {
>     public static void main(String[] args) {
>         File inputFile = new File("/path/to/parent/folder", "truncated.pdf");
>         try {
>             // metadata will be extracted by Tika
>             Metadata meta = new Metadata();
>             meta.set(Metadata.CONTENT_TYPE, "application/pdf");
>             BodyContentHandler ch = new BodyContentHandler(-1);
>             AutoDetectParser parser = new AutoDetectParser();
>             PDFParserConfig pdfParserConfig = new PDFParserConfig();
>             pdfParserConfig.setOcrStrategy("no_ocr");
>             pdfParserConfig.setMaxMainMemoryBytes(209715200);
>             ParseContext parseContext = new ParseContext();
>             parseContext.set(PDFParserConfig.class, pdfParserConfig);
>             try (TikaInputStream is = TikaInputStream.get(inputFile.toPath())) {
>                 // try to parse the document
>                 parser.parse(is, ch, meta, parseContext);
>             }
>         } catch (TikaException | SAXException | IOException ex) {
>             // expect to enter catch
>         } finally {
>             // instead catch is skipped
>         }
>     }
> }
> {code}
> The stack looks like this:
> ||parseCOSStream||COSParser||(pdfbox)||
> ||parseFileObject||COSParser||(pdfbox)||
> ||parseObjectDynamically||COSParser||(pdfbox)||
> ||parseDictObjects||COSParser||(pdfbox)||
> ||initialParse||PDFParser||(pdfbox)||
> ||parse||PDFParser||(pdfbox)||
> ||load||PDDocument||(pdfbox)||
> ||parse||PDFParser||(tika-parsers)||
> ||parse||CompositeParser||(tika-parsers)||
> In 2.0.6 the IOException thrown in parseCOSStream is caught in tika's
> CompositeParser parse method, and rethrown as TikaException, which we then
> expect internally and handle it in the sample code provided.
> h3. Why I believe this is a regression:
> [https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf]:
> In this specification Adobe describes the structure of PDF1.7, the basis for
> the ISO 32000 standard.
> Under the *(7) Syntax clause*, there is a *(7.5) File Structure* sub-clause
> which describes the valid pdf file structure.
> *This abstract is from sub-sub clause (7.5.5) File Trailer:*
> ------
> The _trailer_ of a PDF file enables a conforming reader to quickly find the
> cross-reference table and certain special objects. Conforming readers should
> read a PDF file from its end. +The last line of the file shall contain only
> the end-of-file marker, *%%EOF*.+ The two preceding lines shall contain, one
> per line and in order, the keyword *startxref* and the byte offset in the
> decoded stream from the beginning of the file to the beginning of the *xref*
> keyword in the last cross-reference section.
> ------
> Additionally the document in question cannot be previewed as it is considered
> broken by pdf previewers.
> h3. What introduced this change in parsing:
> I investigated and tested what introduced this change in behavior.
> The PDFBOX-3798 issue's resolution
> [https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?r1=1795704&r2=1795703&pathrev=1795704]
> is where the change in behavior stems from.
> I have tested rebuilding both 2.0.7 and 2.0.19 from their source code after
> reverting the change introduced by the commit above. This brings the behavior
> back to throwing "java.io.IOException: Error reading stream,
> expected='endstream' actual='' at offset ..." again.
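
For callers who cannot turn off leniency without losing other documents, one possible workaround is a pre-check based on the clause 7.5.5 requirement quoted above: reject a file whose last line is not the %%EOF marker before handing it to Tika. The sketch below is not part of PDFBox or Tika. The class name EofMarkerCheck, the method endsWithEofMarker and the 1024-byte tail window are assumptions made for illustration, and only JDK classes are used. It can only flag files that lack the marker, so a file like truncated-with-eof.pdf would still pass this check.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EofMarkerCheck {

    // Assumption: inspecting the last 1024 bytes is enough to find the marker.
    private static final int TAIL_BYTES = 1024;
    private static final String EOF_MARKER = "%%EOF";

    /** Returns true if the last non-blank line in the tail of the file is "%%EOF". */
    public static boolean endsWithEofMarker(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            int tailLength = (int) Math.min(TAIL_BYTES, length);
            byte[] tail = new byte[tailLength];
            raf.seek(length - tailLength);
            raf.readFully(tail);

            // ISO-8859-1 maps each byte to exactly one char, so binary data is safe.
            String text = new String(tail, StandardCharsets.ISO_8859_1);

            // Skip trailing whitespace and line terminators after the marker.
            int end = text.length();
            while (end > 0 && text.charAt(end - 1) <= ' ') {
                end--;
            }

            int start = end - EOF_MARKER.length();
            if (start < 0 || !text.startsWith(EOF_MARKER, start)) {
                return false;
            }

            // The marker must start its line (or start the inspected window).
            return start == 0
                    || text.charAt(start - 1) == '\n'
                    || text.charAt(start - 1) == '\r';
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EofMarkerCheck <file.pdf>");
            return;
        }
        Path file = Paths.get(args[0]);
        String verdict = endsWithEofMarker(file)
                ? " ends with %%EOF"
                : " has no trailing %%EOF marker";
        System.out.println(file + verdict);
    }
}
```

In the reproduction code above, such a check could run on inputFile.toPath() before parser.parse(...) and throw an IOException when it returns false, so the existing catch block is entered again for files without the marker.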