[jira] [Commented] (PDFBOX-4973) Parsing truncated files no longer throws IOException: Error reading stream, expected='endstream' actual='' at offset ...

Jira Fri, 01 Jan 2021 05:45:05 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257197#comment-17257197
 ]


Andreas Lehmkühler commented on PDFBOX-4973:
--------------------------------------------

Turn off the leniency and you will get the expected exception back. But be 
aware that a lot of other PDFs won't be parseable anymore. Use {{false}} as the 
parameter for {{org.apache.pdfbox.pdfparser.PDFParser.parse(boolean)}} and the 
parser should follow the spec. I don't know if Tika supports that option, so 
you might have to adjust the wrapping code yourself.


> Parsing truncated files no longer throws IOException: Error reading stream, 
> expected='endstream' actual='' at offset ...
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4973
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4973
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.7, 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13, 
> 2.0.14, 2.0.15, 2.0.16, 2.0.17, 2.0.18, 2.0.19, 2.0.20, 2.0.21
>         Environment: Ubuntu 16.04
>            Reporter: Plamen Penchev
>            Priority: Major
>         Attachments: truncated-with-eof.pdf, truncated.pdf
>
>
> h3. Issue:
> An exception is no longer thrown post-2.0.6 when a stream of a truncated PDF 
> file is parsed.
> In 2.0.6 *COSParser's parseCOSStream* throws *"java.io.IOException: Error 
> reading stream, expected='endstream' actual='' at offset ..."*, whereas in 
> versions >= 2.0.7 the parsing is successful. If an EOF marker is added to the 
> truncated file, however, the expected exception is thrown once again.
> The code below is the minimum setup for reproducing the behavior (_in 
> conjunction with the respective file attached_):
> {code:java}
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.pdf.PDFParserConfig;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
> import java.io.File;
> import java.io.IOException;
> public class Main {
>         public static void main(String[] args) {
>                 File inputFile = new File("/path/to/parent/folder", 
> "truncated.pdf");
>                 try {
>                         // metadata will be extracted by Tika
>                         Metadata meta = new Metadata();
>                         meta.set(Metadata.CONTENT_TYPE, "application/pdf");
>                         BodyContentHandler ch = new BodyContentHandler(-1);
>                         AutoDetectParser parser = new AutoDetectParser();
>                         PDFParserConfig pdfParserConfig = new 
> PDFParserConfig();
>                         pdfParserConfig.setOcrStrategy("no_ocr");
>                         pdfParserConfig.setMaxMainMemoryBytes(209715200);
>                         ParseContext parseContext = new ParseContext();
>                         parseContext.set(PDFParserConfig.class, 
> pdfParserConfig);
>                         try (TikaInputStream is = 
> TikaInputStream.get(inputFile.toPath())) {
>                                 // try to parse the document
>                                 parser.parse(is, ch, meta, parseContext);
>                         }
>                 } catch (TikaException | SAXException | IOException ex) {
>                         // expect to enter catch
>                 } finally {
>                         // instead catch is skipped
>                 }
>         }
> }
> {code}
> The stack looks like this:
> ||parseCOSStream||COSParser||(pdfbox)||
> ||parseFileObject||COSParser||(pdfbox)||
> ||parseObjectDynamically||COSParser||(pdfbox)||
> ||parseDictObjects||COSParser||(pdfbox)||
> ||initialParse||PDFParser||(pdfbox)||
> ||parse||PDFParser||(pdfbox)||
> ||load||PDDocument||(pdfbox)||
> ||parse||PDFParser||(tika-parsers)||
> ||parse||CompositeParser||(tika-parsers)||
> In 2.0.6 the IOException thrown in parseCOSStream is caught in Tika's 
> CompositeParser parse method and rethrown as a TikaException, which the 
> sample code above expects and handles in its catch block.
> h3. Why I believe this is a regression:
> [https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf]:
> In this specification Adobe describes the structure of PDF 1.7, the basis for 
> the ISO 32000 standard.
> Under the *(7) Syntax clause*, there is a *(7.5) File Structure* sub-clause 
> which describes the valid PDF file structure.
> *This excerpt is from sub-sub clause (7.5.5) File Trailer:*
> ------
>  The _trailer_ of a PDF file enables a conforming reader to quickly find the 
> cross-reference table and certain special objects. Conforming readers should 
> read a PDF file from its end. +The last line of the file shall contain only 
> the end-of-file marker, *%%EOF*.+ The two preceding lines shall contain, one 
> per line and in order, the keyword *startxref* and the byte offset in the 
> decoded stream from the beginning of the file to the beginning of the *xref* 
> keyword in the last cross-reference section.
>  ------
> Additionally, the document in question cannot be previewed, as PDF previewers 
> consider it broken.
> h3. What introduced this change in parsing:
> I investigated and tested what introduced this change in behavior.
> The PDFBOX-3798 issue's resolution 
> [https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?r1=1795704&r2=1795703&pathrev=1795704]
>  is where the change in behavior stems from.
> I have tested rebuilding both 2.0.7 and 2.0.19 from their source code after 
> reverting the change introduced by the commit above. This brings the behavior 
> back to throwing "java.io.IOException: Error reading stream, 
> expected='endstream' actual='' at offset ..." again.
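
The File Trailer rule quoted in the report (last line is %%EOF, preceded by startxref and a byte offset) can be checked without any PDF library. The following is a minimal sketch under stated assumptions: the class TrailerCheck and its method hasValidTrailer are hypothetical names, not part of PDFBox or Tika, and the check only looks at the tail of the file. It is not how PDFBox decides whether a file is broken, and it treats any Java whitespace as PDF whitespace, which is an approximation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox or Tika.
final class TrailerCheck {

    // Returns true if the bytes end with: startxref, a decimal offset, %%EOF
    // (ignoring trailing whitespace after the marker).
    static boolean hasValidTrailer(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so binary data is preserved.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        // Drop trailing whitespace such as the final end-of-line.
        int end = text.length();
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        String trimmed = text.substring(0, end);

        if (!trimmed.endsWith("%%EOF")) {
            return false; // no end-of-file marker on the last line
        }

        int startxref = trimmed.lastIndexOf("startxref");
        if (startxref < 0) {
            return false; // marker present, but no startxref before it
        }

        // Everything between "startxref" and "%%EOF" must be a byte offset.
        String offset = trimmed
                .substring(startxref + "startxref".length(),
                           trimmed.length() - "%%EOF".length())
                .trim();
        return !offset.isEmpty() && offset.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        byte[] complete = ("%PDF-1.7\n1 0 obj\n<<>>\nendobj\nxref\n0 1\n"
                + "0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
                + "startxref\n30\n%%EOF\n").getBytes(StandardCharsets.ISO_8859_1);
        byte[] truncated = "%PDF-1.7\n1 0 obj\n<< /Length 10 >>\nstream\nabc"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("complete:  " + hasValidTrailer(complete));
        System.out.println("truncated: " + hasValidTrailer(truncated));
    }
}
```

A caller that wants strict behavior while keeping the lenient parser could run a check like this before handing the file to Tika and raise its own exception when it fails. The offset value in the sample is a placeholder and is not validated against the actual position of the xref keyword.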



