[jira] [Commented] (PDFBOX-908) Gracefull handle corrupt PDFs

Tilman Hausherr (JIRA) Mon, 09 Jun 2014 06:31:25 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025183#comment-14025183
 ]


Tilman Hausherr commented on PDFBOX-908:
----------------------------------------

I improved the error messages (offset) in rev 1601374 for the 1.8 branch and 
rev 1601375 for the trunk.

> Gracefull handle corrupt PDFs
> -----------------------------
>
>                 Key: PDFBOX-908
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-908
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.3.1
>            Reporter: Martijn Brinkers
>         Attachments: PDFBOX-908.patch, test-corrupt-R.pdf, 
> test-integer-too-large.pdf, test-obj-missing-bj.pdf, 
> test-stream-missing-endobj.pdf
>
>
> I will use PDFBox for text extraction, and one of the main requirements is 
> that it should extract as much text as possible. If the PDF document contains 
> something that isn't strictly correct according to the PDF specs, it should 
> try to recover gracefully and continue scanning if possible when forceParsing 
> is enabled. While testing against a large batch of PDF documents (including 
> large ebooks) I found that the parser sometimes stops parsing and/or 
> extracting text even with forceParsing enabled. I have attached a patch to 
> make PDFBox handle some PDF problems more gracefully when forceParsing is 
> enabled.
> Some of my patches try to handle certain situations differently from the 
> existing code. For example, the existing code to handle cases when an endobj 
> is missing seems to be very complex. In all of my tests it seems to work 
> better when the code just assumes that the endobj was missing. Whether 
> simply assuming that endobj is missing is better than the existing way of 
> coping with this is of course debatable.
> A patch is included to handle situations where the data (ID) for an inline 
> image contains the EI keyword. The EI is now only accepted if the char before 
> EI is an end-of-line marker rather than any whitespace.
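
The EI rule described above can be illustrated with a minimal sketch. This
is not the code from PDFBOX-908.patch; the class and method names
(InlineImageEndScanner, findEndMarker) are hypothetical, and the sketch
assumes that "end-of-line marker" means a CR or LF byte directly before
the keyword.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: locates the EI keyword that
// ends inline image data, ignoring "EI" byte pairs inside the data.
class InlineImageEndScanner {

    // Assumption: an end-of-line marker is a CR or LF byte.
    static boolean isEol(int b) {
        return b == '\n' || b == '\r';
    }

    // Returns the index of the 'E' of the first "EI" at or after 'from'
    // that is directly preceded by an end-of-line byte, or -1 if none.
    static int findEndMarker(byte[] data, int from) {
        for (int i = Math.max(from, 1); i + 1 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I' && isEol(data[i - 1])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The first "EI" follows a space and is treated as image data;
        // the second follows a line feed and is accepted.
        byte[] data = "ab EI cd\nEI\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findEndMarker(data, 0));
    }
}
```

A stricter check could also look at the byte after EI, but the issue
description only mentions the byte before it.
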
> I have added the method #isContinueOnError to PDFParser. By default it 
> returns forceParsing, but implementors can override it to stop parsing when a 
> certain limit is reached (for example on a timeout). This can be helpful to 
> stop parsing when the parser gets stuck in an infinite loop.
> BaseParser#readInt unread the data when a NumberFormatException was thrown. 
> This resulted in an infinite loop when forceParsing was enabled when testing 
> with test-integer-too-large.pdf (see attached file). I think it's better to 
> not unread data when an exception will be thrown because the risk of running 
> into an infinite loop is higher.
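
To show why not unreading helps, here is a minimal sketch of an integer
reader over a PushbackInputStream. It is not BaseParser#readInt itself;
IntReaderSketch is a hypothetical class, and the sketch assumes integers
are separated by single spaces. Because the oversized digits stay
consumed, a caller that retries after the error moves on to the next
token instead of hitting the same bytes again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical reader, not PDFBox code.
class IntReaderSketch {

    static int readInt(PushbackInputStream in) throws IOException {
        int c;
        // Skip leading spaces.
        while ((c = in.read()) == ' ') {
            // nothing to do
        }
        StringBuilder digits = new StringBuilder();
        while (c >= '0' && c <= '9') {
            digits.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c);   // push back only the single terminating byte
        }
        try {
            return Integer.parseInt(digits.toString());
        } catch (NumberFormatException e) {
            // The digits are deliberately NOT pushed back here, so a
            // retrying caller does not re-read the same invalid number.
            throw new IOException("Invalid integer '" + digits + "'", e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "99999999999 42".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes));
        try {
            readInt(in);    // 99999999999 does not fit into an int
        } catch (IOException e) {
            System.out.println("skipped: " + e.getMessage());
        }
        System.out.println(readInt(in));    // next token is read normally
    }
}
```
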
> The other patches are just minor ones, such as checks for null values.
> I have attached four test PDF documents. These PDF documents are PDFs which I 
> corrupted by hand to try to replicate similar situations I found in existing 
> (copyrighted) ebooks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
