[ https://issues.apache.org/jira/browse/PDFBOX-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025183#comment-14025183 ]
Tilman Hausherr commented on PDFBOX-908:
----------------------------------------

I improved the error messages (offset) in rev 1601374 for the 1.8 branch and rev 1601375 for the trunk.

> Gracefull handle corrupt PDFs
> -----------------------------
>
>                 Key: PDFBOX-908
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-908
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.3.1
>            Reporter: Martijn Brinkers
>         Attachments: PDFBOX-908.patch, test-corrupt-R.pdf, test-integer-too-large.pdf, test-obj-missing-bj.pdf, test-stream-missing-endobj.pdf
>
>
> I will use PDFBox for text extraction, and one of the main requirements is
> that it should extract as much text as possible. If the PDF document contains
> something that isn't strictly correct according to the PDF spec, the parser
> should try to recover gracefully and continue scanning where possible when
> forceParsing is enabled. While testing against a large batch of PDF documents
> (including large ebooks) I found that the parser sometimes stops parsing
> and/or extracting text even with forceParsing enabled. I have attached a
> patch to make PDFBox handle some PDF problems more gracefully when
> forceParsing is enabled.
>
> Some of my patches try to handle certain situations differently from the
> existing code. For example, the existing code that handles a missing endobj
> seems to be very complex. In all of my tests it worked better when the code
> simply assumed that the endobj was missing. Whether that assumption or the
> existing approach is better is of course debatable.
>
> A patch is included to handle situations where the data (ID) for an inline
> image contains the EI keyword. The EI is now only accepted if the character
> before EI is an end-of-line marker instead of any whitespace.
>
> I have added the method #isContinueOnError to PDFParser. By default it
> returns forceParsing, but implementors can override it to stop parsing when a
> certain limit is reached (for example, on a timeout). This can be helpful to
> stop parsing when the parser gets stuck in an infinite loop.
>
> BaseParser#readInt unread the data when a NumberFormatException was thrown.
> This resulted in an infinite loop with forceParsing enabled when testing
> with test-integer-too-large.pdf (see attached file). I think it's better not
> to unread data when an exception is about to be thrown, because the risk of
> running into an infinite loop is higher.
>
> The other patches are minor, such as checks for null values.
>
> I have attached four test PDF documents. I corrupted these PDFs by hand to
> replicate situations I found in existing (copyrighted) ebooks.
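
For illustration only, the inline-image rule from the issue description (accept EI only when the preceding character is an end-of-line marker) might look roughly like the sketch below. This is not the code from PDFBOX-908.patch and not a PDFBox API: the class InlineImageEndFinder and its methods are hypothetical names, and the sketch assumes the whole content stream is available as a byte array. It checks only the character before EI and ignores whatever follows it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class InlineImageEndFinder {

    private InlineImageEndFinder() {
    }

    /** Returns true for the two bytes treated here as end-of-line markers: CR and LF. */
    static boolean isEndOfLine(int b) {
        return b == '\r' || b == '\n';
    }

    /**
     * Returns the index of the 'E' of the first "EI" at or after {@code from}
     * that is directly preceded by an end-of-line byte, or -1 if there is none.
     * An "EI" preceded by a space or tab is treated as image data and skipped.
     */
    static int findEndOfInlineImage(byte[] data, int from) {
        // Start at index 1 at the earliest so that data[i - 1] is always valid.
        for (int i = Math.max(from, 1); i + 1 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I' && isEndOfLine(data[i - 1])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The first "EI" follows a space, so it is treated as part of the image data.
        byte[] data = "ID ab EI cd\nEI Q".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findEndOfInlineImage(data, 3)); // prints 12
    }
}
```

The stricter check trades a little tolerance (an EI preceded only by a space is no longer recognised) for fewer false terminations inside binary image data, which appears to be the trade-off the reporter is making.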