Gracefully handle corrupt PDFs
------------------------------

                 Key: PDFBOX-908
                 URL: https://issues.apache.org/jira/browse/PDFBOX-908
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.3.1
            Reporter: Martijn Brinkers


I will use PDFBox for text extraction, and one of the main requirements is that
it should extract as much text as possible. If a PDF document contains
something that isn't strictly correct according to the PDF specification, the
parser should try to recover gracefully and continue scanning where possible
when forceParsing is enabled. While testing against a large batch of PDF
documents (including large ebooks) I found that the parser sometimes stops
parsing and/or extracting text even with forceParsing enabled. I have attached
a patch that makes PDFBox handle some PDF problems more gracefully when
forceParsing is enabled.

Some of my patches handle certain situations differently from the existing
code. For example, the existing code for handling a missing endobj seems to be
very complex. In all of my tests it works better when the code simply assumes
that the endobj was missing. Whether that assumption or the existing approach
is the better way to cope with a missing endobj is of course debatable.

A patch is included to handle situations where the data (ID) of an inline
image contains the EI keyword. EI is now only accepted as the end of the image
if the character before it is an end-of-line marker rather than any whitespace
character.
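
The EI rule can be illustrated with a minimal sketch. This is not the code from
the attached patch; the class InlineImageEndFinder and the method
findEndOfImage are hypothetical names, and the sketch assumes the inline image
data is already available as a byte array.

```java
// Hypothetical helper for illustration only; not part of PDFBox or the patch.
final class InlineImageEndFinder {

    private InlineImageEndFinder() {
    }

    /**
     * Returns the index of the 'E' of the EI keyword that ends the inline
     * image, or -1 if none is found. "EI" only counts when the byte before
     * it is CR or LF, so an "EI" that merely follows a space or sits inside
     * the binary data is skipped.
     */
    static int findEndOfImage(byte[] data, int from) {
        // Start at index 1 at the earliest so data[i - 1] is always valid.
        for (int i = Math.max(from, 1); i + 1 < data.length; i++) {
            boolean isKeyword = data[i] == 'E' && data[i + 1] == 'I';
            boolean afterEol = data[i - 1] == '\n' || data[i - 1] == '\r';
            if (isKeyword && afterEol) {
                return i;
            }
        }
        return -1;
    }
}
```

With the bytes of "ID ab EIcd\nEI\nQ", the EI after the space is ignored and
the EI after the line feed is reported as the end of the image.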

I have added the method #isContinueOnError to PDFParser. By default it returns
forceParsing, but implementors can override it to stop parsing when a certain
limit is reached (for example, on a timeout). This can be helpful for stopping
the parser when it gets stuck in an infinite loop.
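
A timeout override might look roughly like the sketch below. SketchParser is a
stand-in written for this example, not the real PDFParser, and the no-argument
signature of isContinueOnError is an assumption; the method in the patch may
take parameters or have different visibility. DeadlineParser is likewise a
hypothetical name.

```java
import java.util.concurrent.TimeUnit;

// Stand-in for the parser; NOT the real PDFParser class.
class SketchParser {
    private final boolean forceParsing;

    SketchParser(boolean forceParsing) {
        this.forceParsing = forceParsing;
    }

    // Default described in the issue: continue only if forceParsing is set.
    // The signature is assumed for this sketch.
    public boolean isContinueOnError() {
        return forceParsing;
    }
}

// Hypothetical subclass that gives up recovering once a deadline has passed.
class DeadlineParser extends SketchParser {
    private final long deadlineNanos;

    DeadlineParser(boolean forceParsing, long timeoutMillis) {
        super(forceParsing);
        this.deadlineNanos =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    @Override
    public boolean isContinueOnError() {
        // Compare via subtraction so the check stays correct if nanoTime wraps.
        boolean beforeDeadline = System.nanoTime() - deadlineNanos < 0;
        return super.isContinueOnError() && beforeDeadline;
    }
}
```

Because the check runs each time the parser decides whether to recover from an
error, a parser that keeps hitting the same error stops once the deadline
passes instead of looping forever.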

BaseParser#readInt used to unread the data when a NumberFormatException was
thrown. With forceParsing enabled, this resulted in an infinite loop when
testing with test-integer-too-large.pdf (see attached file). I think it's
better not to unread data when an exception is about to be thrown, because
otherwise the risk of running into an infinite loop is higher.
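
The idea can be sketched as follows. This is a simplified illustration, not
the actual BaseParser#readInt; the class IntReaderSketch is a hypothetical
name and the sketch only handles unsigned runs of ASCII digits.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical illustration; not the PDFBox BaseParser implementation.
final class IntReaderSketch {

    private IntReaderSketch() {
    }

    /**
     * Reads a run of ASCII digits and parses it as an int. If parsing fails
     * (for example because the number is too large), the digits stay
     * consumed, so a caller that retries makes progress instead of reading
     * the same bad token again.
     */
    static int readInt(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // push back only the single delimiter byte
        }
        try {
            return Integer.parseInt(digits.toString());
        } catch (NumberFormatException e) {
            // Deliberately no unread of the digits before throwing.
            throw new IOException("Expected an integer, got '" + digits + "'", e);
        }
    }
}
```

For the input "99999999999 7", the first call throws, and the stream is then
positioned at the space, so the next integer read yields 7.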

The other patches are minor, such as checks for null values.

I have attached four test PDF documents. I corrupted these PDFs by hand to
replicate situations similar to those I found in existing (copyrighted) ebooks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
