FlateFilter.java swallows Exceptions (should rethrow) -----------------------------------------------------
Key: PDFBOX-847 URL: https://issues.apache.org/jira/browse/PDFBOX-847 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 1.2.1 Reporter: Andreas Wollschlaeger I just re-discovered an issue in FlateFilter.java, which i mentioned quite a while ago on the mailinglist; and which was agreed to be an misfeature :-) In FlateFilter.java, at lines 115ff, we find this piece of code: try { // decoding not needed while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead,BUFFER_SIZE))) != -1) { result.write(buffer, 0, amountRead); } } catch (OutOfMemoryError exception) { // if the stream is corrupt an OutOfMemoryError may occur log.error("Stop reading corrupt stream"); } catch (ZipException exception) { // if the stream is corrupt an OutOfMemoryError may occur log.error("Stop reading corrupt stream"); } catch (EOFException exception) { // if the stream is corrupt an OutOfMemoryError may occur log.error("Stop reading corrupt stream"); } which means these Exceptions are discarded and not reported upstream to the caller. This is very infortunate, as the caller has no means to discover that text extraction is incomplete. I discovered this on troubleshooting Alfresco DMS, which uses PDFBox for indexing PDF documents - except an innocent log message, Alfresco does not know that conversion has failed. Proposed solution is to re-throw all 3 Exceptions and let the caller handle the exceptions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.