[ https://issues.apache.org/jira/browse/PDFBOX-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604947#comment-13604947 ]
Timo Boehme commented on PDFBOX-1541:
-------------------------------------

I'm still in favor of some kind of 'recovery mode' that could be activated when such an error occurs. In this mode the document would first be parsed sequentially for object starts, endobj, endstream etc., so that we get information about possible end offsets for objects (e.g. choose the last one before the next object starts). Furthermore, this would allow us to correct broken object tables. Such a big picture might be a better starting point for corrections than having only local information for guessing what the correct index will be.

I know I announced some time ago that I was willing to work on such a tool. Unfortunately there has been no time so far, and not enough pressure from the document collections we have to work with. It is still on my to-do list, but I cannot give you an estimate of when it will be done.

> expected='endstream' actual='' failure to parse
> -----------------------------------------------
>
> Key: PDFBOX-1541
> URL: https://issues.apache.org/jira/browse/PDFBOX-1541
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.7.1
> Environment: Ubuntu 12.04, JDK 1.7
> Reporter: Jinder Aujla
> Attachments: exporeal09_flyer_email3.pdf
>
>
> Following exception thrown when parsing attached PDF
> Caused by: java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2a789924
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:597)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:575)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
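
For illustration only, the sequential pass proposed in the comment could look roughly like the sketch below. It is not PDFBox code: the class RecoveryScan, the ObjectSpan type and the scan method are hypothetical names invented for this example. It reads the whole file as bytes, looks for "N G obj" headers, "endobj" and "endstream" keywords, and for each object keeps the last "endobj" seen before the next object starts. When an object has no "endobj" at all, the sketch assumes it ends where the next object begins. Plain keyword matching can hit false positives inside binary stream data, so a real recovery mode would need more validation than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a sequential "recovery" scan; not part of PDFBox.
class RecoveryScan {

    /** Offsets collected for one indirect object. */
    static final class ObjectSpan {
        final int number;
        final int generation;
        final int start;            // offset of the "N G obj" header
        int end = -1;               // offset just past the chosen end, -1 while unknown
        int lastEndstream = -1;     // offset of the last "endstream" inside the object, or -1
        boolean sawEndobj = false;  // false means 'end' is only a guess

        ObjectSpan(int number, int generation, int start) {
            this.number = number;
            this.generation = generation;
            this.start = start;
        }
    }

    // Group 1/2: object header; group 3: endobj; group 4: endstream.
    private static final Pattern MARKER = Pattern.compile(
            "(\\d{1,9})\\s+(\\d{1,5})\\s+obj\\b|(endobj)|(endstream)");

    static List<ObjectSpan> scan(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so char offsets equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<ObjectSpan> spans = new ArrayList<>();
        ObjectSpan current = null;

        Matcher m = MARKER.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // A new object starts; if the previous one never closed, end it here.
                if (current != null && !current.sawEndobj) {
                    current.end = m.start();
                }
                current = new ObjectSpan(Integer.parseInt(m.group(1)),
                        Integer.parseInt(m.group(2)), m.start());
                spans.add(current);
            } else if (current != null && m.group(3) != null) {
                // Overwrite on every hit so the last endobj before the next object wins.
                current.end = m.end();
                current.sawEndobj = true;
            } else if (current != null) {
                current.lastEndstream = m.start();
            }
        }
        if (current != null && !current.sawEndobj) {
            current.end = text.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        String sample = "1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
                + "2 0 obj\n<< >>\n"          // endobj is missing here
                + "3 0 obj\n(x)\nendobj\n";
        for (ObjectSpan s : scan(sample.getBytes(StandardCharsets.ISO_8859_1))) {
            System.out.println(s.number + " " + s.generation + " obj: " + s.start + ".." + s.end
                    + (s.sawEndobj ? "" : " (end guessed)"));
        }
    }
}
```

The resulting list of spans is the kind of "big picture" the comment refers to: it could be compared against the offsets in the cross-reference table to spot broken entries, or used to bound a stream whose /Length value is wrong.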