[jira] [Commented] (PDFBOX-1541) expected='endstream' actual='' failure to parse

Timo Boehme (JIRA) Mon, 18 Mar 2013 01:48:19 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604947#comment-13604947
 ]


Timo Boehme commented on PDFBOX-1541:
-------------------------------------

I'm still in favor of some kind of 'recovery mode' which could be activated if 
such an error occurs. In this mode the document would first be parsed 
sequentially for object starts, endobj, endstream etc. so that we get 
information about possible end offsets for objects (e.g. choose the last 
endstream before the next object starts). Furthermore, this would allow us to 
correct broken object tables. Such a big picture might be a better starting 
point for corrections than having only local information for guessing what the 
correct index will be.
I know I announced some time ago that I would be willing to work on such a 
tool. Unfortunately there has been no time so far, and not enough pressure from 
the document collections we have to work with. It is still on my to-do list, 
but I cannot give you an estimate of when it will be done.
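
To make the idea of the sequential scan a bit more concrete, here is a rough 
sketch in plain Java. It is not existing PDFBox code: the class RecoveryScan 
and the method lastEndstreamPerObject are hypothetical names, and the regular 
expression is a simplification that assumes object headers of the form 
"12 0 obj" and an 'endstream' keyword delimited by whitespace. It records, for 
every object start, the offset of the last 'endstream' seen before the next 
object starts; endobj handling and repair of the object table are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: a first pass over the raw bytes.
final class RecoveryScan {

    // Either an object header such as "12 0 obj" (group 1 is set) or the
    // keyword "endstream". Simplified on purpose; real files may need more care.
    private static final Pattern MARKER = Pattern.compile(
            "(?<![0-9])(\\d+)\\s+\\d+\\s+obj\\b|\\bendstream\\b");

    /**
     * Maps each object start offset to the offset of the last "endstream"
     * found before the next object start, or -1 if the object has none.
     */
    static Map<Integer, Integer> lastEndstreamPerObject(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so char indices
        // in the string are identical to byte offsets in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<Integer, Integer> result = new LinkedHashMap<>();
        int currentObj = -1;
        Matcher m = MARKER.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // New object header: start collecting candidates for it.
                currentObj = m.start();
                result.put(currentObj, -1);
            } else if (currentObj >= 0) {
                // An "endstream" candidate; later ones overwrite earlier ones,
                // so the last one before the next object start wins.
                result.put(currentObj, m.start());
            }
        }
        return result;
    }
}
```

With such a map the parser would no longer have to guess locally: on an 
'endstream' mismatch it could look up the candidate offset for the current 
object and cut the stream there.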
                
> expected='endstream' actual='' failure to parse
> -----------------------------------------------
>
>                 Key: PDFBOX-1541
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1541
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Ubuntu 12.04, JDK 1.7
>            Reporter: Jinder Aujla
>         Attachments: exporeal09_flyer_email3.pdf
>
>
> The following exception is thrown when parsing the attached PDF:
> Caused by: java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2a789924
>       at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:597)
>       at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:575)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
