[jira] [Commented] (PDFBOX-1757) Errors parsing/extracting text from a PDF

Timo Boehme (JIRA) Mon, 04 Nov 2013 06:57:00 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812873#comment-13812873
 ]


Timo Boehme commented on PDFBOX-1757:
-------------------------------------

Ok, one more analysis:
http://digitalcorpora.org/corp/nps/files/govdocs1/367/367594.pdf
On page 7 (or 6 if counting from index 0) the page content is in a
Flate-encoded stream (object 20). The decompressed stream (when copied into an
editor) has gibberish content starting at line 678. It looks as if a
multi-threaded program created the content by writing from two threads at the
same time. Thus the parser tries to interpret text as an operator or gets wrong
arguments for real operators. You can see this in the logging output (e.g.
INFO: unsupported/disabled operation: bfantly).
The final exception is only the result of trying to read and interpret the
preceding gibberish content.
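
To repeat this kind of inspection without PDFBox, here is a minimal sketch
using only the JDK. It assumes the raw bytes of the Flate-encoded content
stream have already been cut out of the file. The class name FlateInspect, its
methods, and the simplified token scanner are hypothetical and not part of
PDFBox. The scanner does not handle inline image data (BI/ID/EI), so treat it
as a rough diagnostic aid only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateInspect {
    // Content stream operators as listed in the PDF specification.
    static final Set<String> OPERATORS = new HashSet<>(Arrays.asList(
        "b", "B", "b*", "B*", "BDC", "BI", "BMC", "BT", "BX", "c", "cm", "CS",
        "cs", "d", "d0", "d1", "Do", "DP", "EI", "EMC", "ET", "EX", "f", "F",
        "f*", "G", "g", "gs", "h", "i", "ID", "j", "J", "K", "k", "l", "m", "M",
        "MP", "n", "q", "Q", "re", "RG", "rg", "ri", "s", "S", "SC", "sc", "SCN",
        "scn", "sh", "T*", "Tc", "Td", "TD", "Tf", "Tj", "TJ", "TL", "Tm", "Tr",
        "Ts", "Tw", "Tz", "v", "w", "W", "W*", "y", "'", "\""));
    static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("true", "false", "null"));
    static final Pattern NUMBER = Pattern.compile("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    static final String DELIMITERS = "()<>[]{}/%";

    // Decompress zlib (FlateDecode) data; throws on corrupt input.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated input: keep what was decoded so far
            }
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    // Helper for the demo only: produce zlib data to feed into inflate().
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static boolean isBoundary(char c) {
        return Character.isWhitespace(c) || c == 0 || DELIMITERS.indexOf(c) >= 0;
    }

    // Return tokens in operator position that are not known operators.
    static List<String> unknownOperators(String content) {
        List<String> unknown = new ArrayList<>();
        int i = 0, n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c) || c == 0) {
                i++;
            } else if (c == '(') { // literal string, may nest and escape
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') i++;
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
            } else if (c == '<' && (i + 1 >= n || content.charAt(i + 1) != '<')) {
                while (i < n && content.charAt(i) != '>') i++; // hex string
                i++;
            } else if (c == '%') { // comment runs to end of line
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
            } else if (c == '/') { // name object
                i++;
                while (i < n && !isBoundary(content.charAt(i))) i++;
            } else if (DELIMITERS.indexOf(c) >= 0) {
                i++; // array and dictionary brackets carry no operator
            } else {
                int start = i;
                while (i < n && !isBoundary(content.charAt(i))) i++;
                String tok = content.substring(start, i);
                if (!NUMBER.matcher(tok).matches() && !OPERATORS.contains(tok)
                        && !KEYWORDS.contains(tok)) {
                    unknown.add(tok);
                }
            }
        }
        return unknown;
    }

    public static void main(String[] args) throws DataFormatException {
        String sample = "BT /F1 12 Tf (ok) Tj bfantly 3 Td ET";
        byte[] packed = deflate(sample.getBytes(StandardCharsets.ISO_8859_1));
        String text = new String(inflate(packed), StandardCharsets.ISO_8859_1);
        System.out.println(unknownOperators(text));
    }
}
```

In the sample string, "bfantly" (taken from the log message above) is the
only token reported, because it sits in operator position but is not a defined
operator.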

Some PDF viewers simply ignore this part - you can see that some content is
apparently missing on this page; others, like Firefox's built-in PDF viewer,
do not render the page at all. Thus, again, this document is broken.

In general: PDFBox (with loadNonSeq) tries to read the document according to
the specification and (in theory) it only throws an exception if the PDF is not
'well-formed'. While there will be errors in the code leading to exceptions
even on well-formed documents, the probability is high that the document is
broken and not the PDFBox parser.
Thus, while other PDF viewers might not even inform you about the broken
document, so that you won't notice the missing content (as in this case),
with PDFBox you can be quite sure to have a correct document if it can be
parsed without exceptions.

Referring to your last comment:
- please understand that at the current time I'm not willing to analyze more of
these files from the same source; I think I have shown that all files which
still have problems with current trunk and the nonSeq parser are broken; a more
relaxed parsing mode is planned for a new parser; maybe this document
collection would be a good base for testing that feature
- in case Acrobat asks you to save the file, you could do this and try to read
the saved file with PDFBox; if Acrobat regenerated the file completely (or at
least the problematic parts), it should be readable by PDFBox

> Errors parsing/extracting text from a PDF
> -----------------------------------------
>
>                 Key: PDFBOX-1757
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1757
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.2
>         Environment: Ubuntu Linux & Windows 7 (both JDK6)
>            Reporter: William Palmer
>            Assignee: Timo Boehme
>            Priority: Minor
>
> I am trying to extract text from PDFs.  Extracting text from the test file 
> http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf causes 
> exceptions to be thrown.
> The first:
> Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Value is not an integer: 636121514401477526485946144
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:187)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
> Caused by: java.io.IOException: Value is not an integer: 636121514401477526485946144
>       at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:104)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:351)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
> Code to cause above exception:
> PDFTextStripper ts = new PDFTextStripper();
> PrintWriter out = new PrintWriter(new FileWriter(new File ("020747.txt")));
> PDDocument doc = PDDocument.load(new File("020747.pdf").toURI().toURL(), 
> true);
> ts.setForceParsing(true);
> ts.writeText(doc, out);
> Using the following code causes a different exception until 
> org.apache.pdfbox.baseParser.pushBackSize is increased (only tested 1024768). 
>  After it is increased I get basically the same exception as above
> PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
> PDFParser parser = new PDFParser(new FileInputStream(new File("020747.pdf")));
> parser.parse();
> PDFTextStripper ts = new PDFTextStripper();
> ts.setForceParsing(true);
> ts.writeText(parser.getPDDocument(), out);
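
The "Value is not an integer" message quoted above involves a 27-digit token.
As a small JDK-only illustration, the sketch below checks whether such a token
fits into a 64-bit long. The class and method names are hypothetical, and how
PDFBox converts number tokens internally is an assumption here, not something
stated in this thread.

```java
import java.math.BigInteger;

class IntegerRangeCheck {
    // True if the decimal string can be parsed as a 64-bit signed integer.
    static boolean fitsInLong(String digits) {
        try {
            Long.parseLong(digits);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String token = "636121514401477526485946144"; // value from the exception
        System.out.println("fits in long: " + fitsInLong(token));
        // BigInteger has no fixed range, so it can hold the value for comparison.
        BigInteger value = new BigInteger(token);
        System.out.println("exceeds Long.MAX_VALUE: "
            + (value.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0));
    }
}
```

A token of this size cannot be a valid 64-bit integer operand, whatever the
parser does with it afterwards.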



--
This message was sent by Atlassian JIRA
(v6.1#6144)
