[jira] [Updated] (PDFBOX-1585) org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely

JIRA Wed, 03 Jul 2013 13:07:10 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Christian Kohlschütter updated PDFBOX-1585:
-------------------------------------------

    Attachment: PDFBOX-1585.patch

We had a similar problem; thanks for providing the problematic PDF.

With the help of your stack trace, it was easy to see that PDFBox was hanging 
in an endless loop while reading from an InputStream that had already reached 
its end (EOF).

A patch is attached.

PS: There are other places in PDFBox that might also loop because the return 
value of InputStream#read() is not checked for -1 (EOF), but this one is 
probably the most important.
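
For readers without the attachment, here is a minimal sketch of the general 
pattern. It is not the attached patch and not PDFBox code: the class 
HexStringReader and the method readHexString are hypothetical names, and the 
parsing is simplified. It only illustrates why a loop that waits for the 
closing '>' of a hex string (the stack trace below is inside 
BaseParser.parseCOSHexString) has to treat -1 from read() as a stop condition.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class HexStringReader {

    /** Collects the hex digits of a string up to (not including) the closing '>'. */
    static String readHexString(InputStream in) throws IOException {
        StringBuilder hex = new StringBuilder();
        while (true) {
            int c = in.read();
            if (c == -1) {
                // At EOF, read() keeps returning -1. Without this check the
                // loop would never see '>' and would spin forever.
                throw new IOException("Unexpected EOF inside hex string");
            }
            if (c == '>') {
                return hex.toString();
            }
            if (Character.digit(c, 16) >= 0) {
                hex.append((char) c);
            }
            // Anything else (whitespace, stray bytes) is skipped in this sketch.
        }
    }
}
```

Whether a truncated string should raise an exception or be returned as-is is 
a design choice; the sketch throws so that the caller notices the damaged 
input.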
                
> org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block 
> indefinitely
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1585
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader, Text extraction
>    Affects Versions: 1.8.1
>         Environment: Ubuntu Linux 10.04
> Solaris 10
> Java 1.6.0_34
>            Reporter: Sascha Szott
>         Attachments: PDFBOX-1585.patch
>
>
> The URL of the problematic PDF file is http://www.redalyc.org/pdf/540/54017220.pdf
> My program tries to extract the full text of the given PDF file in the 
> following manner:
> {code}
> String fileName = "/home/sascha/testfile.pdf";        // 1
> PDDocument pdDoc = PDDocument.load(fileName, true);   // 2
> PDFTextStripper text = new PDFTextStripper();         // 3
> String fullText = text.getText(pdDoc);                // 4
> {code}
> The call in line 4 causes the thread to block indefinitely (it has now been 
> running for more than two days without making any progress). The file is 
> stored on a local file system (no network interaction occurs).
> jstack indicates that the thread is not deadlocked:
> {code}
> "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
>         at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>         at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
>         at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
> {code}
> Any idea or advice on how to fix that problem? Is it possible to set up a 
> timeout for the extraction operation?
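
On the timeout question: nothing in this thread suggests that PDFBox offers a 
built-in timeout, so one possible workaround is to run the extraction on a 
separate thread and bound the wait with the standard java.util.concurrent 
API. The sketch below is an assumption-laden illustration, not a PDFBox 
feature: TimedCall and callWithTimeout are hypothetical names. Note that a 
thread spinning in a loop that never checks for interruption will keep 
running after the timeout; the worker is marked as a daemon so that it at 
least does not keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class TimedCall {

    /** Runs the task on a daemon worker thread and waits at most the given time. */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "timed-call");
                t.setDaemon(true); // a stuck worker must not block JVM exit
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption; a busy loop that ignores interrupts
            // will not actually stop.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the code from the report, the Callable would wrap the call from line 4, 
text.getText(pdDoc), and the caller would handle TimeoutException. This only 
limits how long the caller waits; fixing the missing EOF check is still what 
removes the endless loop.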

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
