Re: [jira] [Updated] (PDFBOX-1585) org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely

Andreas Lehmkuehler Mon, 05 Aug 2013 06:07:38 -0700

Hi,

Did you try applying Christian's patch?


On 05.08.2013 14:04, Florian Over wrote:

Hi,
This is really hitting us hard in production.
Is anyone working on this already?

Maybe I will try the timeout for now.

Best regards
Florian Over


2013/7/3 Christian Kohlschütter (JIRA) <[email protected]>


[ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kohlschütter updated PDFBOX-1585:
-------------------------------------------

     Attachment: PDFBOX-1585.patch

We had a similar problem; thanks for providing the problematic PDF.

With the help of your stack trace, it was pretty easy to figure out that
PDFBox was hanging in an endless loop when reading from an InputStream that
had reached its end (EOF).

A patch is attached.

PS: There are some other places in PDFBox that might also loop because
the return value of InputStream#read() is not checked for -1 (EOF), but this
one is probably the most important.
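
For illustration, here is a minimal sketch of the kind of EOF check described above. It is not the attached patch and not the actual BaseParser code; the class `EofSafeHexReader` and the method `readHexString` are hypothetical names, and the loop is a simplified stand-in for reading up to the closing `>` of a hex string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class EofSafeHexReader {

    /**
     * Reads characters up to (but not including) the closing '>'.
     * Hypothetical helper for illustration only.
     */
    public static String readHexString(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int c = in.read();
            if (c == -1) {
                // EOF before the closing '>': without this check, a loop that
                // only waits for '>' would keep reading -1 forever.
                throw new IOException("Unexpected EOF inside hex string");
            }
            if (c == '>') {
                return sb.toString();
            }
            sb.append((char) c);
        }
    }

    public static void main(String[] args) throws IOException {
        // Well-formed input: the loop ends at '>'.
        InputStream ok = new ByteArrayInputStream("48656C6C6F>".getBytes("US-ASCII"));
        System.out.println(readHexString(ok));

        // Truncated input: the loop ends with an exception instead of hanging.
        InputStream truncated = new ByteArrayInputStream("4865".getBytes("US-ASCII"));
        try {
            readHexString(truncated);
        } catch (IOException e) {
            System.out.println("Stopped: " + e.getMessage());
        }
    }
}
```

Whether a real parser should throw on a truncated string or return what it has read so far is a design choice; the sketch throws only to make the termination visible.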

org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
------------------------------------------------------------------------------------


                 Key: PDFBOX-1585
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
             Project: PDFBox
          Issue Type: Bug
          Components: PDFReader, Text extraction
    Affects Versions: 1.8.1
         Environment: Ubuntu Linux 10.04, Solaris 10, Java 1.6.0_34
            Reporter: Sascha Szott
         Attachments: PDFBOX-1585.patch


The URL of the problematic PDF file is
http://www.redalyc.org/pdf/540/54017220.pdf

My program tries to extract the full text of the given PDF file in the
following manner:

{code}
String fileName = "/home/sascha/testfile.pdf";       // 1
PDDocument pdDoc = PDDocument.load(fileName, true);  // 2
PDFTextStripper text = new PDFTextStripper();        // 3
String fullText = text.getText(pdDoc);               // 4
{code}
The call in line 4 causes the thread to block indefinitely (it has now been
running for more than two days without making any progress). The file is
stored in a local file system (no network interaction occurs).

jstack indicates that the thread is not deadlocked:
{code}
"main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
   java.lang.Thread.State: RUNNABLE
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
        at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
        at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
        at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
        at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
{code}
Any idea or advice on how to fix that problem? Is it possible to set up
a timeout for the extraction operation?
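
As a rough sketch of the timeout idea asked about above: the extraction can be run on a separate thread and the caller can stop waiting after a deadline. This uses only java.util.concurrent and is an assumption about how one might wrap the call, not a PDFBox feature; `ExtractionTimeout` and `runWithTimeout` are hypothetical names, and the sleeping task stands in for `text.getText(pdDoc)`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExtractionTimeout {

    /** Runs the task on a daemon thread; returns null if it does not finish in time. */
    public static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "pdf-extraction");
                t.setDaemon(true); // a stuck worker must not keep the JVM alive
                return t;
            }
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that never checks
            // the interrupt flag (such as a tight read loop) keeps running.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real extraction call, e.g. text.getText(pdDoc).
        Callable<String> slowTask = new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(60000);
                return "extracted text";
            }
        };
        String result = runWithTimeout(slowTask, 200, TimeUnit.MILLISECONDS);
        System.out.println(result == null ? "timed out" : result);
    }
}
```

Note the limitation: the caller regains control after the timeout, but a worker spinning in a loop like the one in the stack trace would most likely keep consuming CPU until the JVM exits, so this is a workaround rather than a fix.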

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


BR
Andreas Lehmkühler
