Re: [jira] [Updated] (PDFBOX-1585) org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely

Florian Over Mon, 05 Aug 2013 05:05:58 -0700

Hi,
This is really hitting us hard in production.
Is anyone working on this already?


Maybe we will try the timeout for now.
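
A minimal sketch of such a timeout, assuming the extraction call is wrapped in a Callable and run on a separate thread. The class name ExtractionTimeout and the method callWithTimeout are hypothetical names chosen for illustration; only java.util.concurrent is used, nothing from PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of PDFBox): runs a task on a worker thread
// and gives up waiting after the given timeout.
class ExtractionTimeout {

    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        // Daemon thread, so a worker that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "pdf-extraction");
                t.setDaemon(true);
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            // Blocks the caller for at most the given timeout.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption of the worker; this is only a request.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The call from the report would then go inside the Callable, for example one whose call() returns text.getText(pdDoc). Note that cancel(true) only interrupts the worker; a loop that never checks the interrupt flag may keep spinning in the background, so this releases the caller but does not necessarily free the CPU.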

Best regards
Florian Over


2013/7/3 Christian Kohlschütter (JIRA) <j...@apache.org>

>
>      [ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Christian Kohlschütter updated PDFBOX-1585:
> -------------------------------------------
>
>     Attachment: PDFBOX-1585.patch
>
> We had a similar problem; thanks for providing the problematic PDF.
>
> With the help of your stack trace, it was pretty easy to figure out that
> pdfbox was hanging in an endless loop when reading from an InputStream that
> reached its end (EOF).
>
> A patch is attached.
>
> PS: There are some other places in pdfbox that also might loop because
> InputStream#read() is not checked for -1 (EOF), but this here probably is
> the most important one.
>
> > org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
> > ------------------------------------------------------------------------------------
> >
> >                 Key: PDFBOX-1585
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
> >             Project: PDFBox
> >          Issue Type: Bug
> >          Components: PDFReader, Text extraction
> >    Affects Versions: 1.8.1
> >         Environment: Ubuntu Linux 10.04
> > Solaris 10
> > Java 1.6.0_34
> >            Reporter: Sascha Szott
> >         Attachments: PDFBOX-1585.patch
> >
> >
> > URL of the problematic pdf file is http://www.redalyc.org/pdf/540/54017220.pdf
> > My program tries to extract the fulltext of the given pdf file in the following manner:
> > {code}
> > String fileName = "/home/sascha/testfile.pdf";                  // 1
> > PDDocument pdDoc = PDDocument.load(fileName, true); // 2
> > PDFTextStripper text = new PDFTextStripper();             // 3
> > String fullText = text.getText(pdDoc);                               // 4
> > {code}
> > The call in line 4 causes the thread to block indefinitely (runs now for more than two days without making any progress). The file is stored in a local file system (no network interaction occurs).
> > jstack indicates that the thread is not deadlocked:
> > {code}
> > "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
> >    java.lang.Thread.State: RUNNABLE
> >         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >         - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
> >         at java.io.FilterInputStream.read(FilterInputStream.java:66)
> >         at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
> >         at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
> >         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
> >         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> >         at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
> >         at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
> >         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
> >         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> >         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> >         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> >         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
> >         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
> >         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
> >         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
> >         at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> >         at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
> > {code}
> > Any idea or advice on how to fix that problem? Is it possible to set up a timeout for the extraction operation?
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
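
The quoted JIRA comment attributes the hang to a read loop that never checks InputStream#read() for -1 (EOF). The sketch below illustrates that general pattern only; it is not the attached patch and not PDFBox's actual parser code. The class EofSafeHexReader and the method readUntilClosingBracket are hypothetical, and the sketch assumes a loop that collects characters until a closing '>' is seen.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration (not PDFBox code): read characters up to a
// closing '>' and fail fast if the stream ends first.
class EofSafeHexReader {

    static String readUntilClosingBracket(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int c = in.read();
            if (c == -1) {
                // Without this check, read() keeps returning -1 at EOF and a
                // loop that only waits for '>' would never terminate.
                throw new IOException("Unexpected end of stream before closing '>'");
            }
            if (c == '>') {
                return sb.toString();
            }
            sb.append((char) c);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream complete = new ByteArrayInputStream("48656C6C6F>".getBytes("US-ASCII"));
        System.out.println(readUntilClosingBracket(complete));
    }
}
```

Throwing on EOF is one possible design choice; returning the partial content read so far would be another, depending on how lenient the parser is meant to be with truncated files.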
