Hi, this is really hitting us hard on production. Is anyone working on this already?
Maybe I will try the timeout for now.

Best regards
Florian Over

2013/7/3 Christian Kohlschütter (JIRA) <j...@apache.org>

>
>      [ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Christian Kohlschütter updated PDFBOX-1585:
> -------------------------------------------
>
>     Attachment: PDFBOX-1585.patch
>
> We had a similar problem; thanks for providing the problematic PDF.
>
> With the help of your stack trace, it was pretty easy to figure out that
> pdfbox was hanging in an endless loop when reading from an InputStream that
> reached its end (EOF).
>
> A patch is attached.
>
> PS: There are some other places in pdfbox that also might loop because
> InputStream#read() is not checked for -1 (EOF), but this here probably is
> the most important one.
>
>
> > org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
> > ------------------------------------------------------------------------------------
> >
> >                 Key: PDFBOX-1585
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
> >             Project: PDFBox
> >          Issue Type: Bug
> >          Components: PDFReader, Text extraction
> >    Affects Versions: 1.8.1
> >         Environment: Ubuntu Linux 10.04
> >                      Solaris 10
> >                      Java 1.6.0_34
> >            Reporter: Sascha Szott
> >         Attachments: PDFBOX-1585.patch
> >
> >
> > URL of the problematic pdf file is http://www.redalyc.org/pdf/540/54017220.pdf
> > My program tries to extract the fulltext of the given pdf file in the following manner:
> > {code}
> > String fileName = "/home/sascha/testfile.pdf"; // 1
> > PDDocument pdDoc = PDDocument.load(fileName, true); // 2
> > PDFTextStripper text = new PDFTextStripper(); // 3
> > String fullText = text.getText(pdDoc); // 4
> > {code}
> > The call in line 4 causes the thread to block indefinitely (runs now for
> > more than two days without making any progress). The file is stored in a
> > local file system (no network interaction occurs).
> > jstack indicates that the thread is not deadlocked:
> > {code}
> > "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
> >    java.lang.Thread.State: RUNNABLE
> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >     - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
> >     at java.io.FilterInputStream.read(FilterInputStream.java:66)
> >     at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
> >     at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
> >     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
> >     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> >     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
> >     at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
> >     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
> >     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> >     at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> >     at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> >     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
> >     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
> >     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
> >     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
> >     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
> >     at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
> > {code}
> > Any idea or advice on how to fix that problem? Is it possible to set up
> > a timeout for the extraction operation?
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
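
On the timeout question from the report: as far as I know PDFTextStripper has no timeout setting of its own, so a workaround has to live in the calling code. Below is a minimal sketch that only uses java.util.concurrent. The class ExtractionTimeout and the method runWithTimeout are hypothetical names made up for this mail, not part of PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox.
public class ExtractionTimeout {

    // Runs the task on a separate daemon thread and stops waiting for it
    // once the timeout has elapsed.
    public static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "extraction-worker");
                // Daemon thread: a worker stuck in a loop does not keep the JVM alive.
                t.setDaemon(true);
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            // Throws TimeoutException if the task has not finished in time.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Only requests interruption; a loop that never checks the
            // interrupt flag will keep running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The call in line 4 of the report would then be wrapped in a Callable that returns text.getText(pdDoc) and passed to runWithTimeout with, say, 60 and TimeUnit.SECONDS. One caveat: the caller gets control back, but the stack trace above shows a RUNNABLE thread spinning in a read loop, and such a thread most likely ignores the interrupt and keeps burning CPU until the JVM exits. So this only contains the problem; the attached patch is the actual fix.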