[ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-1585:
---------------------------------------

    Fix Version/s: 1.8.4

> org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1585
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader, Text extraction
>    Affects Versions: 1.8.1
>         Environment: Ubuntu Linux 10.04
>                      Solaris 10
>                      Java 1.6.0_34
>            Reporter: Sascha Szott
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.8.4, 2.0.0
>
>         Attachments: PDFBOX-1585.patch
>
>
> URL of the problematic PDF file is http://www.redalyc.org/pdf/540/54017220.pdf
> My program tries to extract the full text of the given PDF file in the
> following manner:
> {code}
> String fileName = "/home/sascha/testfile.pdf"; // 1
> PDDocument pdDoc = PDDocument.load(fileName, true); // 2
> PDFTextStripper text = new PDFTextStripper(); // 3
> String fullText = text.getText(pdDoc); // 4
> {code}
> The call in line 4 causes the thread to block indefinitely (it has now been
> running for more than two days without making any progress). The file is
> stored in a local file system (no network interaction occurs).
> jstack indicates that the thread is not deadlocked:
> {code}
> "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
>         at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>         at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
>         at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
> {code}
> Any idea or advice on how to fix that problem? Is it possible to set up a
> timeout for the extraction operation?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
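
On the timeout question raised in the report: the following is only a minimal sketch of one general-purpose approach, not something PDFBox itself provides. It runs a task on a worker thread and bounds the wait with Future.get(timeout, unit) from java.util.concurrent. The class name TimeoutRunner and the method callWithTimeout are hypothetical names chosen for illustration. The sketch avoids lambdas so that it should also compile on the Java 1.6 environment mentioned above. Be aware that a timeout only lets the caller move on: a thread that keeps running without checking its interrupt flag (which the RUNNABLE state in the jstack output suggests may be the case here) will not be stopped by cancel(true), which is why the worker is created as a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox.
final class TimeoutRunner {

    private TimeoutRunner() {
    }

    // Runs the task on a separate daemon thread and waits at most `timeout`.
    // Throws TimeoutException if the task has not finished in time.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "extraction-worker");
                // Daemon thread: a stuck worker will not keep the JVM alive.
                t.setDaemon(true);
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort only: this sets the interrupt flag, but code that
            // never checks it will simply keep running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    // Usage idea, assuming the pdDoc and text variables from the report
    // (both would have to be declared final to be captured here):
    //
    //   String fullText = TimeoutRunner.callWithTimeout(new Callable<String>() {
    //       public String call() throws Exception {
    //           return text.getText(pdDoc);
    //       }
    //   }, 5, TimeUnit.MINUTES);
}
```

Because the worker may keep consuming CPU after the timeout, running the extraction in a separate process that can be killed is a more robust alternative when many problematic files are expected.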