[ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christian Kohlschütter updated PDFBOX-1585:
-------------------------------------------
Attachment: PDFBOX-1585.patch
We had a similar problem; thanks for providing the problematic PDF.
With the help of your stack trace, it was pretty easy to figure out that
PDFBox was stuck in an endless loop, repeatedly reading from an InputStream
that had already reached its end (EOF).
A patch is attached.
PS: There are some other places in PDFBox that might also loop forever
because the result of InputStream#read() is not checked for -1 (EOF), but
this one is probably the most important.
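
To illustrate the kind of check described above, here is a minimal,
self-contained sketch. It is not the attached patch and not PDFBox code;
the class HexStringReader and the method readHexString are hypothetical
names used only for this example. It reads a hex string up to the closing
'>' and stops with an exception as soon as InputStream#read() returns -1,
instead of looping on EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class HexStringReader {

    /**
     * Hypothetical helper (not part of PDFBox): collects the characters of
     * a hex string up to, but not including, the closing '>'.
     */
    static String readHexString(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int c = in.read();
            if (c == -1) {
                // EOF before the closing '>': bail out instead of spinning.
                throw new IOException(
                        "Unexpected EOF in hex string after " + sb.length() + " characters");
            }
            if (c == '>') {
                return sb.toString();
            }
            if (!Character.isWhitespace(c)) {
                sb.append((char) c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Well-formed input: terminated by '>'.
        InputStream good = new ByteArrayInputStream(
                "48656C6C6F>".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readHexString(good));

        // Truncated input: the stream ends before '>' is seen.
        InputStream truncated = new ByteArrayInputStream(
                "4865".getBytes(StandardCharsets.US_ASCII));
        try {
            readHexString(truncated);
        } catch (IOException e) {
            System.out.println("Stopped: " + e.getMessage());
        }
    }
}
```

Without the c == -1 branch, the truncated input would keep the loop running
forever, because read() keeps returning -1 and '>' never arrives.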
org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
------------------------------------------------------------------------------------
Key: PDFBOX-1585
URL: https://issues.apache.org/jira/browse/PDFBOX-1585
Project: PDFBox
Issue Type: Bug
Components: PDFReader, Text extraction
Affects Versions: 1.8.1
Environment: Ubuntu Linux 10.04
Solaris 10
Java 1.6.0_34
Reporter: Sascha Szott
Attachments: PDFBOX-1585.patch
URL of the problematic pdf file is
http://www.redalyc.org/pdf/540/54017220.pdf
My program tries to extract the fulltext of the given pdf file in the
following manner:
{code}
String fileName = "/home/sascha/testfile.pdf"; // 1
PDDocument pdDoc = PDDocument.load(fileName, true); // 2
PDFTextStripper text = new PDFTextStripper(); // 3
String fullText = text.getText(pdDoc); // 4
{code}
The call in line 4 causes the thread to block indefinitely (it has now been
running for more than two days without making any progress). The file is
stored on a local file system, so no network interaction is involved.
jstack indicates that the thread is not deadlocked:
{code}
"main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
   java.lang.Thread.State: RUNNABLE
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:66)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
    at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
    at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
{code}
Any idea or advice on how to fix that problem? Is it possible to set up
a timeout for the extraction operation?
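
Regarding the timeout question, one generic option (a sketch under stated
assumptions, not a PDFBox feature) is to run the extraction on a worker
thread and wait for the result with Future#get(timeout, unit). The class
TimedExtraction and the method callWithTimeout below are hypothetical names;
in the reported program the Callable would presumably wrap the
text.getText(pdDoc) call. The example uses Thread.sleep as a stand-in for a
slow extraction so that it runs without PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /**
     * Hypothetical helper: runs the task on a daemon worker thread and
     * gives up waiting after the given timeout.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "extraction-worker");
                // Daemon thread: a stuck worker does not keep the JVM alive.
                t.setDaemon(true);
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption; this only helps if the task reacts to it.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> slowExtraction = new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(5000); // stand-in for a long-running getText() call
                return "full text";
            }
        };
        try {
            callWithTimeout(slowExtraction, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("Extraction timed out");
        }
    }
}
```

Note the limitation: the timeout only stops the caller from waiting. A worker
that spins in a loop without checking for interruption, as the RUNNABLE
thread in the trace above appears to do, keeps consuming CPU after
cancel(true). If the work has to be stopped reliably, running the extraction
in a separate process that can be killed is the more robust choice.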