Sorry for cross-posting, but the Tika mailing list does not seem to be very
active.
I am trying to use the ForkParser. Unfortunately I get "Lost connection to a
forked server process" for an (encrypted) PDF that I can extract in-process.
Extracting the document in-process takes approx. 40 s (!). Extracting other
(smaller) documents works fine with the ForkParser.
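To put a number on the in-process run, something like the following could be
used. It is only a sketch using plain java.util.concurrent: `TimedExtraction`,
`Outcome` and `runWithTimeout` are names made up for illustration, and the
sleeping task stands in for the real extraction call, which is not shown here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /** Result of one timed run: value (null on timeout), elapsed time, timeout flag. */
    public static final class Outcome<T> {
        public final T value;
        public final long elapsedMillis;
        public final boolean timedOut;

        Outcome(T value, long elapsedMillis, boolean timedOut) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
            this.timedOut = timedOut;
        }
    }

    /** Runs the task on a worker thread and waits at most timeoutMillis for it. */
    public static <T> Outcome<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        try {
            Future<T> future = executor.submit(task);
            T value = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Outcome<>(value, elapsedSince(start), false);
        } catch (TimeoutException e) {
            // Deadline reached before the task finished.
            return new Outcome<>(null, elapsedSince(start), true);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            // Interrupts the worker thread if the task is still running.
            executor.shutdownNow();
        }
    }

    private static long elapsedSince(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) {
        // Placeholder workload; the real extraction call would go here instead.
        Outcome<String> outcome = runWithTimeout(() -> {
            Thread.sleep(200);
            return "extracted text";
        }, 60_000);
        System.out.println("timedOut=" + outcome.timedOut
                + ", elapsedMillis=" + outcome.elapsedMillis);
    }
}
```

Running the real extraction through `runWithTimeout` with a generous limit
would give an in-process duration to compare against the roughly 10 seconds
after which the forked run fails.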
Memory should be no problem:
forkParser.setJavaCommand("java -Xmx2048m -Xdebug");
When I run the unit test with the ForkParser, the test stops after about
10 seconds. The console output looks like this:
...
SLF4J: Found binding in [tika-in-memory://localhost/3]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type
[ch.qos.logback.classic.util.ContextSelectorStaticBinder]
07:28:01.909 [main] INFO o.apache.pdfbox.pdfparser.PDFParser - Document is
encrypted
07:28:02.239 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{706,
0}
07:28:02.239 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{707,
0}
07:28:02.239 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{708,
0} ...
07:28:02.249 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{752,
0}
07:28:02.249 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{753,
0}
07:28:02.249 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{754,
0}
07:28:11.465 [main] ERROR ch.mysign.sky.indexing.IndexUtility - failed to
extract text from input stream
org.apache.tika.exception.TikaException: Failed to communicate with a forked
parser process. The process has most likely crashed due to some error like
running out of memory. A new process will be started for the next parsing
request.
at org.apache.tika.fork.ForkParser.parse(ForkParser.java:142)
~[tika-core.jar:1.7]
at
ch.mysign.sky.indexing.IndexUtility.extractTextFrom(IndexUtility.java:158)
[target/:na]
at
ch.mysign.sky.indexing.IndexUtility.extractTextFrom(IndexUtility.java:84)
[target/:na]
at
ch.mysign.sky.indexing.IndexUtility.extractTextFrom(IndexUtility.java:70)
[target/:na]
at
ch.mysign.sky.indexing.IndexUtilityTest.diesesPdfAuslesenDauertEwig(IndexUtilityTest.java:193)
[target/:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
~[na:1.8.0_25]
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
~[na:1.8.0_25]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[na:1.8.0_25] ...
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
[selenium-server-standalone.jar:na]
at
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
[.cp/:na]
at
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
[.cp/:na]
at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
[.cp/:na]
at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
[.cp/:na]
at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
[.cp/:na]
at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
[.cp/:na]
Caused by: java.io.IOException: Lost connection to a forked server process
at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:191)
~[tika-core.jar:1.7]
at org.apache.tika.fork.ForkClient.call(ForkClient.java:125)
~[tika-core.jar:1.7]
at org.apache.tika.fork.ForkParser.parse(ForkParser.java:134)
~[tika-core.jar:1.7]
... 38 common frames omitted
Am I running into a timeout? What else can I investigate?