Sorry for cross-posting, but the tika-ml does not seem to be very lively.

I am trying to use the ForkParser. Unfortunately I get "Lost connection to a forked server process" for an (encrypted) PDF that I can extract in-process. Extracting the document in-process takes approx. 40 s (!). Extracting other (smaller) documents works fine with the ForkParser.
Memory should not be the problem:

    forkParser.setJavaCommand("java -Xmx2048m -Xdebug");

When I run the unit test with the ForkParser, the test stops after about 10 seconds. The console output looks like this:

...
SLF4J: Found binding in [tika-in-memory://localhost/3]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
07:28:01.909 [main] INFO o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted
07:28:02.239 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{706, 0}
07:28:02.239 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{707, 0}
07:28:02.239 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{708, 0}
...
07:28:02.249 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{752, 0}
07:28:02.249 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{753, 0}
07:28:02.249 [main] DEBUG o.a.p.p.PDFObjectStreamParser - parsed=COSObject{754, 0}
07:28:11.465 [main] ERROR ch.mysign.sky.indexing.IndexUtility - failed to extract text from input stream
org.apache.tika.exception.TikaException: Failed to communicate with a forked parser process. The process has most likely crashed due to some error like running out of memory. A new process will be started for the next parsing request.
    at org.apache.tika.fork.ForkParser.parse(ForkParser.java:142) ~[tika-core.jar:1.7]
    at ch.mysign.sky.indexing.IndexUtility.extractTextFrom(IndexUtility.java:158) [target/:na]
    at ch.mysign.sky.indexing.IndexUtility.extractTextFrom(IndexUtility.java:84) [target/:na]
    at ch.mysign.sky.indexing.IndexUtility.extractTextFrom(IndexUtility.java:70) [target/:na]
    at ch.mysign.sky.indexing.IndexUtilityTest.diesesPdfAuslesenDauertEwig(IndexUtilityTest.java:193) [target/:na]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_25]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_25]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_25]
    ...
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309) [selenium-server-standalone.jar:na]
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) [.cp/:na]
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) [.cp/:na]
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459) [.cp/:na]
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675) [.cp/:na]
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382) [.cp/:na]
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) [.cp/:na]
Caused by: java.io.IOException: Lost connection to a forked server process
    at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:191) ~[tika-core.jar:1.7]
    at org.apache.tika.fork.ForkClient.call(ForkClient.java:125) ~[tika-core.jar:1.7]
    at org.apache.tika.fork.ForkParser.parse(ForkParser.java:134) ~[tika-core.jar:1.7]
    ... 38 common frames omitted

Am I running into a timeout? What else can I investigate?
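
To put a number on the suspected timeout, the gaps can be read off the log timestamps quoted above. The following is only a small sketch of that arithmetic using java.time; the class name LogGap and the helper millisBetween are made up for illustration and are not part of Tika or of my test code.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper: computes the gap between two log timestamps
// copied from the console output in this post.
public class LogGap {

    // Milliseconds between two HH:mm:ss.SSS timestamps from the same day.
    static long millisBetween(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).toMillis();
    }

    public static void main(String[] args) {
        String firstLine = "07:28:01.909"; // "Document is encrypted"
        String lastDebug = "07:28:02.249"; // last PDFObjectStreamParser line
        String errorLine = "07:28:11.465"; // "failed to extract text ..."

        // Total time from the first PDFBox log line to the failure.
        System.out.println(millisBetween(firstLine, errorLine)); // prints 9556
        // Time during which nothing was logged before the connection was lost.
        System.out.println(millisBetween(lastDebug, errorLine)); // prints 9216
    }
}
```

So the log is silent for roughly 9.2 s before the error appears, which is why I suspect some fixed timeout rather than a memory problem.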