FWIW, just to let you know about a dead end I hit.

I'm a big fan of serverless containers (see TIKA-4529), but I decided to go
further and use the S3 fetcher and S3 emitter, which led me to TikaAsyncCLI.
I've put it into Docker with tesseract, etc.
In the end it pulls a 4 MB PDF from S3 and spins up the Tika server JVM
(PipesServer), which launches those binary tools to check their availability
and then just dies:

 org.apache.tika.pipes.PipesClient pipesClientId=0: commandline: [java,
-cp,
/tika-emitter-s3.jar:/tika-fetcher-s3.jar:/tika-pipes-iterator-s3.jar:/tika-app.jar,
-Djava.awt.headless=true, -DpipesClientId=0,
-Dlog4j.configurationFile=file:///log4j2.xml, -XX:+UseContainerSupport,
-XX:MaxRAMPercentage=15, -XX:InitialRAMPercentage=15,
org.apache.tika.pipes.PipesServer, /tmp/tika-config.xml, 100000, 300000,
1500000]

.PipesClient pipesClientId=0: From forked process before start byte: DEBUG
[main] 16:25:14,240 org.apache.tika.pipes.PipesServer processing requests
 org.apache.tika.parser.ocr.TesseractOCRParser hasTesseract (path:
[tesseract]): true
s.PipesServer timer -- initialize parser and other resources: 939 ms
DEBUG [main] 16:25:15,180 org.apache.tika.pipes.PipesServer pipes server
initialized

TRACE [pool-4-thread-1] 16:25:15,206 org.apache.tika.pipes.PipesClient
pipesClientId=0: timer -- write tuple: 24 ms
ERROR [pool-3-thread-2] 16:25:15,239 org.apache.tika.pipes.PipesClient
pipesClientId=0: execution exception
java.util.concurrent.ExecutionException: java.io.IOException: problem
reading response from server: 54

Caused by: java.lang.IllegalArgumentException: byte with index 83 must be <
17
        at
org.apache.tika.pipes.PipesServer$STATUS.lookup(PipesServer.java:123)
        at
org.apache.tika.pipes.PipesClient.readResults(PipesClient.java:291)
        ... 5 more
TRACE [pool-3-thread-6] 16:25:15,332
org.apache.tika.pipes.async.AsyncEmitter Nothing on the async queue
DEBUG [pool-3-thread-6] 16:25:15,332
org.apache.tika.pipes.async.AsyncEmitter cache size: (0) bytes and extract
count: 0
WARN  [pool-3-thread-2] 16:25:15,458 org.apache.tika.pipes.PipesClient
pipesClientId=0 crash: path/to/4mb.pdf in 59 ms with exit code 137
TRACE [pool-3-thread-2] 16:25:15,458
org.apache.tika.pipes.async.AsyncProcessor timer -- pipes client process:
1646 ms

The only clue I have is "exit code 137". That is 128 + 9, i.e. the process
was killed with SIGKILL, which usually implies OOM, but I can't see any other
evidence: no counters, no logs, nothing.
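
For anyone decoding such exit codes, here is a minimal sketch of the 128 + signal convention. ExitCodeDecoder is a hypothetical helper, not part of Tika, and the signal table is only a small illustrative subset of the Linux numbers. It cannot tell who sent the signal, so an OOM kill remains a guess.

```java
import java.util.Map;

// Hypothetical helper (not a Tika class): interprets a child process exit
// code using the common convention that 128 + N means "killed by signal N".
final class ExitCodeDecoder {

    // Illustrative subset of Linux signal numbers, not a complete table.
    private static final Map<Integer, String> SIGNALS = Map.of(
            1, "SIGHUP",
            2, "SIGINT",
            6, "SIGABRT",
            9, "SIGKILL",
            11, "SIGSEGV",
            15, "SIGTERM");

    static String describe(int exitCode) {
        if (exitCode > 128 && exitCode < 256) {
            int signal = exitCode - 128;
            return "killed by " + SIGNALS.getOrDefault(signal, "signal " + signal);
        }
        return exitCode == 0 ? "clean exit" : "exited with error code " + exitCode;
    }

    public static void main(String[] args) {
        // 137 is the code reported in the PipesClient log line above.
        System.out.println(describe(137));
    }
}
```

SIGKILL cannot be caught, so the killed JVM has no chance to log anything itself, which would be consistent with the missing evidence.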

I think we can count it as a bug that the server crash isn't propagated as a
failure of TikaAsyncCLI; it still reports success:

DEBUG [pool-3-thread-6] 16:25:15,813
org.apache.tika.pipes.async.AsyncEmitter emitted: 0 files
DEBUG [pool-3-thread-1] 16:25:15,820
org.apache.tika.pipes.async.AsyncProcessor emitter thread finished, total 1
INFO  [main] 16:25:16,313 org.apache.tika.async.cli.TikaAsyncCLI
Successfully finished processing 1 files in 3001 ms
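
Until the crash is propagated, one possible workaround is to scan the log for the crash line and fail the container step yourself. The sketch below is an assumption on my side, not a Tika feature: CrashLogScanner is a hypothetical class, and the regex is derived only from the single WARN line above, so it may not match other Tika versions or log layouts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical workaround (not part of Tika): counts PipesClient crash lines
// in a log so that a wrapper script can turn them into a non-zero exit code.
final class CrashLogScanner {

    // Pattern guessed from the one WARN line in this mail:
    // "... crash: path/to/4mb.pdf in 59 ms with exit code 137"
    private static final Pattern CRASH =
            Pattern.compile("crash: (.+) in (\\d+) ms with exit code (\\d+)");

    static long countCrashes(List<String> logLines) {
        return logLines.stream()
                .filter(line -> CRASH.matcher(line).find())
                .count();
    }

    public static void main(String[] args) throws IOException {
        // Reads the log from stdin, one log event per line.
        List<String> lines;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            lines = in.lines().collect(Collectors.toList());
        }
        long crashes = countCrashes(lines);
        System.err.println("crashed documents: " + crashes);
        System.exit(crashes > 0 ? 1 : 0);
    }
}
```

It expects each log event on a single line; the lines quoted in this mail are wrapped by the mail client and would need to be unwrapped first.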

I've tweaked the settings a little (memory size, etc.), but it didn't help.
The same configuration works fine on a Linux host without a container.
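
In case someone wants to compare the two environments, a small check like the one below could be run on the host and inside the container with the same -XX flags as in the forked command line above. HeapCheck is a hypothetical name, and the check only shows what the JVM believes its limits are; it does not prove where the memory actually goes.

```java
// Hypothetical diagnostic (not part of Tika): prints the heap ceiling and CPU
// count the JVM derives from its environment, e.g. from container limits.
final class HeapCheck {

    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx or -XX:MaxRAMPercentage after ergonomics.
        System.out.println("max heap (MiB): " + toMiB(rt.maxMemory()));
        System.out.println("available processors: " + rt.availableProcessors());
    }
}
```

For example: java -XX:+UseContainerSupport -XX:MaxRAMPercentage=15 HeapCheck.java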

So I gave up and went back to the tika-app CLI. FYI.
-- 
Sincerely yours
Mikhail Khludnev
