Tika 2.x should help with this through pipes and async. Your system should expect the server to go OOM or crash at some point if you're processing enough files.

Right, --spawnChild is not the default in 1.x, but it will be in 2.x. And, yes, you should be using it.

To set the Xmx in the forked process, add -J, as in -JXmx2g, which would set the Xmx for the forked process to 2g.

I don't have the experience to recommend bumping Xmx close to your container's max memory. In Java programs that do a lot of work off heap, this would be a bad idea because you need to leave resources for the OS, but I don't think we do much off heap.

Which file types are causing the OOMs? The MP4Parser is notorious, and we're looking to swap it out in 2.x for a different parser.

Yep, TIKA-3353 is the monitoring that Nick was mentioning.

On Fri, May 28, 2021 at 9:08 AM Cristian Zamfir <[email protected]> wrote:
>
> Thanks for your answer Nick!
>
> I am running apache/tika:latest-full which is using 1.25. Looks like I need
> at least version 1.26 for https://issues.apache.org/jira/browse/TIKA-3353,
> but I am not sure if this is not overkill for implementing basic liveness
> health checks.
>
> It's clear that --spawnChild and ForkParser are two must-haves that AFAIU are
> not default in apache/tika:latest-full
>
> My guess is that I also need to set the jvm heap size close to the memory
> resource limit for the container, but that's not ideal because the heap size
> would be statically configured while the memory resource limits are dynamic.
> Or maybe this is not necessary if I use -spawnChild?
>
> I am looking forward to your answers, thanks a lot!
>
> Cristi
>
>
> On Fri, May 28, 2021 at 2:55 PM Nick Burch <[email protected]> wrote:
>>
>> On Thu, 27 May 2021, Cristian Zamfir wrote:
>> > I am running some stress tests of the latest tika server docker (not
>> > modified in any way, just pulled from the registry) and seeing that after a
>> > few hours I see OOM in the logs. The container has a limit of 4GB set in
>> > K8S. I am wondering if you have any best practices on how to avoid this.
>>
>> Hopefully one of our Tika+Docker experts will be along in a minute to help
>> advise!
>>
>> For now, the general advice is documented at:
>> https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika
>>
>> Also, which version of Tika are you on? There have been some contributions
>> recently around monitoring the server, which you might want to upgrade
>> for, eg TIKA-3353
>>
>> Nick
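
P.S. On the point about a statically configured heap versus dynamic container limits: here is a rough sketch of how a launcher could derive the -JXmx value for the forked process from whatever limit the container currently has. Everything in it apart from --spawnChild and the -JXmx form is an assumption for illustration: the helper names, the jar path, and especially the 0.5 heap fraction, which is an arbitrary placeholder and not a recommendation.

```python
from typing import List


def child_heap_flag(container_limit_mb: int, heap_fraction: float = 0.5) -> str:
    """Build the -JXmx flag that sets Xmx for the forked Tika process.

    heap_fraction is a hypothetical knob (share of the container limit
    given to the child heap); 0.5 is a placeholder, not tested advice.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(container_limit_mb * heap_fraction)
    return f"-JXmx{heap_mb}m"


def tika_server_command(jar_path: str, container_limit_mb: int,
                        heap_fraction: float = 0.5) -> List[str]:
    """Assemble a 1.x tika-server command line that forks a child process."""
    return [
        "java", "-jar", jar_path,
        "--spawnChild",  # not the default in 1.x, so pass it explicitly
        child_heap_flag(container_limit_mb, heap_fraction),
    ]


if __name__ == "__main__":
    # Hypothetical jar name; 4096 MB matches the 4GB limit in the thread.
    print(" ".join(tika_server_command("tika-server.jar", 4096)))
```

The container entrypoint would pass in the current limit, so the heap follows the limit instead of being hard-coded. The remainder is left for the parent process, the JVM's own overhead, and the OS.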
