Sorry, I'm not sure I understand precisely what's going on.

First, are you running tika-server, tika-app, tika-async, or running Tika
programmatically?  I'm guessing tika-server because you've containerized
it, but then again, I've containerized tika-async too, so that's not
conclusive. :D

If tika-server, are you sending requests in parallel to each container?  If
so, how many parallel requests are you allowing?

Are you able to share with me (privately) a specific example file that is
causing problems?

>where despite setting a watchdog to limit the heap to 3GB.
You're setting your own watchdog?  Or, is this tika-server's watchdog and
you've set -Xmx3g in <forkedJvmArgs>?
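
For context, tika-server's watchdog and the forked JVM's heap are configured in
tika-config.xml, roughly like this (a sketch from memory for 2.x; double-check
the element names against your actual config -- e.g. taskTimeoutMillis is my
recollection of the parse-timeout param):

  <properties>
    <server>
      <params>
        <!-- args passed to the forked JVM that does the actual parsing -->
        <forkedJvmArgs>
          <arg>-Xmx3g</arg>
        </forkedJvmArgs>
        <!-- restart the forked process if a single parse runs too long (ms) -->
        <taskTimeoutMillis>120000</taskTimeoutMillis>
      </params>
    </server>
  </properties>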

>1. The JVM is slow to observe the forked process exceeding its heap and
does not terminate it in time
Again, your own watchdog? If tika-server's watchdog... possibly?  I haven't
seen this behavior, but that doesn't mean it can't happen.

>2. It's not the heap that grows, but there is some stack overflow due to
very deep recursion.
Possible, but I don't think so... the default -Xss is pretty small (on the
order of 512KB-1MB per thread), so very deep recursion should hit a
StackOverflowError long before it accounts for gigabytes of growth.
Perhaps I misunderstand the suggestion?
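
One way to narrow it down: <forkedJvmArgs> just passes args through to the
forked JVM, so you could add a few standard JVM diagnostics (this is a sketch;
the flags are stock HotSpot options, not anything Tika-specific):

  <forkedJvmArgs>
    <arg>-Xmx3g</arg>
    <!-- make the per-thread stack limit explicit, to test hypothesis 2 -->
    <arg>-Xss1m</arg>
    <!-- if the heap really is what blows up, you'll get a dump to inspect -->
    <arg>-XX:+HeapDumpOnOutOfMemoryError</arg>
    <!-- then `jcmd <forked pid> VM.native_memory summary` shows non-heap usage -->
    <arg>-XX:NativeMemoryTracking=summary</arg>
  </forkedJvmArgs>

If the heap stays under 3GB while the container's RSS climbs toward 6GB, native
memory tracking should at least tell you whether the growth is metaspace,
threads, GC structures, or something else off-heap.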

>Finally, are there any file types that are known to use a lot of memory
with Tika?
A file from any of the major file formats can be, um, crafted to take up a
lot of memory.  My rule of thumb is to allow 2GB per thread (if running
multithreaded) or per concurrent request if you're allowing concurrent
requests to tika-server.  There will still be some files that cause Tika
to OOM if you're processing millions/billions of files from the wild.
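
To make that rule of thumb concrete (illustrative numbers, not measurements
from your setup): if you allow, say, 2 concurrent requests against one
tika-server, I'd size the forked JVM at roughly -Xmx4g, then budget another
0.5-1GB for its off-heap usage (metaspace, thread stacks, GC, direct buffers),
plus a few hundred MB for the parent tika-server JVM, since in the default
forked mode there are two JVMs inside the container.  That's one way a
container's RSS can sit noticeably above the -Xmx you set even when nothing
is actually leaking.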

To turn it around: are there specific file types that you're noticing are
causing the OOMs?

On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <cri...@cyberhaven.com>
wrote:

> Hi,
>
> I am seeing some cases with Tika 2.2.1 where despite setting a watchdog to
> limit the heap to 3GB, the entire Tika container exceeds 6GB and that
> exceeds the resource memory limit, so it gets OOM-ed. Here is one example:
>
> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB, shmem-rss:32kB,
> UID:0 pgtables:700kB oom_score_adj:-997
>
> Only some files seem to be causing this behavior.
>
> The memory ramps up fairly quickly, in a few tens of seconds it can go
> from 1GB to 6GB.
>
> The next step is to check if this goes away with 2.8.0, but I wonder if
> any of the following explanations make any sense:
> 1. The JVM is slow to observe the forked process exceeding its heap and
> does not terminate it in time
> 2. It's not the heap that grows, but there is some stack overflow due to
> very deep recursion.
>
> Finally, are there any file types that are known to use a lot of memory
> with Tika?
>
> Thanks,
> Cristi
>
>
