Hi Tim,
On Tue, Jul 25, 2023 at 8:28 PM Tim Allison <talli...@apache.org> wrote:

> Argh, sorry for my delay.
> Y, tika server is built on Apache CXF.  Threading per request happens at
> the CXF level, not the Tika level.
> With pipes, you can control how many spawned processes are available to
> serve your requests, but that does add some complexity -- learning curve to
> configure.
> > I am guessing there is a particular memory hungry file format and
> several of them are handled in parallel.
> That's one of the key design goals with pipes -- one file per process at a
> time.  With traditional tika-server or with the watchdog/forked stuff (the
> default in 2.x), if you're sending files in concurrently and one of them
> crashes the server, you don't know which file was the culprit.
> Out of the box, Tika server uses the watchdog set up and not pipes.

I would certainly want to transition to pipes; it looks like exactly what we
need. Are there any downsides to making it the default?
We are using the Docker image. Is it possible to pass the Docker image a
tika-config.xml file so that it defaults to pipes? If there is any
documentation related to this, I am interested in trying it out.
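
Until pipes is configured, one stopgap would be to cap in-flight requests on
the client side. Below is a minimal sketch, not anything Tika provides:
`BoundedSender`, `max_in_flight_for` and the `send` callable (whatever code
actually uploads one file to tika-server) are hypothetical names, and the
sizing uses the 2GB-per-concurrent-request rule of thumb quoted further down.

```python
import threading


class BoundedSender:
    """Caps how many calls to `send` may be active at once.

    `send` is a hypothetical callable that would upload one file to
    tika-server and return its response; it is an assumption, not a Tika API.
    """

    def __init__(self, send, max_in_flight):
        self._send = send
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, payload):
        # Blocks the calling thread while max_in_flight calls are active.
        with self._slots:
            return self._send(payload)


def max_in_flight_for(heap_gb, gb_per_request=2):
    """Concurrency implied by the rule of thumb, never less than one."""
    return max(1, heap_gb // gb_per_request)
```

With a 3GB heap, `max_in_flight_for(3)` gives 1, which matches the concern
about low efficiency; callers beyond the cap wait rather than being dropped.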

> >I guess there are no known SLAs from Tika’s watchdog kicking in, this is
> what I was asking. I don’t know how it is implemented.
> It is really simple.  The monitor/watchdog main process spawns a child
> process which is the actual server.  The watchdog does pulse checks of the
> spawned process and if it goes oom or crashes, the watchdog restarts it.
> The watchdog does no monitoring for memory consumption but relies on the
> -Xmx that's configured.

In that case it is puzzling how a Tika server configured with -Xmx3g can
reach >6GB. This sounds like a bug, and I will need to find a way to
reproduce it.
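
As a sanity check on the OOM-killer line from my first message (quoted at the
bottom of this thread), here is a minimal sketch that converts its kB fields
to GiB. The helper name `parse_oom_fields` is hypothetical, and the regex only
assumes the `name:<number>kB` shape visible in that one line.

```python
import re

# The line exactly as it appears in the original report.
OOM_LINE = (
    "total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB, "
    "shmem-rss:32kB, UID:0 pgtables:700kB oom_score_adj:-997"
)


def parse_oom_fields(line):
    """Return {field name: size in GiB} for every 'name:<n>kB' pair."""
    pairs = re.findall(r"([\w-]+):(\d+)kB", line)
    return {name: int(kb) / (1024 * 1024) for name, kb in pairs}


fields = parse_oom_fields(OOM_LINE)
for name, gib in fields.items():
    print(f"{name}: {gib:.3f} GiB")
```

Note that total-vm is virtual address space rather than resident memory, so
the rss fields are the ones to compare against the container limit and -Xmx.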


> On Tue, Jul 25, 2023 at 7:43 AM Cristian Zamfir <cri...@cyberhaven.com>
> wrote:
>> Hello,
>> On 21 Jul 2023 at 23:51:54, Cristian Zamfir <cri...@cyberhaven.com>
>> wrote:
>>> Hi Tim!
>>> Sorry for the lack of details, adding now.
>>> On 21 Jul 2023 at 18:56:02, Tim Allison <talli...@apache.org> wrote:
>>>> Sorry, I'm not sure I understand precisely what's going on.
>>>> First, are you running tika-server, tika-app, tika-async, or running
>>>> Tika programmatically?  I'm guessing tika-server because you've
>>>> containerized it, but I've containerized tika-async...so...? 😃
>>> Tika-server, the official docker image with a custom config - the
>>> config’s main changes are the -Xmx arg.
>>>> If tika-server, are you sending requests in parallel to each
>>>> container?  If in parallel, how many parallel requests are you allowing?
>>> Yes, sending requests in parallel without managing the number of
>>> requests in parallel - there is horizontal auto-scaling to deal with
>>> load, but the number of replicas is not based on the queue size, rather on
>>> CPU consumption. Is there a recommended concurrency level? I could use that
>>> instead for HPA. More on that below.
>>>>  Are you able to share with me (privately) an example specific file
>>>> that is causing problems?
>>> Unfortunately no. I do not have access to the files either: for
>>> security reasons we deliberately do not log them. That would have been
>>> the first thing I would have tried too.
>>>> >where despite setting a watchdog to limit the heap to 3GB.
>>>> You're setting your own watchdog?  Or, is this tika-server's watchdog
>>>> and you've set -Xmx3g in <forkedJvmArgs>?
>>> Using -Xmx3g.
>>>> >1. The JVM is slow to observe the forked process exceeding its heap
>>>> and does not terminate it in time
>>>> Again, your own watchdog? If tika-server's watchdog...possibly?  I
>>>> haven't seen this behavior, but it doesn't mean that it can't happen.
>>> I guess there are no known SLAs from Tika’s watchdog kicking in, this is
>>> what I was asking. I don’t know how it is implemented.
>>>> >2. It's not the heap that grows, but there is some stack overflow due
>>>> to very deep recursion.
>>>> Possible, but I don't think so... the default -Xss isn't very deep.
>>>> Perhaps I misunderstand the suggestion?
>>> I think we are on the same page - I was thinking what non-heap sources
>>> could account for the memory usage.
>>>> >Finally, are there any file types that are known to use a lot of
>>>> memory with Tika?
>>>> A file from any of the major file formats can be, um, crafted to take
>>>> up a lot of memory. My rule of thumb is to allow 2GB per thread (if running
>>>> multithreaded) or request if you're allowing concurrent requests of
>>>> tika-server.  There will still be some files that cause tika to OOM if
>>>> you're processing millions/billions of files from the wild.
>>> This happens quite often with regular files, not crafted inputs. I am
>>> guessing there is a particular memory hungry file format and several of
>>> them are handled in parallel.
>>> With 2GB per request and heap size of 3GB that would mean very few
>>> concurrent requests, so not great efficiency. Most of the time in my
>>> experience Tika can process lots of files in parallel with a 3GB heap.
>>> I noticed this message also appears quite often:
>>> org.apache.tika.utils.XMLReaderUtils Contention waiting for a SAXParser.
>>> Consider increasing the XMLReaderUtils.POOL_SIZE
>>> I am guessing this means the number of requests handled in parallel is
>>> exceeding a certain internal limit.
>>> Now that I understand this better, I have some followup questions:
>>>    1. Is there concurrency control I can configure, to limit the number
>>>    of incoming requests handled in parallel?
>> I looked at the tika server options and I did not see an option for
>> concurrency control.
>> Actually I found a reply from Nicholas to my older question where I
>> understood that Tika Pipes may be the answer
>> https://www.mail-archive.com/user@tika.apache.org/msg03535.html
>> The main question is if the latest Tika server implementation uses pipes
>> by default or is another solution recommended.
>> Thanks,
>> Cristi
>>>    2. Assuming the answer is “yes” above, will requests be queued when
>>>    the limit is reached? If they are dropped, is there a busy status reply
>>>    to the /tika API?
>>>    3. Is the queue size or the number of concurrently parsed files
>>>    exposed through an API?
>>>> To turn it around, are there specific file types that you are noticing
>>>> are causing OOM?
>>> I will have to look into obtaining analytics on the input, maybe that
>>> will shed more light.
>>> Thanks!
>>>> On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <cri...@cyberhaven.com>
>>>> wrote:
>>>>> Hi,
>>>>> I am seeing some cases with Tika 2.2.1 where despite setting a
>>>>> watchdog to limit the heap to 3GB, the entire Tika container exceeds 6GB
>>>>> and that exceeds the resource memory limit, so it gets OOM-ed. Here is one
>>>>> example:
>>>>> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB,
>>>>> shmem-rss:32kB, UID:0 pgtables:700kB oom_score_adj:-997
>>>>> Only some files seem to be causing this behavior.
>>>>> The memory ramps up fairly quickly: in a few tens of seconds it can go
>>>>> from 1GB to 6GB.
>>>>> The next step is to check if this goes away with 2.8.0, but I wonder
>>>>> if any of the following explanations make any sense:
>>>>> 1. The JVM is slow to observe the forked process exceeding its heap
>>>>> and does not terminate it in time
>>>>> 2. It's not the heap that grows, but there is some stack overflow due
>>>>> to very deep recursion.
>>>>> Finally, are there any file types that are known to use a lot of
>>>>> memory with Tika?
>>>>> Thanks,
>>>>> Cristi
