Hi Tim,

Please let me know if you can refer me to any documentation on how to
enable pipes for the Tika server Docker image.
I noticed that Nicholas mentioned pipes before, so I have taken the liberty
of CCing him on this thread - sorry for the noise and thanks a lot for your
help!

Best regards,
Cristi

On Wed, Jul 26, 2023 at 2:58 PM Cristian Zamfir <cri...@cyberhaven.com>
wrote:

> Hi Tim,
> On Tue, Jul 25, 2023 at 8:28 PM Tim Allison <talli...@apache.org> wrote:
>
>> Argh, sorry for my delay.
>>
>> Yes, tika-server is built on Apache CXF.  Threading per request happens at
>> the CXF level, not the Tika level.
>>
>> With pipes, you can control how many spawned processes are available to
>> serve your requests, but that does add some complexity -- there is a
>> learning curve to configure it.
>>
>> > I am guessing there is a particular memory hungry file format and
>> several of them are handled in parallel.
>> That's one of the key design goals with pipes -- one file per process at
>> a time.  With traditional tika-server or with the watchdog/forked stuff
>> (the default in 2.x), if you're sending files concurrently and one of
>> them crashes the server, you don't know which file was the culprit.
>>
>> Out of the box, tika-server uses the watchdog setup, not pipes.
>>
>
> I would certainly want to transition to pipes - it looks like exactly what
> we need. Actually, are there any downsides to making that the default?
> We are using the Docker image - is it possible to pass it a tika-config.xml
> file that defaults to pipes? If there is any documentation related to this,
> I am interested in trying it out.
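>
> My (unverified) assumption is that it would be something along the lines of
> mounting the config into the container and pointing tika-server at it:
>
>   docker run -d -p 9998:9998 \
>     -v "$(pwd)/tika-config.xml:/tika-config.xml" \
>     apache/tika:latest-full -c /tika-config.xml
>
> but I have not checked whether the image forwards extra arguments to
> tika-server, so please correct me if that is not how it works.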
>
>
>> > I guess there are no known SLAs on how quickly Tika's watchdog kicks in -
>> that is what I was asking. I don't know how it is implemented.
>> It is really simple.  The monitor/watchdog main process spawns a child
>> process, which is the actual server.  The watchdog does pulse checks on the
>> spawned process, and if it OOMs or crashes, the watchdog restarts it.
>> The watchdog does not monitor memory consumption itself; it relies on the
>> -Xmx that's configured.
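>>
>> Very roughly (hand-waving here - this is not the actual code, and the
>> child command line is purely illustrative), it boils down to:
>>
>>   import java.util.concurrent.TimeUnit;
>>
>>   public class WatchdogSketch {
>>       public static void main(String[] args) throws Exception {
>>           while (true) {
>>               // spawn the child JVM that is the actual server
>>               Process child = new ProcessBuilder(
>>                       "java", "-Xmx3g", "-jar",
>>                       "tika-server-standard.jar", "-noFork")
>>                       .inheritIO()
>>                       .start();
>>               // "pulse check" -- in this sketch we simply block until the
>>               // child dies (OOM, crash, ...); the real watchdog polls
>>               child.waitFor();
>>               // ...then restart it after a short pause
>>               TimeUnit.SECONDS.sleep(1);
>>           }
>>       }
>>   }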
>>
>
> In this case it is puzzling how a Tika server configured with -Xmx3g can
> reach >6GB. That sounds like a bug, and I will need to find a way to
> reproduce it.
>
> Thanks,
> Cristi
>
>
>
>>
>> On Tue, Jul 25, 2023 at 7:43 AM Cristian Zamfir <cri...@cyberhaven.com>
>> wrote:
>>
>>> Hello,
>>>
>>> On 21 Jul 2023 at 23:51:54, Cristian Zamfir <cri...@cyberhaven.com>
>>> wrote:
>>>
>>>> Hi Tim!
>>>>
>>>> Sorry for the lack of details, adding now.
>>>>
>>>> On 21 Jul 2023 at 18:56:02, Tim Allison <talli...@apache.org> wrote:
>>>>
>>>>> Sorry, I'm not sure I understand precisely what's going on.
>>>>>
>>>>> First, are you running tika-server, tika-app, tika-async, or running
>>>>> Tika programmatically?  I'm guessing tika-server because you've
>>>>> containerized it, but I've containerized tika-async...so...? 😃
>>>>>
>>>>
>>>> Tika-server, the official Docker image with a custom config - the main
>>>> change in the config is the -Xmx arg.
>>>>
>>>>
>>>>> If tika-server, are you sending requests in parallel to each
>>>>> container?  If in parallel, how many parallel requests are you allowing?
>>>>>
>>>>
>>>> Yes, we send requests in parallel without limiting the number of
>>>> concurrent requests - there is horizontal auto-scaling to deal with load,
>>>> but the number of replicas is based on CPU consumption rather than on the
>>>> queue size. Is there a recommended concurrency level? I could use that
>>>> instead for the HPA. More on that below.
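>>>>
>>>> For context, the autoscaling is roughly the following (names and numbers
>>>> are illustrative, just to make the setup concrete):
>>>>
>>>>   apiVersion: autoscaling/v2
>>>>   kind: HorizontalPodAutoscaler
>>>>   metadata:
>>>>     name: tika
>>>>   spec:
>>>>     scaleTargetRef:
>>>>       apiVersion: apps/v1
>>>>       kind: Deployment
>>>>       name: tika
>>>>     minReplicas: 2
>>>>     maxReplicas: 20
>>>>     metrics:
>>>>       - type: Resource
>>>>         resource:
>>>>           name: cpu
>>>>           target:
>>>>             type: Utilization
>>>>             averageUtilization: 70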
>>>>
>>>>
>>>>>  Are you able to share with me (privately) a specific example file
>>>>> that is causing problems?
>>>>>
>>>>
>>>> Unfortunately no - for security reasons I do not have access to the files
>>>> either; we deliberately do not log them. That would have been the first
>>>> thing I would have tried too.
>>>>
>>>>
>>>>> >where despite setting a watchdog to limit the heap to 3GB.
>>>>> You're setting your own watchdog?  Or, is this tika-server's watchdog
>>>>> and you've set -Xmx3g in <forkedJvmArgs>?
>>>>>
>>>>
>>>> Using -Xmx3g.
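>>>>
>>>> Concretely, the relevant part of our tika-config.xml is along these lines
>>>> (simplified, and I may be misremembering the exact element nesting):
>>>>
>>>>   <properties>
>>>>     <server>
>>>>       <params>
>>>>         <forkedJvmArgs>
>>>>           <arg>-Xmx3g</arg>
>>>>         </forkedJvmArgs>
>>>>       </params>
>>>>     </server>
>>>>   </properties>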
>>>>
>>>>
>>>>
>>>>> >1. The JVM is slow to observe the forked process exceeding its heap
>>>>> and does not terminate it in time
>>>>> Again, your own watchdog? If tika-server's watchdog...possibly?  I
>>>>> haven't seen this behavior, but it doesn't mean that it can't happen.
>>>>>
>>>>
>>>> I guess there are no known SLAs on how quickly Tika's watchdog kicks in -
>>>> that is what I was asking. I don't know how it is implemented.
>>>>
>>>>
>>>>> >2. It's not the heap that grows, but there is some stack overflow due
>>>>> to very deep recursion.
>>>>> Possible, but I don't think so... the default -Xss isn't very deep.
>>>>> Perhaps I misunderstand the suggestion?
>>>>>
>>>>
>>>> I think we are on the same page - I was wondering what non-heap sources
>>>> could account for the memory usage.
>>>>
>>>>
>>>>
>>>>> >Finally, are there any file types that are known to use a lot of
>>>>> memory with Tika?
>>>>> A file from any of the major file formats can be, um, crafted to take
>>>>> up a lot of memory. My rule of thumb is to allow 2GB per thread (if
>>>>> running multithreaded) or per request if you're allowing concurrent
>>>>> requests to tika-server.  There will still be some files that cause Tika
>>>>> to OOM if you're processing millions/billions of files from the wild.
>>>>>
>>>>
>>>> This happens quite often with regular files, not crafted inputs. I am
>>>> guessing there is a particular memory-hungry file format and that several
>>>> files of that format are handled in parallel.
>>>>
>>>> With 2GB per request and a 3GB heap, that would mean very few concurrent
>>>> requests, so not great efficiency. In my experience, most of the time Tika
>>>> can process lots of files in parallel with a 3GB heap.
>>>>
>>>> I also noticed that this message appears quite often:
>>>> org.apache.tika.utils.XMLReaderUtils Contention waiting for a
>>>> SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
>>>> I am guessing this means the number of requests handled in parallel
>>>> exceeds an internal limit.
>>>>
>>>>
>>>> Now that I understand this better, I have some follow-up questions:
>>>>
>>>>    1. Is there a concurrency control I can configure to limit the
>>>>    number of incoming requests handled in parallel?
>>>>
>>>>
>>> I looked at the tika-server options and did not see one for concurrency
>>> control.
>>> I did find a reply from Nicholas to an older question of mine which
>>> suggests that Tika Pipes may be the answer:
>>> https://www.mail-archive.com/user@tika.apache.org/msg03535.html
>>>
>>> The main question is whether the latest tika-server implementation uses
>>> pipes by default, or whether another solution is recommended.
>>>
>>> Thanks,
>>> Cristi
>>>
>>>
>>>
>>>>    2. Assuming the answer is “yes" above, will requests be queued when
>>>>    the limit is reached? If they are dropped, is there a busy-status
>>>>    reply to the /tika API?
>>>>    3. Is the queue size or the number of concurrently parsed files
>>>>    exposed through an API?
>>>>
>>>>
>>>>
>>>>> To turn it around, are there specific file types that you are noticing
>>>>> are causing OOM?
>>>>>
>>>>
>>>> I will have to look into obtaining analytics on the input; maybe that
>>>> will shed more light.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>> On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <cri...@cyberhaven.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am seeing some cases with Tika 2.2.1 where, despite setting a
>>>>>> watchdog to limit the heap to 3GB, the entire Tika container exceeds
>>>>>> 6GB, which exceeds the container's memory limit, so it gets OOM-killed.
>>>>>> Here is one example:
>>>>>> example:
>>>>>>
>>>>>> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB,
>>>>>> shmem-rss:32kB, UID:0 pgtables:700kB oom_score_adj:-997
>>>>>>
>>>>>> Only some files seem to be causing this behavior.
>>>>>>
>>>>>> The memory ramps up fairly quickly - in a few tens of seconds it can
>>>>>> go from 1GB to 6GB.
>>>>>>
>>>>>> The next step is to check whether this goes away with 2.8.0, but I
>>>>>> wonder if either of the following explanations makes sense:
>>>>>> 1. The JVM is slow to observe the forked process exceeding its heap
>>>>>> and does not terminate it in time.
>>>>>> 2. It's not the heap that grows, but there is some stack overflow due
>>>>>> to very deep recursion.
>>>>>>
>>>>>> Finally, are there any file types that are known to use a lot of
>>>>>> memory with Tika?
>>>>>>
>>>>>> Thanks,
>>>>>> Cristi
>>>>>>
>>>>>>
