Hi Tim,

Please let me know if you can refer me to any documentation on how to enable pipes for the Tika server Docker image. I noticed that Nicholas mentioned pipes before, so I have taken the liberty of CCing him on this thread - sorry for the noise, and thanks a lot for your help!

Best regards,
Cristi

On Wed, Jul 26, 2023 at 2:58 PM Cristian Zamfir <cri...@cyberhaven.com> wrote:

> Hi Tim,
>
> On Tue, Jul 25, 2023 at 8:28 PM Tim Allison <talli...@apache.org> wrote:
>
>> Argh, sorry for my delay.
>>
>> Y, tika server is built on Apache CXF. Threading per request happens at
>> the CXF level, not the Tika level.
>>
>> With pipes, you can control how many spawned processes are available to
>> serve your requests, but that does add some complexity -- learning curve to
>> configure.
>>
>> > I am guessing there is a particular memory hungry file format and
>> several of them are handled in parallel.
>> That's one of the key design goals with pipes -- one file per process at
>> a time. With traditional tika-server or with the watchdog/forked stuff
>> (the default in 2.x), if you're sending files in concurrently and one of
>> them crashes the server, you don't know which file was the culprit.
>>
>> Out of the box, Tika server uses the watchdog set up and not pipes.
>>
>
> I would certainly want to transition to pipes, looks like exactly what we
> need. Actually are there any downsides to making that default?
> We are using the Docker image, is it possible to pass the Docker image a
> tika-config.xml file to default to pipes? If there is any documentation
> related to this, I am interested in trying it out.
>
>
>> >I guess there are no known SLAs from Tika’s watchdog kicking in, this is
>> what I was asking. I don’t know how it is implemented.
>> It is really simple. The monitor/watchdog main process spawns a child
>> process which is the actual server. The watchdog does pulse checks of the
>> spawned process and if it goes oom or crashes, the watchdog restarts it.
>> The watchdog does no monitoring for memory consumption but relies on the
>> -Xmx that's configured.
>>
>
> In this case it really looks puzzling how a Tika server configured with
> -Xmx3g can reach >6GB. Sounds like this is a bug and I will need to find a
> way to reproduce it.
>
> Thanks,
> Cristi
>
>
>
>>
>> On Tue, Jul 25, 2023 at 7:43 AM Cristian Zamfir <cri...@cyberhaven.com>
>> wrote:
>>
>>> Hello,
>>>
>>> On 21 Jul 2023 at 23:51:54, Cristian Zamfir <cri...@cyberhaven.com>
>>> wrote:
>>>
>>>> Hi Tim!
>>>>
>>>> Sorry for the lack of details, adding now.
>>>>
>>>> On 21 Jul 2023 at 18:56:02, Tim Allison <talli...@apache.org> wrote:
>>>>
>>>>> Sorry, I'm not sure I understand precisely what's going on.
>>>>>
>>>>> First, are you running tika-server, tika-app, tika-async, or running
>>>>> Tika programmatically? I'm guessing tika-server because you've
>>>>> containerized it, but I've containerized tika-async...so...? 😃
>>>>>
>>>>
>>>> Tika-server, the official docker image with a custom config - the
>>>> config’s main changes are the -Xmx arg.
>>>>
>>>>
>>>>> If tika-server, are you sending requests in parallel to each
>>>>> container? If in parallel, how many parallel requests are you allowing?
>>>>>
>>>>
>>>> Yes, sending requests in parallel without managing the number of
>>>> requests in parallel - there is horizontal auto-scaling to deal with
>>>> load, but the number of replicas is not based on the queue size, rather on
>>>> CPU consumption. Is there a recommended concurrency level? I could use that
>>>> instead for HPA. More on that below.
>>>>
>>>>
>>>>> Are you able to share with me (privately) an example specific file
>>>>> that is causing problems?
>>>>>
>>>>
>>>> Unfortunately no, and I do not have access to the files either for
>>>> security reasons, not logging them on purpose. That would have been the
>>>> first thing I would have tried too.
>>>>
>>>>
>>>>> >where despite setting a watchdog to limit the heap to 3GB.
>>>>> You're setting your own watchdog? Or, is this tika-server's watchdog
>>>>> and you've set -Xmx3g in <forkedJvmArgs>?
>>>>>
>>>>
>>>> Using -Xmx3g.
>>>>
>>>>
>>>>
>>>>> >1. The JVM is slow to observe the forked process exceeding its heap
>>>>> and does not terminate it in time
>>>>> Again, your own watchdog? If tika-server's watchdog...possibly? I
>>>>> haven't seen this behavior, but it doesn't mean that it can't happen.
>>>>>
>>>>
>>>> I guess there are no known SLAs from Tika’s watchdog kicking in, this
>>>> is what I was asking. I don’t know how it is implemented.
>>>>
>>>>
>>>>> >2. It's not the heap that grows, but there is some stack overflow due
>>>>> to very deep recursion.
>>>>> Possible, but I don't think so... the default -Xss isn't very deep.
>>>>> Perhaps I misunderstand the suggestion?
>>>>>
>>>>
>>>> I think we are on the same page - I was thinking what non-heap sources
>>>> could account for the memory usage.
>>>>
>>>>
>>>>
>>>>> >Finally, are there any file types that are known to use a lot of
>>>>> memory with Tika?
>>>>> A file from any of the major file formats can be, um, crafted to take
>>>>> up a lot of memory. My rule of thumb is to allow 2GB per thread (if running
>>>>> multithreaded) or request if you're allowing concurrent requests of
>>>>> tika-server. There will still be some files that cause tika to OOM if
>>>>> you're processing millions/billions of files from the wild.
>>>>>
>>>>
>>>> This happens quite often with regular files, not crafted inputs. I am
>>>> guessing there is a particular memory hungry file format and several of
>>>> them are handled in parallel.
>>>>
>>>> With 2GB per request and heap size of 3GB that would mean very few
>>>> concurrent requests, so not great efficiency. Most of the time in my
>>>> experience Tika can process lots of files in parallel with a 3GB heap.
>>>>
>>>> I noticed this message also appears quite often:
>>>> org.apache.tika.utils.XMLReaderUtils Contention waiting for a
>>>> SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
>>>> I am guessing this means the number of requests handled in parallel is
>>>> exceeding a certain internal limit.
>>>>
>>>>
>>>> Now that I understand this better, I have some followup questions:
>>>>
>>>> 1. Is there concurrency control I can configure, to limit the
>>>> number of incoming requests handled in parallel?
>>>>
>>>>
>>> I looked at the tika server options and I did not see an option for
>>> concurrency control.
>>> Actually I found a reply from Nicholas to my older question where I
>>> understood that Tika Pipes may be the answer
>>> https://www.mail-archive.com/user@tika.apache.org/msg03535.html
>>>
>>> The main question is if the latest Tika server implementation uses pipes
>>> by default or is another solution recommended.
>>>
>>> Thanks,
>>> Cristi
>>>
>>>
>>>
>>>> 1. Assuming the answer is “yes” above, will requests be queued when
>>>> the limit is reached? If they are dropped, is there a busy status reply to
>>>> the /tika API?
>>>> 2. Is the queue size or the number of concurrently parsed files
>>>> exposed through an API?
>>>>
>>>>
>>>>
>>>>> To turn it around, are there specific file types that you are noticing
>>>>> are causing OOM?
>>>>>
>>>>
>>>> I will have to look into obtaining analytics on the input, maybe that
>>>> will shed more light.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>> On Thu, Jul 20, 2023 at 6:20 PM Cristian Zamfir <cri...@cyberhaven.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am seeing some cases with Tika 2.2.1 where despite setting a
>>>>>> watchdog to limit the heap to 3GB, the entire Tika container exceeds 6GB
>>>>>> and that exceeds the resource memory limit, so it gets OOM-ed. Here is one
>>>>>> example:
>>>>>>
>>>>>> total-vm:8109464kB, anon-rss:99780kB, file-rss:28204kB,
>>>>>> shmem-rss:32kB, UID:0 pgtables:700kB oom_score_adj:-997
>>>>>>
>>>>>> Only some files seem to be causing this behavior.
>>>>>>
>>>>>> The memory ramps up fairly quickly, in a few tens of seconds it can
>>>>>> go from 1GB to 6GB.
>>>>>>
>>>>>> The next step is to check if this goes away with 2.8.0, but I wonder
>>>>>> if any of the following explanations make any sense:
>>>>>> 1. The JVM is slow to observe the forked process exceeding its heap
>>>>>> and does not terminate it in time
>>>>>> 2. It's not the heap that grows, but there is some stack overflow due
>>>>>> to very deep recursion.
>>>>>>
>>>>>> Finally, are there any file types that are known to use a lot of
>>>>>> memory with Tika?
>>>>>>
>>>>>> Thanks,
>>>>>> Cristi
>>>>>>
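
On the concurrency-control question raised in the thread: since no server-side option was found, one possible workaround is to cap in-flight requests on the client side. The sketch below is only an illustration under stated assumptions, not a tika-server feature and not something proposed by anyone in the thread. The names max_concurrent_requests, BoundedSender and put_to_tika are hypothetical, and the URL http://localhost:9998/tika is an assumed local deployment. The cap is derived from the 2GB-per-request rule of thumb quoted above, which gives 1 for a 3GB heap.

```python
import threading
import urllib.request


def max_concurrent_requests(heap_gb: float, per_request_gb: float = 2.0) -> int:
    """How many requests fit in the heap under a per-request budget (at least 1)."""
    return max(1, int(heap_gb // per_request_gb))


class BoundedSender:
    """Caps the number of in-flight calls to a caller-supplied send function."""

    def __init__(self, send, limit: int):
        self._send = send
        self._slots = threading.BoundedSemaphore(limit)

    def submit(self, payload):
        # Block until a slot is free, so excess work waits on the client
        # instead of piling up inside the server.
        with self._slots:
            return self._send(payload)


def put_to_tika(data: bytes, url: str = "http://localhost:9998/tika") -> str:
    """Send one file to the /tika endpoint. The URL is an assumption."""
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8", errors="replace")
```

A caller could build BoundedSender(put_to_tika, max_concurrent_requests(3)) and call submit from as many worker threads as it likes; only the permitted number of requests reach the server at once. This limits concurrency per client only, so with several clients or replicas the cap would need to be divided between them.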