I just updated the wiki.  I haven't put in an anchor yet, but see:
https://cwiki.apache.org/confluence/display/TIKA/TikaServer and search
for 'status' at the bottom of the page.

Please let us know if you have any questions.

Best,

              Tim


On Thu, Jun 10, 2021 at 11:22 AM Cristian Zamfir <[email protected]> wrote:
>
> It appears that the -status option was dropped in 2.x - was it replaced by 
> something else?
>
> Thanks,
> Cristi
>
>
> On Wed, Jun 2, 2021 at 4:54 PM Tim Allison <[email protected]> wrote:
>>
>> >I wanted to double check that -JXX:+ExitOnOutOfMemoryError should be 
>> >provided to the main process or to the child, can you please confirm?
>>
>> Yes
>>
>> On Wed, Jun 2, 2021 at 10:49 AM Cristian Zamfir <[email protected]> 
>> wrote:
>> >
>> >
>> >
>> > > On 2 Jun 2021, at 15:33, Cristian Zamfir <[email protected]> wrote:
>> > >
>> > >>
>> > >> On 2 Jun 2021, at 14:43, Tim Allison <[email protected]> wrote:
>> > >>
>> > >>> I noticed that Tika prints in the logs OOM (null), but seems to 
>> > >>> recover by itself even when not using -spawnChild. Is this the 
>> > >>> expected behavior?
>> > >>
>> > >> When not in -spawnChild mode, Tika is catching OOM exceptions (when it
>> > >> can), but it isn't "recovering"... the jvm may be in an inconsistent
>> > >> state, and it is safest to restart the jvm.  It would probably be good
>> > >> practice when in -spawnChild mode to use -XX:+ExitOnOutOfMemoryError,
>> > >> or on the tika commandline -JXX:+ExitOnOutOfMemoryError.
>> > >
>> > > Thanks for the clarification, makes sense. Already migrated to using 
>> > > -spawnChild.
>> > > It would be great to make these the default for the Docker container; I 
>> > > suspect most people using the Docker image will use it in a similar way 
>> > > and can run into OOMs.
>> >
>> > I tested that these args work:
>> > -spawnChild TIKA_CHILD_JVM_OPTS=-JXmx3g -JXX:+ExitOnOutOfMemoryError 
>> > -status
>> > I wanted to double check that -JXX:+ExitOnOutOfMemoryError should be 
>> > provided to the main process or to the child, can you please confirm?
>> >
>> >
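A minimal Python sketch of how the arguments tested above could be assembled from the container limit, assuming the 4GB limit and 1GB of headroom discussed in this thread. The function names and the headroom default are hypothetical; only the flags themselves come from the thread.

```python
from typing import List


def child_heap_gb(container_limit_gb: int, headroom_gb: int = 1) -> int:
    """Heap for the forked child: container limit minus an assumed headroom."""
    if container_limit_gb <= headroom_gb:
        raise ValueError("container limit must exceed the headroom")
    return container_limit_gb - headroom_gb


def tika_server_args(container_limit_gb: int, with_status: bool = True) -> List[str]:
    """Build the tika-server 1.x arguments mentioned in this thread."""
    heap = child_heap_gb(container_limit_gb)
    # The -J prefix passes the JVM option through to the forked process.
    args = ["-spawnChild", f"-JXmx{heap}g", "-JXX:+ExitOnOutOfMemoryError"]
    if with_status:
        args.append("-status")
    return args


print(" ".join(tika_server_args(4)))
```

With a 4GB limit this prints the same flags as above, with -JXmx3g for the child.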
>> > >
>> > >
>> > >>
>> > >> I highly encourage you to use -spawnChild mode, or the new pipes
>> > >> modules in 2.x if those will work for you at some point...those are
>> > >> still beta.  OOMs are one thing, but infinite loops are another.
>> > >>
>> > >>> 1. Do you have a recommendation for a stress test that would allow me
>> > >>> to easily test OOM behavior?
>> > >>
>> > >> The MockParser is built for exactly this:
>> > >> https://cwiki.apache.org/confluence/display/TIKA/MockParser
>> > >>
>> > >> Let us know if you have any questions about it.  The key elements for
>> > >> you are <fakeload/>, <throw/>, <oom/> and probably <system_exit/>.
>> > >> That's for synthetic load testing.  If you want real-world files, we
>> > >> have 2TB of files from the wild:
>> > >> https://corpora.tika.apache.org/base/docs/
>> > >
>> > > Looks great. Looks like I will need to tweak the container for testing 
>> > > this, but that’s likely fine.
>> >
>> > Actually, I tested that the server restarts on OOM using ulimit and then a 
>> > for loop with curl; it is easy to reproduce.
>> >
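For anyone who wants the same kind of loop without curl, here is a rough Python sketch. It is not the exact test described above: the function names, the request count and the worker count are all hypothetical, and the sender is passed in so the loop can be tried without a running server.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def put_document(url: str, payload: bytes, timeout: float = 30.0) -> int:
    """PUT one document to the server and return the HTTP status code."""
    req = urllib.request.Request(url, data=payload, method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def stress(send, payload: bytes, count: int = 200, workers: int = 8) -> dict:
    """Call send(payload) `count` times in parallel and tally the outcomes."""
    def one(_):
        try:
            return 200 <= send(payload) < 300
        except Exception:
            # Connection errors, timeouts and HTTP errors all count as failures.
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, range(count)))
    return {"ok": sum(results), "failed": len(results) - sum(results)}
```

Against a local server this could be called as stress(lambda p: put_document("http://localhost:9998/tika", p), data), where the URL assumes the default port and data holds the bytes of a test file. A burst of failures followed by successes would be consistent with the child being restarted.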
>> > >
>> > >>
>> > >>> 2. For implementing a health check that detects when Tika is stuck, I
>> > >>> could periodically send a simple request and check that the reply is
>> > >>> correct, do you recommend a better approach?
>> > >>
>> > >> We have a rudimentary /status endpoint, which will give you the number of
>> > >> restarts, the number of files processed, and the milliseconds since the
>> > >> last parse.  You have to turn it on via the commandline: -status.
>> > >
>> > > The status endpoint looks like a possible option; I can look for 
>> > > "status": "OPERATING". Sending a single-byte file looks like a decent 
>> > > check as well.
>> > >
>> > >
>> > >
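A minimal sketch of the check described above, assuming the /status response body is a JSON object with a "status" field. Only that field and the "OPERATING" value come from this thread; the function name is hypothetical, and the other fields are deliberately ignored because their names are not given here.

```python
import json


def is_operating(status_body: str) -> bool:
    """Return True only if the body is a JSON object with "status": "OPERATING"."""
    try:
        payload = json.loads(status_body)
    except ValueError:
        # Not JSON at all: treat as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "OPERATING"
```

A liveness probe could fetch the body from /status and fail whenever this returns False or the request times out.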
>> > >>
>> > >> On Wed, Jun 2, 2021 at 6:50 AM Cristian Zamfir <[email protected]> 
>> > >> wrote:
>> > >>>
>> > >>> Hi!
>> > >>>
>> > >>> I noticed that Tika prints in the logs OOM (null), but seems to 
>> > >>> recover by itself even when not using -spawnChild. Is this the 
>> > >>> expected behavior? I am trying to figure out when logs containing 
>> > >>> "OOM" are critical and would require a container restart.
>> > >>>
>> > >>> I also wanted to bring up two of my questions below, I am looking 
>> > >>> forward to your feedback:
>> > >>> 1. Do you have a recommendation for a stress test that would allow me 
>> > >>> to easily test OOM behavior?
>> > >>> 2. For implementing a health check that detects when Tika is stuck, I 
>> > >>> could periodically send a simple request and check that the reply is 
>> > >>> correct, do you recommend a better approach?
>> > >>>
>> > >>> Thanks,
>> > >>> Cristi
>> > >>>
>> > >>> On Sat, May 29, 2021 at 2:58 PM Cristian Zamfir 
>> > >>> <[email protected]> wrote:
>> > >>>>
>> > >>>>
>> > >>>>> On 28 May 2021, at 19:03, Tim Allison <[email protected]> wrote:
>> > >>>>>
>> > >>>>> Tika 2.x should help with this in pipes and async.  Your system should
>> > >>>>> expect to go OOM or crash at some point if you're processing enough
>> > >>>>> files.
>> > >>>>
>> > >>>> I believe that this is what is happening in my case, it’s not due to 
>> > >>>> a single file, it happens under high load when processing many files 
>> > >>>> at once.
>> > >>>>
>> > >>>>>
>> > >>>>> Right, -spawnChild is not the default in 1.x, but it will be in 2.x.
>> > >>>>> And, yes, you should be using it. To set the Xmx in the forked process,
>> > >>>>> add -J, as in -JXmx2g.
>> > >>>>
>> > >>>>
>> > >>>> Did both now and I think this provides good recovery from OOM.
>> > >>>>
>> > >>>>
>> > >>>>>
>> > >>>>> I don't have the experience to recommend bumping Xmx to close to your
>> > >>>>> container's max memory. In Java programs that do a bunch of work off
>> > >>>>> heap, this would be a bad idea because you need to leave resources for
>> > >>>>> your operating system, but I don't think we do much off heap.
>> > >>>>
>> > >>>> What’s your take on a configuration in which the container is capped 
>> > >>>> at 4GB and the spawned child has a heap limit of 3GB? Sounds like a 
>> > >>>> pretty safe margin to me.
>> > >>>>
>> > >>>>>
>> > >>>>> Which file types are causing OOMs?  The MP4Parser is notorious, and
>> > >>>>> we're looking to swap it out in 2.x for a different parser.
>> > >>>>
>> > >>>> Good to hear. I don’t know how to identify the root cause because 
>> > >>>> there are many files sent at once.
>> > >>>> However, it would be great to learn if there is a quick way to 
>> > >>>> trigger a high load and test resiliency to OOM, do you have a 
>> > >>>> recommendation?
>> > >>>>
>> > >>>>
>> > >>>>>
>> > >>>>> Yep, TIKA-3353 is the monitoring that Nick was mentioning.
>> > >>>>
>> > >>>> I am actually more interested in health checks, to detect when the 
>> > >>>> system is stuck without automatically restarting. A built-in health 
>> > >>>> check would certainly be a nice feature.
>> > >>>>
>> > >>>> Besides OOM, one other possible cause is if /tmp gets full - for 
>> > >>>> instance I see here 
>> > >>>> https://github.com/tongwang/tika-server-docker/blob/master/bin/healthcheck
>> > >>>> that /tmp is cleaned up periodically and the health check fails if 
>> > >>>> it is too full.
>> > >>>>
>> > >>>> Are there any other situations that could indicate that the container 
>> > >>>> is stuck and needs a restart and if yes, is there a way to detect the 
>> > >>>> condition?
>> > >>>>
>> > >>>> Thanks,
>> > >>>> Cristi
>> > >>>>
>> > >>>>>
>> > >>>>> On Fri, May 28, 2021 at 9:08 AM Cristian Zamfir 
>> > >>>>> <[email protected]> wrote:
>> > >>>>>>
>> > >>>>>> Thanks for your answer Nick!
>> > >>>>>>
>> > >>>>>> I am running apache/tika:latest-full which is using 1.25. Looks 
>> > >>>>>> like I need at least version 1.26 for 
>> > >>>>>> https://issues.apache.org/jira/browse/TIKA-3353,
>> > >>>>>> but I am not sure if this is not overkill for implementing basic 
>> > >>>>>> liveness health checks.
>> > >>>>>>
>> > >>>>>> It's clear that -spawnChild and ForkParser are two must-haves that, 
>> > >>>>>> AFAIU, are not default in apache/tika:latest-full.
>> > >>>>>>
>> > >>>>>> My guess is that I also need to set the jvm heap size close to the 
>> > >>>>>> memory resource limit for the container, but that's not ideal 
>> > >>>>>> because the heap size would be statically configured while the 
>> > >>>>>> memory resource limits are dynamic. Or maybe this is not necessary 
>> > >>>>>> if I use -spawnChild?
>> > >>>>>>
>> > >>>>>> I am looking forward to your answers, thanks a lot!
>> > >>>>>>
>> > >>>>>> Cristi
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Fri, May 28, 2021 at 2:55 PM Nick Burch <[email protected]> 
>> > >>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>> On Thu, 27 May 2021, Cristian Zamfir wrote:
>> > >>>>>>>> I am running some stress tests of the latest tika server docker (not
>> > >>>>>>>> modified in any way, just pulled from the registry) and seeing that
>> > >>>>>>>> after a few hours I see OOM in the logs. The container has a limit of
>> > >>>>>>>> 4GB set in K8S. I am wondering if you have any best practices on how
>> > >>>>>>>> to avoid this.
>> > >>>>>>>
>> > >>>>>>> Hopefully one of our Tika+Docker experts will be along in a minute to
>> > >>>>>>> help advise!
>> > >>>>>>>
>> > >>>>>>> For now, the general advice is documented at:
>> > >>>>>>> https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika
>> > >>>>>>>
>> > >>>>>>> Also, which version of Tika are you on? There have been some
>> > >>>>>>> contributions recently around monitoring the server, which you might
>> > >>>>>>> want to upgrade for, e.g. TIKA-3353
>> > >>>>>>>
>> > >>>>>>> Nick
>> >
