I see, thanks. Has TIKA_CHILD_JVM_OPTS=-JXmx been replaced by the configuration option forkedJvmArgs, or do both still work? I am guessing that it has been fully replaced.
When I switched to a config file for the server I noticed that some of the options I can see in the github repo do not seem to work. For instance log and includeStack. Have there been changes to these options in master compared to when 2.0.0-BETA was released?

java -jar ./tika-server-standard-2.0.0-BETA.jar --config ./tika-server-config.xml
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
org.apache.tika.exception.TikaConfigException: Couldn't find setter: setLog for object class org.apache.tika.server.core.TikaServerConfig
    at org.apache.tika.config.ConfigBase.tryToSet(ConfigBase.java:389)
    at org.apache.tika.config.ConfigBase.setParams(ConfigBase.java:222)
    at org.apache.tika.config.ConfigBase.setParams(ConfigBase.java:190)
    at org.apache.tika.config.ConfigBase.configure(ConfigBase.java:432)
    at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:179)
    at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:172)
    at org.apache.tika.server.core.TikaServerConfig.load(TikaServerConfig.java:128)
    at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:83)
    at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66)

<properties>
  <server>
    <params>
      <log>info</log>
      <!-- <includeStack>false</includeStack> -->
      <forkedJvmArgs>
        <arg>-Xmx3g</arg>
      </forkedJvmArgs>
      <endpoints>
        <endpoint>status</endpoint>
        <endpoint>tika</endpoint>
      </endpoints>
    </params>
  </server>
</properties>

Thanks,
Cristi

On Thu, Jun 10, 2021 at 5:36 PM Tim Allison <[email protected]> wrote:
> I just updated the wiki. I haven't put in an anchor yet, but see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaServer and search
> for 'status' at the bottom of the page.
>
> Please let us know if you have any questions.
>
> Best,
>
> Tim
>
>
> On Thu, Jun 10, 2021 at 11:22 AM Cristian Zamfir <[email protected]> wrote:
> >
> > It appears that the -status option was dropped in 2.x - was it replaced by something else?
> >
> > Thanks,
> > Cristi
> >
> >
> > On Wed, Jun 2, 2021 at 4:54 PM Tim Allison <[email protected]> wrote:
> >>
> >> >I wanted to double check that -JXX:+ExitOnOutOfMemoryError should be provided to the main process or to the child, can you please confirm?
> >>
> >> Yes
> >>
> >> On Wed, Jun 2, 2021 at 10:49 AM Cristian Zamfir <[email protected]> wrote:
> >> >
> >> >
> >> >
> >> > > On 2 Jun 2021, at 15:33, Cristian Zamfir <[email protected]> wrote:
> >> > >
> >> > >>
> >> > >> On 2 Jun 2021, at 14:43, Tim Allison <[email protected]> wrote:
> >> > >>
> >> > >>> I noticed that Tika prints in the logs OOM (null), but seems to recover by itself even when not using -spawnChild. Is this the expected behavior?
> >> > >>
> >> > >> When not in -spawnChild mode, Tika is catching OOM exceptions (when it
> >> > >> can), but it isn't "recovering"... the jvm may be in an inconsistent
> >> > >> state, and it is safest to restart the jvm. It would probably be good
> >> > >> practice when in -spawnChild mode to use -XX:+ExitOnOutOfMemoryError,
> >> > >> or on the tika commandline -JXX:+ExitOnOutOfMemoryError.
> >> > >
> >> > > Thanks for the clarification, makes sense. Already migrated to using -spawnChild.
> >> > > It would be great to make these default for the docker container, I suspect most people using the docker image will use it in a similar way and can run into OOM.
> >> >
> >> > I tested that these args work:
> >> > -spawnChild TIKA_CHILD_JVM_OPTS=-JXmx3g -JXX:+ExitOnOutOfMemoryError -status
> >> > I wanted to double check that -JXX:+ExitOnOutOfMemoryError should be provided to the main process or to the child, can you please confirm?
> >> >
> >> >
> >> > >
> >> > >
> >> > >>
> >> > >> I highly encourage you to use -spawnChild mode, or the new pipes
> >> > >> modules in 2.x if those will work for you at some point...those are
> >> > >> still beta. OOMs are one thing, but infinite loops are another.
> >> > >>
> >> > >> 1. Do you have a recommendation for a stress test that would allow me
> >> > >> to easily test OOM behavior?
> >> > >> The MockParser is built for exactly this:
> >> > >> https://cwiki.apache.org/confluence/display/TIKA/MockParser
> >> > >>
> >> > >> Let us know if you have any questions about it. The key elements for
> >> > >> you are <fakeload/>, <throw/> <oom/> and probably <system_exit/>.
> >> > >> That's for synthetic load testing. If you want files in the wild, we
> >> > >> have 2TB of files from the wild:
> >> > >> https://corpora.tika.apache.org/base/docs/
> >> > >
> >> > > Looks great. Looks like I will need to tweak the container for testing this, but that’s likely fine.
> >> >
> >> > Actually I tested that the server restarts on OOM using ulimit and then a for loop with curl, it is easy to reproduce.
> >> >
> >> > >
> >> > >>
> >> > >> 2. For implementing a health check that detects when Tika is stuck, I
> >> > >> could periodically send a simple request and check that the reply is
> >> > >> correct, do you recommend a better approach?
> >> > >> We have a rudimentary /status endpoint, which will give you number of
> >> > >> restarts, number of files processed, milliseconds since last parse.
> >> > >> You have to turn it on via the commandline: -status.
> >> > >
> >> > > The status endpoint looks like a possible, I can look for "status": “OPERATING”. Sending a single byte file looks like a decent check as well.
> >> > >
> >> > >
> >> > >
> >> > >>
> >> > >> On Wed, Jun 2, 2021 at 6:50 AM Cristian Zamfir <[email protected]> wrote:
> >> > >>>
> >> > >>> Hi!
> >> > >>>
> >> > >>> I noticed that Tika prints in the logs OOM (null), but seems to recover by itself even when not using -spawnChild. Is this the expected behavior? I am trying to figure out when logs containing "OOM" are critical and would require a container restart.
> >> > >>>
> >> > >>> I also wanted to bring up two of my questions below, I am looking forward to your feedback:
> >> > >>> 1. Do you have a recommendation for a stress test that would allow me to easily test OOM behavior?
> >> > >>> 2. For implementing a health check that detects when Tika is stuck, I could periodically send a simple request and check that the reply is correct, do you recommend a better approach?
> >> > >>>
> >> > >>> Thanks,
> >> > >>> Cristi
> >> > >>>
> >> > >>> On Sat, May 29, 2021 at 2:58 PM Cristian Zamfir <[email protected]> wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>>> On 28 May 2021, at 19:03, Tim Allison <[email protected]> wrote:
> >> > >>>>>
> >> > >>>>> Tika 2.x should help with this in pipes and async. Your system should
> >> > >>>>> expect to go oom or crash at some point if you're processing enough
> >> > >>>>> files.
> >> > >>>>
> >> > >>>> I believe that this is what is happening in my case, it’s not due to a single file, it happens under high load when processing many files at once.
> >> > >>>>
> >> > >>>>>
> >> > >>>>> Right --spawnChild is not default in 1.x, but it will be in 2.x. And,
> >> > >>>>> yes, you should be using it. To set the Xmx in the forked process add
> >> > >>>>> -J, as in -JXmx2g would set the Xmx for the forked process.
> >> > >>>>
> >> > >>>>
> >> > >>>> Did both now and I think this provides good recovery from OOM.
> >> > >>>>
> >> > >>>>
> >> > >>>>>
> >> > >>>>> I don't have experience to recommend bumping Xmx to close to your
> >> > >>>>> container's max memory. In java programs that do a bunch of work off
> >> > >>>>> heap, this would be a bad idea because you need to leave resources for
> >> > >>>>> your system os, but I don't think we do much off heap.
> >> > >>>>
> >> > >>>> What’s your take on a configuration in which the container is capped at 4GB and the spawned child has a heap limit of 3GB? Sounds like a pretty safe margin to me.
> >> > >>>>
> >> > >>>>>
> >> > >>>>> Which file types are causing OOMs? The MP4Parser is notorious, and
> >> > >>>>> we're looking to swap it out in 2.x for a different parser.
> >> > >>>>
> >> > >>>> Good to hear. I don’t know how to identify the root cause because there are many files sent at once.
> >> > >>>> However, it would be great to learn if there is a quick way to trigger a high load and test resiliency to OOM, do you have a recommendation?
> >> > >>>>
> >> > >>>>
> >> > >>>>>
> >> > >>>>> Yep, TIKA-3353 is the monitoring that Nick was mentioning.
> >> > >>>>
> >> > >>>> I am actually more interested in health checks, to detect when the system is stuck without automatically restarting. A built-in health check would certainly be a nice feature.
> >> > >>>>
> >> > >>>> Besides OOM, one other possible cause is if /tmp gets full - for instance I see here https://github.com/tongwang/tika-server-docker/blob/master/bin/healthcheck that /tmp is cleaned up periodically and the health check fails if it is too full.
> >> > >>>>
> >> > >>>> Are there any other situations that could indicate that the container is stuck and needs a restart and if yes, is there a way to detect the condition?
> >> > >>>>
> >> > >>>> Thanks,
> >> > >>>> Cristi
> >> > >>>>
> >> > >>>>>
> >> > >>>>> On Fri, May 28, 2021 at 9:08 AM Cristian Zamfir <[email protected]> wrote:
> >> > >>>>>>
> >> > >>>>>> Thanks for your answer Nick!
> >> > >>>>>>
> >> > >>>>>> I am running apache/tika:latest-full which is using 1.25. Looks like I need at least version 1.26 for https://issues.apache.org/jira/browse/TIKA-3353, but I am not sure if this is not overkill for implementing basic liveness health checks.
> >> > >>>>>>
> >> > >>>>>> It's clear that -spawnChild and ForkParser are two must-haves that AFAIU are not default in apache/tika:latest-full
> >> > >>>>>>
> >> > >>>>>> My guess is that I also need to set the jvm heap size close to the memory resource limit for the container, but that's not ideal because the heap size would be statically configured while the memory resource limits are dynamic. Or maybe this is not necessary if I use -spawnChild?
> >> > >>>>>>
> >> > >>>>>> I am looking forward to your answers, thanks a lot!
> >> > >>>>>>
> >> > >>>>>> Cristi
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> On Fri, May 28, 2021 at 2:55 PM Nick Burch <[email protected]> wrote:
> >> > >>>>>>>
> >> > >>>>>>> On Thu, 27 May 2021, Cristian Zamfir wrote:
> >> > >>>>>>>> I am running some stress tests of the latest tika server docker (not
> >> > >>>>>>>> modified in any way, just pulled from the registry) and seeing that after a
> >> > >>>>>>>> few hours I see OOM in the logs. The container has a limit of 4GB set in
> >> > >>>>>>>> K8S. I am wondering if you have any best practices on how to avoid this.
> >> > >>>>>>>
> >> > >>>>>>> Hopefully one of our Tika+Docker experts will be along in a minute to help
> >> > >>>>>>> advise!
> >> > >>>>>>>
> >> > >>>>>>> For now, the general advice is documented at:
> >> > >>>>>>> https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika
> >> > >>>>>>>
> >> > >>>>>>> Also, which version of Tika are you on? There have been some contributions
> >> > >>>>>>> recently around monitoring the server, which you might want to upgrade
> >> > >>>>>>> for, eg TIKA-3353
> >> > >>>>>>>
> >> > >>>>>>> Nick
> >> > >
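
P.S. For the health check discussed in this thread, here is a minimal sketch of what I have in mind. It assumes the server is reachable at http://localhost:9998/status (host and port are my assumption), that the status endpoint is enabled, and that the reply is JSON containing "status": "OPERATING" as mentioned in the thread. The names is_operating and check are hypothetical helpers, not part of Tika.

```python
import json
import urllib.request

# Assumption: adjust host and port to match the deployment.
STATUS_URL = "http://localhost:9998/status"


def is_operating(body: str) -> bool:
    """Return True only if the reply is a JSON object with "status": "OPERATING"."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not valid JSON: treat as unhealthy.
        return False
    return isinstance(data, dict) and data.get("status") == "OPERATING"


def check(url: str = STATUS_URL, timeout: float = 5.0) -> bool:
    """Fetch the status endpoint; network errors and timeouts count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_operating(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # a reply that cannot be decoded raises a ValueError subclass.
        return False
```

A container liveness probe could call check() and exit non-zero when it returns False. The timeout matters here because a stuck server may accept the connection and never answer.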
