I’m pretty sure these OOMs are caused by uncontrolled thread creation, up to 4000 threads. That alone requires an additional 4 GB of native memory (about 1 MB per thread), completely outside the heap. It is as if Solr doesn’t use thread pools at all.
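For anyone who wants to double-check this on their own nodes, here is roughly how I am counting the threads (a sketch; assumes Linux, and the pgrep pattern is just whatever matches your Solr process):

    # Find the Solr JVM (assumes it was started with "-jar start.jar")
    SOLR_PID=$(pgrep -f 'java.*start.jar' | head -1)

    # The kernel's thread count for that process
    grep Threads /proc/$SOLR_PID/status

    # Cross-check with ps (nlwp = number of lightweight processes, i.e. threads)
    ps -o nlwp= -p $SOLR_PID

At 4000 threads, even a modest per-thread stack adds up to gigabytes of native memory on top of the heap.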
I set this in jetty.xml, but it still created 4000 threads.

    <Get name="ThreadPool">
      <Set name="minThreads" type="int"><Property name="solr.jetty.threads.min" default="200"/></Set>
      <Set name="maxThreads" type="int"><Property name="solr.jetty.threads.max" default="200"/></Set>
    </Get>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 23, 2017, at 7:02 PM, Damien Kamerman <dami...@gmail.com> wrote:
>
> I found the suggesters very memory hungry. I had one particularly large
> index where the suggester should have been filtering a small number of
> docs, but was mmap'ing the entire index. I only ever saw this behavior with
> the suggesters.
>
> On 22 November 2017 at 03:17, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
>> All our customizations are in solr.in.sh. We’re using the one we
>> configured for 6.3.0. I’ll check for any differences between that and the
>> 6.5.1 script.
>>
>> I don’t see any arguments at all in the dashboard. I do see them in a ps
>> listing, right at the end.
>>
>> java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled
>> -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages
>> -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc
>> -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
>> -XX:+PrintGCApplicationStoppedTime
>> -Xloggc:/solr/logs/solr_gc.log -XX:+UseGCLogFileRotation
>> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
>> -Dcom.sun.management.jmxremote
>> -Dcom.sun.management.jmxremote.local.only=false
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.port=18983
>> -Dcom.sun.management.jmxremote.rmi.port=18983
>> -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com
>> -DzkClientTimeout=15000
>> -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud
>> -Dsolr.log.level=WARN
>> -Dsolr.log.dir=/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983
>> -DSTOP.KEY=solrrocks -Dhost=new-solr-c01.test3.cloud.cheggnet.com
>> -Duser.timezone=UTC -Djetty.home=/apps/solr6/server
>> -Dsolr.solr.home=/apps/solr6/server/solr -Dsolr.install.dir=/apps/solr6
>> -Dgraphite.prefix=solr-cloud.new-solr-c01
>> -Dgraphite.host=influx.test.cheggnet.com
>> -javaagent:/apps/solr6/newrelic/newrelic.jar
>> -Dnewrelic.environment=test3 -Dsolr.log.muteconsole -Xss256k
>> -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh
>> 8983 /solr/logs -jar start.jar --module=http
>>
>> I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0.
>> Our load benchmarks use prod logs. We added suggesters, but those use
>> analyzing infix, so they are search indexes, not in-memory.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Nov 21, 2017, at 5:46 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>>
>>> On 11/20/2017 6:17 PM, Walter Underwood wrote:
>>>> When I ran load benchmarks with 6.3.0, an overloaded cluster would get
>>>> super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start
>>>> getting OOMs. That is really bad, because it means we need to reboot every
>>>> node in the cluster.
>>>> Also, the JVM OOM hook isn’t running the process killer (JVM
>>>> 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an
>>>> 8G heap.
>>> <snip>
>>>> This is not good behavior in prod. The process goes to the bad place,
>>>> then we need to wait until someone is paged and kills it manually. Luckily,
>>>> it usually drops out of the live nodes for each collection and doesn’t take
>>>> user traffic.
>>>
>>> There was a bug, fixed long before 6.3.0, where the OOM killer script
>>> wasn't working because the arguments enabling it were in the wrong place.
>>> It was fixed in 5.5.1 and 6.0.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-8145
>>>
>>> If the scripts that you are using to get Solr started originated with a
>>> much older version of Solr than you are currently running, maybe you've got
>>> the arguments in the wrong order.
>>>
>>> Do you see the commandline arguments for the OOM killer (only available
>>> on *NIX systems, not Windows) on the admin UI dashboard? If they are
>>> properly placed, you will see them on the dashboard, but if they aren't
>>> properly placed, then you won't see them. This is what the argument looks
>>> like for one of my Solr installs:
>>>
>>> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
>>>
>>> Something which you probably already know: If you're hitting OOM, you
>>> need a larger heap, or you need to adjust the config so it uses less
>>> memory. There are no other ways to "fix" OOM problems.
>>>
>>> Thanks,
>>> Shawn
>>
>>
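A quick way to sanity-check the placement Shawn describes, straight from a ps listing (a sketch; assumes Linux, and that your Solr process matches 'start.jar'; the hook only takes effect as a JVM option, i.e. placed before "-jar"):

    # Print the Solr command line one argument per line, with positions,
    # and show where the OOM hook and "-jar" sit relative to each other.
    SOLR_PID=$(pgrep -f 'java.*start.jar' | head -1)
    ps -ww -o args= -p $SOLR_PID | tr ' ' '\n' | grep -n -e OnOutOfMemoryError -e '^-jar$'

If the OnOutOfMemoryError line number is smaller than the -jar one, the hook is a real JVM option; if it prints after -jar, it is only a program argument and will never fire.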