I do have one theory about the OOM. The server is running out of memory because 
there are too many threads. Instead of the overload queueing up in the load 
balancer, it queues up as new threads waiting to run. With solr.jetty.threads.max 
set to 10,000, that is guaranteed to happen under overload.

New Relic shows this clearly. CPU hits 100% at 15:40, and the thread count and load 
average start climbing. At 15:43, the process reaches 3,000 threads and starts 
throwing OOMs. After that, the server sits in a stable congested state.
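
A quick way to confirm the thread count on the box itself (a sketch; it assumes the 
Solr JVM is the only process matching start.jar):

    # count the threads in the Solr JVM
    ps -o nlwp= -p "$(pgrep -f start.jar)"
    # or read the same number from /proc
    grep Threads /proc/"$(pgrep -f start.jar)"/status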

I understand why the Jetty thread max was set so high, but I think the cure is 
worse than the disease. We’ll run another load benchmark with thread max at 
something realistic, like 200.
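
Something like this in solr.in.sh should do it, assuming we keep driving the Jetty 
pool size through the same system property (SOLR_OPTS is the stock hook in the 6.x 
script):

    # cap the Jetty request thread pool instead of letting it grow toward 10,000
    SOLR_OPTS="$SOLR_OPTS -Dsolr.jetty.threads.max=200"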

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 21, 2017, at 8:17 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> All our customizations are in solr.in.sh. We’re using the one we configured 
> for 6.3.0. I’ll check for any differences between that and the 6.5.1 script.
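> 
> A plain diff should surface anything we changed or anything new in the 6.5.1 
> template (paths below are placeholders, not our real ones):
> 
>     # compare our 6.3.0-era solr.in.sh against the stock 6.5.1 one
>     diff -u /path/to/6.3.0/solr.in.sh /path/to/6.5.1/solr.in.sh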
> 
> I don’t see any arguments at all in the dashboard. I do see them in a ps 
> listing, right at the end.
> 
> java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled 
> -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages 
> -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc 
> -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution 
> -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr/logs/solr_gc.log 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.local.only=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.port=18983 
> -Dcom.sun.management.jmxremote.rmi.port=18983 
> -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com 
> -DzkClientTimeout=15000 
> -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud
>  -Dsolr.log.level=WARN -Dsolr.log.dir=/solr/logs -Djetty.port=8983 
> -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks 
> -Dhost=new-solr-c01.test3.cloud.cheggnet.com -Duser.timezone=UTC 
> -Djetty.home=/apps/solr6/server -Dsolr.solr.home=/apps/solr6/server/solr 
> -Dsolr.install.dir=/apps/solr6 -Dgraphite.prefix=solr-cloud.new-solr-c01 
> -Dgraphite.host=influx.test.cheggnet.com 
> -javaagent:/apps/solr6/newrelic/newrelic.jar -Dnewrelic.environment=test3 
> -Dsolr.log.muteconsole -Xss256k -Dsolr.log.muteconsole 
> -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh 8983 /solr/logs -jar 
> start.jar --module=http
> 
> I’m still confused about why we are hitting OOMs in 6.5.1 but weren’t in 6.3.0. Our 
> load benchmarks replay prod logs. We did add suggesters, but those use the analyzing 
> infix lookup, so they are built as on-disk search indexes, not in-memory structures.
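> 
> For reference, an analyzing infix suggester is defined roughly like this (the 
> suggester name and field are made up, not our real config); the lookup builds a 
> sidecar Lucene index under indexPath rather than an in-heap structure:
> 
>     <searchComponent name="suggest" class="solr.SuggestComponent">
>       <lst name="suggester">
>         <str name="name">titleSuggester</str>                   <!-- hypothetical name -->
>         <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
>         <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>         <str name="field">title</str>                           <!-- hypothetical field -->
>         <str name="suggestAnalyzerFieldType">text_general</str>
>         <str name="indexPath">analyzing_infix_suggestions</str> <!-- on-disk index dir -->
>       </lst>
>     </searchComponent>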
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Nov 21, 2017, at 5:46 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>> 
>> On 11/20/2017 6:17 PM, Walter Underwood wrote:
>>> When I ran load benchmarks with 6.3.0, an overloaded cluster would get 
>>> super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start 
>>> getting OOMs. That is really bad, because it means we need to reboot every 
>>> node in the cluster.
>>> Also, the JVM OOM hook isn’t running the process killer (JVM 
>>> 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an 
>>> 8G heap.
>> <snip>
>>> This is not good behavior in prod. The process goes to the bad place, then 
>>> we need to wait until someone is paged and kills it manually. Luckily, it 
>>> usually drops out of the live nodes for each collection and doesn’t take 
>>> user traffic.
>> 
>> There was a bug, fixed long before 6.3.0, where the OOM killer script wasn't 
>> working because the arguments enabling it were in the wrong place.  It was 
>> fixed in 5.5.1 and 6.0.
>> 
>> https://issues.apache.org/jira/browse/SOLR-8145
>> 
>> If the scripts that you are using to get Solr started originated with a much 
>> older version of Solr than you are currently running, maybe you've got the 
>> arguments in the wrong order.
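>> 
>> To illustrate the ordering issue generically (not your actual command lines): the 
>> JVM only honors options that come before "-jar start.jar"; anything after that 
>> point is handed to start.jar as a program argument.
>> 
>>   # honored by the JVM
>>   java -Xmx8g "-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs" -jar start.jar --module=http
>> 
>>   # ignored by the JVM: it lands after -jar, so it is just an argument to start.jar
>>   java -Xmx8g -jar start.jar --module=http "-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"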
>> 
>> Do you see the commandline arguments for the OOM killer (only available on 
>> *NIX systems, not Windows) on the admin UI dashboard?  If they are properly 
>> placed, you will see them on the dashboard, but if they aren't properly 
>> placed, then you won't see them.  This is what the argument looks like for 
>> one of my Solr installs:
>> 
>> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
>> 
>> Something which you probably already know:  If you're hitting OOM, you need 
>> a larger heap, or you need to adjust the config so it uses less memory.  
>> There are no other ways to "fix" OOM problems.
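>> 
>> The heap size itself is a solr.in.sh change; for example (sizes here are 
>> illustrative only, not a recommendation):
>> 
>>   # sets -Xms and -Xmx together
>>   SOLR_HEAP="12g"
>>   # or spell the flags out yourself instead (use one or the other)
>>   SOLR_JAVA_MEM="-Xms12g -Xmx12g"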
>> 
>> Thanks,
>> Shawn
> 
