Hi,

1/ As others have already said, my first action would be to understand why you need so much heap.
The first step is to bring your heap down to 31 GB at most (or obviously less if possible), so the JVM can still use compressed object pointers:
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
(There is a quick way to check this, sketched below.)

Can you provide some typical Solr requests covering most of your use cases? Take them from the Solr logs so you can also include the hits count and QTime (a grep sketch is below).
- pay attention to the rows and fl parameters
- if you are using facets, use the JSON Facet API (example below)

Did you optimise your schema?
- remove unnecessary fields from your indices
- tune the indexed, stored and docValues attributes (do not index or store what you do not need; see the field sketch below)

Did you increase the Solr caches too much?

I didn't see which Java version you are using.

2/ With such a huge heap, I would try the G1 collector (a GC_TUNE sketch is below).

3/ I would stop optimizing the indexes.

4/ It looks like you have enough RAM for your heap plus the OS page cache (80 GB + 20 GB < 120 GB), but did you minimize swapping on the servers (vm.swappiness = 1)? A sysctl sketch is below.

5/ How often are you updating your indexes on the master (continuously, once an hour, ..., once a day)?
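To check whether a given heap size still gets compressed oops, you can ask the JVM directly. A minimal sketch, assuming a HotSpot JVM (Oracle/OpenJDK); the -Xmx values are only examples:

# compressed object pointers are still enabled at 31 GB
java -Xmx31g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops

# at your current 80 GB heap this reports false
java -Xmx80g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops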
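To pull representative requests with their hits count and QTime out of the logs, something like this can work; the log path is the default for a standalone install and the log format may differ slightly on your setup:

# show the 20 slowest requests recorded in solr.log (QTime is in ms)
grep "QTime=" server/logs/solr.log | awk -F'QTime=' '{print $2, $0}' | sort -rn | head -20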
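For the facets, a terms facet through the JSON API looks roughly like this; the collection name and the "category" field are placeholders, adapt them to your schema:

curl http://localhost:8983/solr/mycollection/query -d '
{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "top_categories": { "type": "terms", "field": "category", "limit": 10 }
  }
}'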
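For the schema, the idea is to enable only what each field really needs. A rough sketch in schema.xml terms; the field names and types are placeholders:

<!-- faceting/sorting only: docValues, not indexed, not stored -->
<field name="brand" type="string" indexed="false" stored="false" docValues="true"/>

<!-- full-text search only, never returned to the client: indexed but not stored -->
<field name="description" type="text_general" indexed="true" stored="false"/>

<!-- returned in results but never searched or faceted on: stored only -->
<field name="thumbnail_url" type="string" indexed="false" stored="true" docValues="false"/>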
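For G1, if you start Solr with the bin/solr scripts, the GC flags go into GC_TUNE in solr.in.sh. A starting point only, not tuned values, assuming Java 8:

# in solr.in.sh, replaces the CMS flags you listed
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250"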
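And for swap, assuming a Linux host:

# apply immediately
sudo sysctl -w vm.swappiness=1

# persist across reboots
echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf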
Regards
Dominique

On Wed, 3 Oct 2018 at 23:11, yasoobhaider <yasoobhaid...@gmail.com> wrote:

> Hi
>
> I'm working with a Solr cluster with master-slave architecture.
>
> Master and slave config:
> RAM: 120GB
> cores: 16
>
> At any point there are between 10-20 slaves in the cluster, each serving
> ~2k requests per minute. Each slave houses two collections of approx 10G
> (~2.5mil docs) and 2G (10mil docs) when optimized.
>
> I am working with Solr 6.2.1
>
> Solr configuration:
>
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC
> -XX:-OmitStackTraceInFastThrow
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:MaxTenuringThreshold=8
> -XX:ParallelGCThreads=4
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=15
> -XX:TargetSurvivorRatio=90
> -Xmn10G
> -Xms80G
> -Xmx80G
>
> Some of these configurations have been reached by trial and error over
> time, including the huge heap size.
>
> This cluster usually runs without any errors.
>
> In the usual scenario, old gen GC is triggered according to the
> configuration at 50% old gen occupancy, and the collector clears out the
> memory over the next minute or so. This happens every 10-15 minutes.
>
> However, I have noticed that sometimes the GC pattern of the slaves
> completely changes and old gen GC is not able to clear the memory.
>
> After observing the GC logs closely for multiple old gen collections, I
> noticed that the old gen GC is triggered at 50% occupancy, but if there is
> a GC Allocation Failure before the collection completes (after CMS Initial
> Remark but before CMS reset), the old gen collection is not able to clear
> much memory. And as soon as this collection completes, another old gen GC
> is triggered.
>
> In worst-case scenarios, this cycle of old gen GC triggering and GC
> allocation failures keeps repeating, the old gen memory keeps increasing,
> and it ends in a single-threaded STW GC, which is not able to do much, and
> I have to restart the Solr server.
>
> The last time this happened it was after the following sequence of events:
>
> 1. We optimized the bigger collection, bringing it to its optimized size
> of ~10G.
> 2. For an unrelated reason, we had stopped indexing to the master. We
> usually index at a low-ish throughput of ~1mil docs/day. This is relevant
> because when we are indexing, the size of the collection increases, and
> this affects the heap used by the collection.
> 3. The slaves started behaving erratically, with the old gen collection
> not being able to free up the required memory and finally getting stuck in
> a STW GC.
>
> As unlikely as this sounds, this is the only thing that changed on the
> cluster. There was no change in query throughput or type of queries.
>
> I restarted the slaves multiple times, but the GC behaved in the same way
> for over three days. Then, when we fixed the indexing and made it live,
> the slaves resumed their original GC pattern and have been running without
> any issues for over 24 hours now.
>
> I would really be grateful for any advice on the following:
>
> 1. What could be the reason behind CMS not being able to free up the
> memory? What are some experiments I can run to solve this problem?
> 2. Can stopping/starting indexing be a reason for such drastic changes to
> the GC pattern?
> 3. I have read in multiple places on this mailing list that the heap size
> should be much lower (2x-3x the size of the collection), but the last time
> I tried that, CMS was not able to run smoothly and a STW GC would occur,
> which was only solved by a restart. My reasoning is that the type of
> queries and the throughput are also factors in deciding the heap size, so
> it may be that our queries are creating too many objects. Is my reasoning
> correct, or should I try a lower heap size (if it helps achieve a stable
> GC pattern)?
>
> (4. Silly question, but what is the right way to ask a question on the
> mailing list? Via mail or via the nabble website? I sent this question
> earlier as a mail, but it was not showing up on the nabble website, so I
> am posting it from the website now.)