Hi,

1/ As others have already said, my first action would be to understand why you need so much heap.
The first step is to bring your heap down to 31 GB at most (or obviously less if possible), so the JVM can still use compressed object pointers:
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
(There is a quick way to check this, sketched below.)

Can you provide some typical Solr requests covering most of your use cases? Take them from the Solr logs so you can also include the hits count and QTime (a grep sketch is below).
- pay attention to the rows and fl parameters
- if you are using facets, use the JSON Facet API (example below)

Did you optimise your schema?
- remove unnecessary fields from your indices
- tune the indexed, stored and docValues attributes (do not index or store what you do not need; see the field sketch below)

Did you increase the Solr caches too much?

I didn't see which Java version you are using.

2/ With such a huge heap, I would try the G1 collector (a GC_TUNE sketch is below).

3/ I would stop optimizing the indexes.

4/ It looks like you have enough RAM for your heap plus the OS page cache (80 GB + 20 GB < 120 GB), but did you minimize swapping on the servers (vm.swappiness = 1)? A sysctl sketch is below.

5/ How often are you updating your indexes on the master (continuously, once an hour, ..., once a day)?
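To check whether a given heap size still gets compressed oops, you can ask the JVM directly. A minimal sketch, assuming a HotSpot JVM (Oracle/OpenJDK); the -Xmx values are only examples:

# compressed object pointers are still enabled at 31 GB
java -Xmx31g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops

# at your current 80 GB heap this reports false
java -Xmx80g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops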
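To pull representative requests with their hits count and QTime out of the logs, something like this can work; the log path is the default for a standalone install and the log format may differ slightly on your setup:

# show the 20 slowest requests recorded in solr.log (QTime is in ms)
grep "QTime=" server/logs/solr.log | awk -F'QTime=' '{print $2, $0}' | sort -rn | head -20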
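For the facets, a terms facet through the JSON API looks roughly like this; the collection name and the "category" field are placeholders, adapt them to your schema:

curl http://localhost:8983/solr/mycollection/query -d '
{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "top_categories": { "type": "terms", "field": "category", "limit": 10 }
  }
}'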
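For the schema, the idea is to enable only what each field really needs. A rough sketch in schema.xml terms; the field names and types are placeholders:

<!-- faceting/sorting only: docValues, not indexed, not stored -->
<field name="brand" type="string" indexed="false" stored="false" docValues="true"/>

<!-- full-text search only, never returned to the client: indexed but not stored -->
<field name="description" type="text_general" indexed="true" stored="false"/>

<!-- returned in results but never searched or faceted on: stored only -->
<field name="thumbnail_url" type="string" indexed="false" stored="true" docValues="false"/>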
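For G1, if you start Solr with the bin/solr scripts, the GC flags go into GC_TUNE in solr.in.sh. A starting point only, not tuned values, assuming Java 8:

# in solr.in.sh, replaces the CMS flags you listed
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250"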
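And for swap, assuming a Linux host:

# apply immediately
sudo sysctl -w vm.swappiness=1

# persist across reboots
echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf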
Regards
Dominique

On Wed, 3 Oct 2018 at 23:11, yasoobhaider <yasoobhaid...@gmail.com> wrote:

> Hi
>
> I'm working with a Solr cluster with master-slave architecture.
>
> Master and slave config:
> RAM: 120GB
> cores: 16
>
> At any point there are between 10-20 slaves in the cluster, each serving
> ~2k requests per minute. Each slave houses two collections of approx 10G
> (~2.5mil docs) and 2G (10mil docs) when optimized.
>
> I am working with Solr 6.2.1
>
> Solr configuration:
>
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC
> -XX:-OmitStackTraceInFastThrow
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:MaxTenuringThreshold=8
> -XX:ParallelGCThreads=4
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=15
> -XX:TargetSurvivorRatio=90
> -Xmn10G
> -Xms80G
> -Xmx80G
>
> Some of these configurations have been reached by trial and error over
> time, including the huge heap size.
>
> This cluster usually runs without any errors.
>
> In the usual scenario, old gen GC is triggered according to the
> configuration at 50% old gen occupancy, and the collector clears out the
> memory over the next minute or so. This happens every 10-15 minutes.
>
> However, I have noticed that sometimes the GC pattern of the slaves
> completely changes and old gen GC is not able to clear the memory.
>
> After observing the GC logs closely for multiple old gen collections, I
> noticed that the old gen GC is triggered at 50% occupancy, but if there is
> a GC Allocation Failure before the collection completes (after CMS Initial
> Remark but before CMS reset), the old gen collection is not able to clear
> much memory. And as soon as this collection completes, another old gen GC
> is triggered.
>
> In worst-case scenarios, this cycle of old gen GC triggering and GC
> allocation failures keeps repeating, the old gen memory keeps increasing,
> and it ends in a single-threaded STW GC, which is not able to do much, and
> I have to restart the Solr server.
>
> The last time this happened it was after the following sequence of events:
>
> 1. We optimized the bigger collection, bringing it to its optimized size
> of ~10G.
> 2. For an unrelated reason, we had stopped indexing to the master. We
> usually index at a low-ish throughput of ~1mil docs/day. This is relevant
> because when we are indexing, the size of the collection increases, and
> this affects the heap used by the collection.
> 3. The slaves started behaving erratically, with the old gen collection
> not being able to free up the required memory and finally getting stuck in
> a STW GC.
>
> As unlikely as this sounds, this is the only thing that changed on the
> cluster. There was no change in query throughput or type of queries.
>
> I restarted the slaves multiple times, but the GC behaved in the same way
> for over three days. Then, when we fixed the indexing and made it live,
> the slaves resumed their original GC pattern and have been running without
> any issues for over 24 hours now.
>
> I would really be grateful for any advice on the following:
>
> 1. What could be the reason behind CMS not being able to free up the
> memory? What are some experiments I can run to solve this problem?
> 2. Can stopping/starting indexing be a reason for such drastic changes to
> the GC pattern?
> 3. I have read in multiple places on this mailing list that the heap size
> should be much lower (2x-3x the size of the collection), but the last time
> I tried that, CMS was not able to run smoothly and a STW GC would occur,
> which was only solved by a restart. My reasoning is that the type of
> queries and the throughput are also factors in deciding the heap size, so
> it may be that our queries are creating too many objects. Is my reasoning
> correct, or should I try a lower heap size (if it helps achieve a stable
> GC pattern)?
>
> (4. Silly question, but what is the right way to ask a question on the
> mailing list? Via mail or via the nabble website? I sent this question
> earlier as a mail, but it was not showing up on the nabble website, so I
> am posting it from the website now.)