Our experience with G1GC was that a 31 GB heap wasn’t optimal (for us): you get 
less frequent full GCs, but they are bigger when they do happen.  But even so, 
not to the point of a 9.5 s full collection.

Unless it is a rare event associated with something weird happening outside of 
the JVM (there are some wacky interactions between memory and dirty-page 
writeback that could cause it, but not typically), that is evidence of a really 
tough fight to reclaim memory.  There are a lot of things that can impact 
garbage collection performance.  Something is either being pushed very hard, or 
something is being constrained very tightly relative to resource demand.
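
To put numbers on how bad it is before going further, here is a rough sketch 
that summarizes the pause durations Cassandra’s GCInspector writes to 
system.log.  The log path and the exact message format are assumptions on my 
part, so adjust for your install:

# Rough sketch: summarize the GC pauses that Cassandra's GCInspector logs.
# Assumes messages of the form "... GCInspector ... GC in <N>ms ..." and the
# default log location -- both may differ on your install.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"
pause_re = re.compile(r"GCInspector.*?GC in (\d+)ms")

pauses = []
with open(log_path, errors="replace") as f:
    for line in f:
        m = pause_re.search(line)
        if m:
            pauses.append(int(m.group(1)))

if pauses:
    pauses.sort()
    print(f"{len(pauses)} GC pauses logged")
    print(f"max {pauses[-1]} ms, p99 {pauses[int(len(pauses) * 0.99)]} ms")
    print(f"pauses of 1s or more: {sum(1 for p in pauses if p >= 1000)}")
else:
    print("no GCInspector pause lines found -- check the log path and format")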

I’m with Erick: I wouldn’t put my attention on anything but the GC issue right 
now. Everything else that happens within the JVM envelope is going to be a 
misread on timing until you have stable garbage collection.  You might have 
other issues later, but you aren’t going to know what those are yet.

One thing you could at least try to eliminate quickly as a factor: are repairs 
running at the time that things are slow?  Prior to 3.11.5 you lack one of the 
tuning knobs for trading off memory against network bandwidth when doing 
repairs.
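
A quick way to check whether repair activity lines up with the slow periods is 
a sketch like the one below.  It assumes nodetool is on the PATH of the node 
you are looking at; AntiEntropyStage work and Validation compactions both show 
up while repairs are running:

# Rough sketch: look for signs of repair activity on this node.
# Assumes nodetool is on the PATH; AntiEntropyStage work and Validation
# compactions both indicate repair traffic.
import subprocess

def nodetool(*args):
    return subprocess.run(["nodetool", *args], capture_output=True, text=True).stdout

for line in nodetool("tpstats").splitlines():
    if "AntiEntropy" in line:
        print("tpstats          :", line.strip())

for line in nodetool("compactionstats").splitlines():
    if "Validation" in line or "pending tasks" in line:
        print("compactionstats  :", line.strip())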

I’d also make sure you have tuned C* to migrate whatever you reasonably can 
off-heap.
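
For reference, here is a sketch of what I would look at in cassandra.yaml.  The 
setting names are as I recall them for 3.11, so verify against the 
documentation for your version; it also assumes PyYAML is installed:

# Rough sketch: report the cassandra.yaml settings that control how much memory
# lives on-heap vs off-heap. Setting names are as I recall them for 3.11 --
# verify against the documentation for your version.
import sys
import yaml  # pip install pyyaml

path = sys.argv[1] if len(sys.argv) > 1 else "/etc/cassandra/cassandra.yaml"
with open(path) as f:
    conf = yaml.safe_load(f) or {}

settings = (
    "memtable_allocation_type",       # offheap_buffers / offheap_objects move memtable data off-heap
    "memtable_heap_space_in_mb",
    "memtable_offheap_space_in_mb",
    "file_cache_size_in_mb",          # chunk cache, allocated off-heap
    "buffer_pool_use_heap_if_exhausted",
)
for key in settings:
    print(f"{key}: {conf.get(key, '<not set, using default>')}")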

Another thought for surprise demands on memory: I don’t know if this is in 
3.11.0, so you’ll have to check the C* bash scripts for launching the service.  
The number of malloc arenas hasn’t always been curtailed, and that can result 
in an explosion in memory demand.  I just don’t recall where in C* version 
history that was addressed.
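
If you would rather check the running process than the scripts, something like 
this reads the environment the JVM was actually started with (a Linux-only 
sketch; MALLOC_ARENA_MAX is the glibc knob in question):

# Rough sketch (Linux-only): check whether the running Cassandra JVM was started
# with MALLOC_ARENA_MAX set. Pass the Cassandra PID as the only argument.
import sys

pid = sys.argv[1]
with open(f"/proc/{pid}/environ", "rb") as f:
    env = dict(
        entry.split(b"=", 1)
        for entry in f.read().split(b"\0")
        if b"=" in entry
    )

value = env.get(b"MALLOC_ARENA_MAX")
if value:
    print("MALLOC_ARENA_MAX =", value.decode())
else:
    print("MALLOC_ARENA_MAX is not set -- glibc can create up to 8 arenas per core")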


From: Erick Ramirez <erick.rami...@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, February 26, 2020 at 9:55 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Hints replays very slow in one DC

Nodes are going down due to Out of Memory and we are using a 31GB heap size in 
DC1; however, DC2 (which serves the traffic) has a 16GB heap. We had to 
increase the heap in DC1 because DC1 nodes were going down with Out of Memory 
errors, but DC2 nodes never went down.

It doesn't sound right that the primary DC is DC2 but DC1 is under load. You 
might not be aware of it but the symptom suggests DC1 is getting hit with lots 
of traffic. If you run netstat (or whatever utility/tool of your choice), you 
should see established connections to the cluster. That should give you clues 
as to where it's coming from.
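
For example, here is a rough sketch that counts established connections to the 
native transport port, grouped by client address. It assumes the default port 
9042 and that netstat is available, so adjust for your environment:

# Rough sketch: count established client connections to the native transport
# port, grouped by remote address. Assumes the default port 9042 and that
# netstat is available -- adjust for your environment.
import subprocess
from collections import Counter

PORT = ":9042"
out = subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout

clients = Counter()
for line in out.splitlines():
    fields = line.split()
    # netstat -tn rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
    if len(fields) >= 6 and fields[5] == "ESTABLISHED" and fields[3].endswith(PORT):
        clients[fields[4].rsplit(":", 1)[0]] += 1

for ip, count in clients.most_common():
    print(f"{count:6d}  {ip}")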

We also noticed below kind of messages in system.log
FailureDetector.java:288 - Not marking nodes down due to local pause of 
9532654114 > 5000000000

That's another smoking gun that the nodes are buried in GC. The pause in that 
message is reported in nanoseconds, so 9532654114 works out to a 9.5-second 
pause, which is significant. The slow hinted handoffs are really the least of 
your problems right now. If nodes weren't going down, there wouldn't be hints 
to hand off in the first place. Cheers!
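
If you want to see how often that is happening, a rough sketch like this will 
pull those messages out of system.log and convert them to seconds (the log 
path is an assumption, and the pattern comes from the line quoted above):

# Rough sketch: pull the FailureDetector "local pause" messages out of system.log
# and convert them to seconds. The pattern comes from the line quoted above; the
# log path is an assumption -- adjust as needed.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"
pause_re = re.compile(r"local pause of (\d+) > (\d+)")

with open(log_path, errors="replace") as f:
    for line in f:
        m = pause_re.search(line)
        if m:
            seconds = int(m.group(1)) / 1e9   # the logged values are nanoseconds
            print(f"{seconds:6.2f}s pause: {line.strip()}")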

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on https://community.datastax.com/.
