On a side note, do you monitor your disk I/O to see whether the disk bandwidth can keep up with the huge spikes in writes? Use dstat during the insert storm to see if you have big values for CPU wait.
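For example, something like this (assuming dstat is installed; the 5-second interval is arbitrary):

  # Timestamps, CPU stats (watch the "wai" column) and per-disk
  # throughput plus utilization, sampled every 5 seconds:
  dstat -tcd --disk-util 5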
On Wed, Aug 3, 2016 at 12:41 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:

> Yes, it looks like you have at least one 100MB partition, which is big
> enough to cause issues. When you do lots of writes to the large partition,
> it is likely to end up getting compacted (as per the log), and compactions
> often use a lot of memory / cause a lot of GC when they hit large
> partitions. This, in addition to the write load, is probably pushing you
> over the edge.
>
> There are some improvements in 3.6 that might help
> (https://issues.apache.org/jira/browse/CASSANDRA-11206), but the 2.2 to
> 3.x upgrade path seems risky at best at the moment. In any event, your
> best solution would be to find a way to make your partitions smaller
> (like 1/10th of the size).
>
> Cheers
> Ben
>
> On Wed, 3 Aug 2016 at 12:35 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I have a theory as to what I think is happening here.
>>
>> There is a correlation between the massive content arriving all at once
>> and our outages.
>>
>> Our scheme uses large buckets of content where we write to a
>> bucket/partition for 5 minutes, then move to a new one. This way we can
>> page through buckets.
>>
>> I think what's happening is that C* is reading the entire partition into
>> memory, then slicing through it... which would explain why it's running
>> out of memory.
>>
>> system.log:WARN [CompactionExecutor:294] 2016-08-03 02:01:55,659
>> BigTableWriter.java:184 - Writing large partition
>> blogindex/content_legacy_2016_08_02:1470154500099 (106107128 bytes)
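To see how big the partitions in that table actually get, something like the following should work on 2.2 (the table name is taken from the log line above; "Compacted partition maximum bytes" in the output is the figure to watch):

  nodetool cfstats blogindex.content_legacy_2016_08_02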
>>
>> On Tue, Aug 2, 2016 at 6:43 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>> We have a 60 node C* cluster running 2.2.7 with about 20GB of RAM
>>> allocated to each C* node. We're aware of the recommended 8GB limit to
>>> keep GCs low, but our memory has been creeping up, probably related to
>>> this bug.
>>>
>>> Here's what we're seeing... if we do a low level of writes, we think
>>> everything generally looks good.
>>>
>>> What happens is that we then need to catch up, and we do a TON of
>>> writes all in a small time window. Then C* nodes start dropping like
>>> flies. Some of them just GC frequently and are able to recover. When
>>> they GC like this we see GC pauses in the 30 second range, which cause
>>> them to stop gossiping for a while, and they drop out of the cluster.
>>>
>>> This happens as a flurry around the cluster, so we're not always able
>>> to catch which ones are doing it as they recover. However, if we have
>>> 3 down, we mostly have a locked-up cluster. Writes don't complete and
>>> our app essentially locks up.
>>>
>>> SOME of the boxes never recover. I'm in this state now. We have 3-5
>>> nodes that are in GC storms which they won't recover from.
>>>
>>> I reconfigured the GC settings to enable jstat.
>>>
>>> I was able to catch it while it was happening:
>>>
>>> ^Croot@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
>>>   S0     S1      E      O      M     CCS    YGC     YGCT   FGC     FGCT      GCT
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191  471 1139.142 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191  471 1139.142 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191  471 1139.142 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191  471 1139.142 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191  471 1139.142 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191  471 1139.142 2825.332
>>> ...
>>>
>>> As you can see, the box is legitimately out of memory. S0, S1, E and O
>>> are all completely full.
>>>
>>> I'm not sure where to go from here. I think 20GB for our workload is
>>> more than reasonable.
>>>
>>> 90% of the time they're well below 10GB of RAM used. While I was
>>> watching this box I was seeing 30% RAM used until it decided to climb
>>> to 100%.
>>>
>>> Any advice on what to do next... I don't see anything obvious in the
>>> logs to signal a problem.
>>>
>>> I attached all the command line arguments we use. Note that I think
>>> the cassandra-env.sh script puts them in there twice.
>>>
>>> -ea
>>> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
>>> -XX:+CMSClassUnloadingEnabled
>>> -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42
>>> -Xms20000M
>>> -Xmx20000M
>>> -Xmn4096M
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> -Xss256k
>>> -XX:StringTableSize=1000003
>>> -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB
>>> -XX:CompileCommandFile=/hotspot_compiler
>>> -XX:CMSWaitDuration=10000
>>> -XX:+CMSParallelInitialMarkEnabled
>>> -XX:+CMSEdenChunksRecordAlways
>>> -XX:CMSWaitDuration=10000
>>> -XX:+UseCondCardMark
>>> -XX:+PrintGCDetails
>>> -XX:+PrintGCDateStamps
>>> -XX:+PrintHeapAtGC
>>> -XX:+PrintTenuringDistribution
>>> -XX:+PrintGCApplicationStoppedTime
>>> -XX:+PrintPromotionFailure
>>> -XX:PrintFLSStatistics=1
>>> -Xloggc:/var/log/cassandra/gc.log
>>> -XX:+UseGCLogFileRotation
>>> -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=10M
>>> -Djava.net.preferIPv4Stack=true
>>> -Dcom.sun.management.jmxremote.port=7199
>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false
>>> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
>>> -XX:+UnlockCommercialFeatures
>>> -XX:+FlightRecorder
>>> -ea
>>> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
>>> -XX:+CMSClassUnloadingEnabled
>>> -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42
>>> -Xms20000M
>>> -Xmx20000M
>>> -Xmn4096M
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> -Xss256k
>>> -XX:StringTableSize=1000003
>>> -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB
>>> -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler
>>> -XX:CMSWaitDuration=10000
>>> -XX:+CMSParallelInitialMarkEnabled
>>> -XX:+CMSEdenChunksRecordAlways
>>> -XX:CMSWaitDuration=10000
>>> -XX:+UseCondCardMark
>>> -XX:+PrintGCDetails
>>> -XX:+PrintGCDateStamps
>>> -XX:+PrintHeapAtGC
>>> -XX:+PrintTenuringDistribution
>>> -XX:+PrintGCApplicationStoppedTime
>>> -XX:+PrintPromotionFailure
>>> -XX:PrintFLSStatistics=1
>>> -Xloggc:/var/log/cassandra/gc.log
>>> -XX:+UseGCLogFileRotation
>>> -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=10M
>>> -Djava.net.preferIPv4Stack=true
>>> -Dcom.sun.management.jmxremote.port=7199
>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false
>>> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
>>> -XX:+UnlockCommercialFeatures
>>> -XX:+FlightRecorder
>>> -Dlogback.configurationFile=logback.xml
>>> -Dcassandra.logdir=/var/log/cassandra
>>> -Dcassandra.storagedir=
>>> -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
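Since -XX:+PrintGCApplicationStoppedTime is enabled above, one rough way to pull the long stop-the-world pauses out of that gc.log (just a sketch; the field holding the seconds value can shift between JDK versions):

  # Print the stopped-time lines where the pause exceeded 1 second:
  awk '/Total time for which application threads were stopped/ && $11 > 1.0' /var/log/cassandra/gc.log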
>>>
>>> --
>>>
>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>> Engineers!
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: San Francisco, CA
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> <https://plus.google.com/102718274791889610666/posts>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798