On a side note, do you monitor your disk I/O to see whether the disk
bandwidth can keep up with the huge spikes in writes? Use dstat during
the insert storm to see if you have big values for CPU wait.
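
For example (assuming dstat is installed; iostat -x gives similar numbers),
something like this samples CPU and disk stats every 5 seconds, with the
"wai" column under cpu showing time spent waiting on I/O:

dstat -tcd 5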

On Wed, Aug 3, 2016 at 12:41 PM, Ben Slater <ben.sla...@instaclustr.com>
wrote:

> Yes, it looks like you have at least one 100MB partition, which is big
> enough to cause issues. When you do lots of writes to the large partition
> it is likely to end up getting compacted (as per the log), and compactions
> often use a lot of memory / cause a lot of GC when they hit large
> partitions. This, in addition to the write load, is probably pushing you
> over the edge.
>
> There are some improvements in 3.6 that might help (
> https://issues.apache.org/jira/browse/CASSANDRA-11206) but the 2.2 to 3.x
> upgrade path seems risky at best at the moment. In any event, your best
> solution would be to find a way to make your partitions smaller (like
> 1/10th of the size).
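>
> Purely as an illustration (the real table and columns will differ; the
> names below are made up), splitting each 5-minute bucket into, say, 10
> sub-buckets would get you roughly that reduction:
>
> cqlsh <<'EOF'
> -- Hypothetical sketch: spread one 5-minute window over 10 smaller partitions.
> CREATE TABLE IF NOT EXISTS blogindex.content_bucketed (
>     bucket     timestamp,  -- start of the 5-minute window, as today
>     sub_bucket int,        -- e.g. hash of the document id modulo 10
>     ts         timeuuid,
>     body       blob,
>     PRIMARY KEY ((bucket, sub_bucket), ts)
> );
> EOF
>
> Paging then has to fan out across the 10 sub-buckets of each window, but
> every partition stays roughly an order of magnitude smaller.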
>
> Cheers
> Ben
>
> On Wed, 3 Aug 2016 at 12:35 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I have a theory as to what I think is happening here.
>>
>> There is a correlation between writing a massive amount of content all at
>> once and our outages.
>>
>> Our scheme uses large buckets of content where we write to a
>> bucket/partition for 5 minutes, then move to a new one.  This way we can
>> page through buckets.
>>
>> I think what's happening is that CS is reading the entire partition into
>> memory, then slicing through it... which would explain why it's running out
>> of memory.
>>
>> system.log:WARN  [CompactionExecutor:294] 2016-08-03 02:01:55,659 BigTableWriter.java:184 - Writing large partition blogindex/content_legacy_2016_08_02:1470154500099 (106107128 bytes)
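>>
>> To get a feel for how common partitions of that size are (assuming
>> nodetool is available on the node), the per-table stats and histograms
>> report compacted partition sizes:
>>
>> nodetool cfstats blogindex.content_legacy_2016_08_02
>> nodetool cfhistograms blogindex content_legacy_2016_08_02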
>>
>> On Tue, Aug 2, 2016 at 6:43 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>> We have a 60-node C* cluster running 2.2.7 with about 20GB of heap
>>> allocated to each node.  We're aware of the recommended 8GB limit to keep
>>> GCs low, but our memory usage has been creeping up, probably related to
>>> this bug.
>>>
>>> Here's what we're seeing: at a low level of writes, we think everything
>>> generally looks good.
>>>
>>> What happens is that we then need to catch up and do a TON of writes all
>>> in a small time window.  Then C* nodes start dropping like flies.  Some of
>>> them just GC frequently and are able to recover.  When they GC like this
>>> we see GC pauses in the 30-second range, which cause them to stop
>>> gossiping for a while and drop out of the cluster.
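>>>
>>> One way to correlate the pauses with nodes dropping (assuming the default
>>> logging configuration) is to grep the long-pause warnings from GCInspector
>>> against the gossip "down" messages in system.log:
>>>
>>> grep -E 'GCInspector|is now DOWN' /var/log/cassandra/system.log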
>>>
>>> This happens as a flurry around the cluster, so we're not always able to
>>> catch which ones are doing it before they recover.  However, if we have 3
>>> nodes down, we mostly have a locked-up cluster.  Writes don't complete and
>>> our app essentially locks up.
>>>
>>> SOME of the boxes never recover.  I'm in this state now.  We have 3-5
>>> nodes that are stuck in GC storms which they won't recover from.
>>>
>>> I reconfigured the GC settings to enable jstat.
>>>
>>> I was able to catch it while it was happening:
>>>
>>> root@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
>>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT       GCT
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142  2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142  2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142  2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142  2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142  2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142  2825.332
>>>
>>> ... as you can see, the box is legitimately out of memory.  S1, Eden, and
>>> old gen are all essentially full.
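>>>
>>> The next time a node wedges like this (assuming the JDK's jmap can attach
>>> to the process), a class histogram should show what is actually filling
>>> the heap:
>>>
>>> sudo -u cassandra jmap -histo 4235 | head -n 30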
>>>
>>> I'm not sure where to go from here.  I think 20GB for our workload is
>>> more than reasonable.
>>>
>>> 90% of the time they're well below 10GB of RAM used.  While I was
>>> watching this box, I was seeing 30% RAM used until it decided to climb to
>>> 100%.
>>>
>>> Any advice on what to do next would be appreciated... I don't see
>>> anything obvious in the logs to signal a problem.
>>>
>>> I attached all the command-line arguments we use.  Note that I think the
>>> cassandra-env.sh script puts them in there twice.
>>>
>>> -ea
>>> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
>>> -XX:+CMSClassUnloadingEnabled
>>> -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42
>>> -Xms20000M
>>> -Xmx20000M
>>> -Xmn4096M
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> -Xss256k
>>> -XX:StringTableSize=1000003
>>> -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB
>>> -XX:CompileCommandFile=/hotspot_compiler
>>> -XX:CMSWaitDuration=10000
>>> -XX:+CMSParallelInitialMarkEnabled
>>> -XX:+CMSEdenChunksRecordAlways
>>> -XX:CMSWaitDuration=10000
>>> -XX:+UseCondCardMark
>>> -XX:+PrintGCDetails
>>> -XX:+PrintGCDateStamps
>>> -XX:+PrintHeapAtGC
>>> -XX:+PrintTenuringDistribution
>>> -XX:+PrintGCApplicationStoppedTime
>>> -XX:+PrintPromotionFailure
>>> -XX:PrintFLSStatistics=1
>>> -Xloggc:/var/log/cassandra/gc.log
>>> -XX:+UseGCLogFileRotation
>>> -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=10M
>>> -Djava.net.preferIPv4Stack=true
>>> -Dcom.sun.management.jmxremote.port=7199
>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false
>>> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
>>> -XX:+UnlockCommercialFeatures
>>> -XX:+FlightRecorder
>>> -ea
>>> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
>>> -XX:+CMSClassUnloadingEnabled
>>> -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42
>>> -Xms20000M
>>> -Xmx20000M
>>> -Xmn4096M
>>> -XX:+HeapDumpOnOutOfMemoryError
>>> -Xss256k
>>> -XX:StringTableSize=1000003
>>> -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB
>>> -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler
>>> -XX:CMSWaitDuration=10000
>>> -XX:+CMSParallelInitialMarkEnabled
>>> -XX:+CMSEdenChunksRecordAlways
>>> -XX:CMSWaitDuration=10000
>>> -XX:+UseCondCardMark
>>> -XX:+PrintGCDetails
>>> -XX:+PrintGCDateStamps
>>> -XX:+PrintHeapAtGC
>>> -XX:+PrintTenuringDistribution
>>> -XX:+PrintGCApplicationStoppedTime
>>> -XX:+PrintPromotionFailure
>>> -XX:PrintFLSStatistics=1
>>> -Xloggc:/var/log/cassandra/gc.log
>>> -XX:+UseGCLogFileRotation
>>> -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=10M
>>> -Djava.net.preferIPv4Stack=true
>>> -Dcom.sun.management.jmxremote.port=7199
>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false
>>> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
>>> -XX:+UnlockCommercialFeatures
>>> -XX:+FlightRecorder
>>> -Dlogback.configurationFile=logback.xml
>>> -Dcassandra.logdir=/var/log/cassandra
>>> -Dcassandra.storagedir=
>>> -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
>>>
>>>
>>> --
>>>
>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>> Engineers!
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> <https://plus.google.com/102718274791889610666/posts>
>>>
>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
