It looks like you are doing good work with this cluster and know a lot
about the JVM, that's good :-).

> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM


That's good hardware too.

With 64 GB of RAM I would probably directly give `MAX_HEAP_SIZE=8G` a try
on one of the 2 bad nodes.

I would also probably try lowering `HEAP_NEWSIZE` to 2G and using
`-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the
effects. That's just an idea of something I would try to see the impact; I
don't expect it to solve your current issue, and it might even make things
worse for this node, hence trying it on one canary first.
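
As a rough sketch, the canary change in cassandra-env.sh could look like
this (illustrative values only, to observe the behaviour, not a
recommendation):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
# replaces the existing -XX:MaxTenuringThreshold=50 entry
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"

Then restart that node only and compare its GC logs / gcviewer graphs with
the other nodes over a day or so.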

Using G1GC would allow you to use a bigger heap. Using C* 2.1 would
allow you to store the memtables off-heap. Those are 2 improvements that
reduce heap pressure and that you might be interested in.
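
For reference, if / when you move to 2.1, off-heap memtables are a one-line
change in cassandra.yaml, and G1GC mostly means swapping the ParNew / CMS
flags. A minimal sketch, to be validated on a canary as well:

# cassandra.yaml (2.1+)
memtable_allocation_type: offheap_objects

# cassandra-env.sh, in place of the ParNew / CMS options
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"

With G1 you would not pin a fixed new gen size (HEAP_NEWSIZE / -Xmn), the
collector sizes the young generation dynamically.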

> I have spent time reading about all other options before including them and
> a similar configuration on our other prod cluster is showing good GC graphs
> via gcviewer.


So, let's look for another reason.

> there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.


- Is memory, CPU or disk a bottleneck? Is one of those running at its
limits?
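
For instance, comparing one of the 2 bad nodes with a healthy one using
something like the commands below should make a bottleneck stand out
quickly (iostat comes from the sysstat package):

iostat -mx 5               # high %util / await on the SSDs --> disk bound
top                        # sustained high %us / %wa --> CPU or iowait bound
free -m                    # swap in use / no page cache left --> memory pressure
nodetool compactionstats   # how far behind compactions are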

> concurrent_compactors: 48


Reducing this to 8 would free some resources for transactions (read & write
requests). It is probably worth a try, all the more since compaction is not
keeping up and compaction throughput is not throttled.

Just found an issue about that:
https://issues.apache.org/jira/browse/CASSANDRA-7139

Looks like `concurrent_compactors: 8` is the new default.
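
If you want to give it a try, the change itself is small. Illustrative
values only (16 MB/s is simply the stock default, tune from there):

# cassandra.yaml (a rolling restart is needed for this one)
concurrent_compactors: 8
compaction_throughput_mb_per_sec: 16    # 0 currently = unthrottled

# the throughput can also be changed live, without a restart
nodetool setcompactionthroughput 16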

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com






2016-03-02 12:27 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:

> Thanks a lot Alain for the details.
>
>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
>> might want to keep this as is, or even reduce it, if you have less than 16 GB
>> of native memory. Go with 8 GB if you have a lot of memory.
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>> far. I had luck with values between 4 <--> 16 in the past. I would give 15 a
>> try.
>> `-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default, 75?
>> Using the default and then tuning from there to improve things is generally a
>> good idea.
>
>
>
> We have a lot of reads and writes on the system, so we keep the new size
> high to make sure enough is held in memory, including caches / memtables etc.
> (number of flush_writers: 4 for us). Similarly, we keep the old generation
> smaller to make sure we spend less time in CMS GC, as most of the data is
> transient in memory for us. We keep a high TenuringThreshold because we want
> objects to die in the young generation rather than be promoted to the old
> generation, given we have configured large survivor spaces.
> We use an occupancyFraction of 70 since, given the new generation is 4G:
> survivor space is 400 MB (x2 survivor spaces), and
> 70% of 2G (old generation) = 1.4G,
> so once we are just below 1.4G and have to move the full survivor +
> some extra during a ParNew GC due to promotion failure, everything will
> fit in the old generation, and CMS will be triggered.
>
> I have spent time reading about all other options before including them
> and a similar configuration on our other prod cluster is showing good GC
> graphs via gcviewer.
>
> tpstats on all machines shows FlushWriter blocked at 0.3% of total
>
> the two nodes in question have stats almost as below
>
>    - specifically there are pending tasks in ReadStage, MutationStage and
>    RequestResponseStage
>
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                        21        19     2141798645         0                 0
> RequestResponseStage              0         1      803242391         0                 0
> MutationStage                     0         0      291813703         0                 0
> ReadRepairStage                   0         0      200544344         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0         292477         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> MigrationStage                    0         0              0         0                 0
> MemoryMeter                       0         0           2172         0                 0
> FlushWriter                       0         0           2756         0                 6
> ValidationExecutor                0         0            101         0                 0
> InternalResponseStage             0         0              0         0                 0
> AntiEntropyStage                  0         0            202         0                 0
> MemtablePostFlusher               0         0           4395         0                 0
> MiscStage                         0         0              0         0                 0
> PendingRangeCalculator            0         0             20         0                 0
> CompactionExecutor                4         4          49323         0                 0
> commitlog_archiver                0         0              0         0                 0
> HintedHandoff                     0         0            116         0                 0
>
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                 36
> PAGED_RANGE                  0
> BINARY                       0
> READ                     11471
> MUTATION                   898
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
>
> all the other 5 nodes show no pending numbers.
>
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
> compaction throughput is 0 MB/s
> concurrent_compactors: 48
> flush_writers: 4
>
>
>> I think Jeff is trying to spot a wide row messing with your system, so
>> looking at the max row size on those nodes compared to other is more
>> relevant than average size for this check.
>
>
> I think this is what you are looking for, please correct me if I am wrong:
>
> Compacted partition maximum bytes: 1629722
> similar value on all 7 nodes.
>
> grep -i "ERROR" /var/log/cassandra/system.log
>
>
> there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.
>
> On Wed, Mar 2, 2016 at 4:15 PM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hi Anishek,
>>
>> Even if it highly depends on your workload, here are my thoughts:
>>
>> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
>> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? You
>> might want to keep this as is, or even reduce it, if you have less than 16 GB
>> of native memory. Go with 8 GB if you have a lot of memory.
>> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
>> far. I had luck with values between 4 <--> 16 in the past. I would give 15 a
>> try.
>> `-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default, 75?
>> Using the default and then tuning from there to improve things is generally a
>> good idea.
>>
>> You also use a bunch of options I don't know about. If you are uncertain
>> about them, you could try a default conf without the options you added, just
>> applying the changes above on top of the default
>> https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh.
>> You might also find useful information in a nice reference on this
>> topic, Al Tobey's blog post about tuning 2.1. Go to the 'Java
>> Virtual Machine' part:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>
>> FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and
>> G1GC. G1GC is supposed to be easier to configure too.
>>
>>> the average row size for compacted partitions is about 1640 bytes on all
>>> nodes. We have replication factor 3 but the problem is only on two nodes.
>>>
>>
>> I think Jeff is trying to spot a wide row messing with your system, so
>> looking at the max row size on those nodes compared to other is more
>> relevant than average size for this check.
>>
>>> the only other thing that stands out in cfstats is the read time and
>>> write time on the nodes with high GC is 5-7 times higher than other 5
>>> nodes, but i think thats expected.
>>
>>
>> I would probably look at this the reverse way: I imagine that the extra GC
>> is a consequence of something going wrong on those nodes, as JVM / GC are
>> configured the same way cluster-wide. GC / JVM issues are often due to
>> Cassandra / system / hardware issues inducing extra pressure on the JVM. I
>> would try to tune JVM / GC only once the system is healthy. I have often seen
>> high GC being a consequence rather than the root cause of an issue.
>>
>> To explore this possibility:
>>
>> Does this command show some dropped or blocked tasks? This would add
>> pressure to heap.
>> nodetool tpstats
>>
>> Do you have errors in logs? Always good to know when facing an issue.
>> grep -i "ERROR" /var/log/cassandra/system.log
>>
>> How are compactions tuned (throughput + concurrent compactors)? This
>> tuning might explain compactions not keeping up or a high GC pressure.
>>
>> What are your disks / CPU? To help us giving you good arbitrary values to
>> try.
>>
>> Is there some iowait ? Could point to a bottleneck or bad hardware.
>> iostat -mx 5 100
>>
>> ...
>>
>> Hope one of those will point you to an issue, but there are many more
>> things you could check.
>>
>> Let us know how it goes,
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>:
>>
>>> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G.
>>>
>>> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com>
>>> wrote:
>>>
>>>> Hey Jeff,
>>>>
>>>> one of the nodes with high GC has 1400 SSTables, all other nodes have
>>>> about 500-900 SSTables. The other node with high GC has 636 SSTables.
>>>>
>>>> the average row size for compacted partitions is about 1640 bytes on
>>>> all nodes. We have replication factor 3 but the problem is only on two
>>>> nodes.
>>>> The only other thing that stands out in cfstats is that the read and
>>>> write times on the nodes with high GC are 5-7 times higher than on the
>>>> other 5 nodes, but I think that's expected.
>>>>
>>>> thanks
>>>> anishek
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>>> wrote:
>>>>
>>>>> Compaction falling behind will likely cause additional work on reads
>>>>> (more sstables to merge), but I’d be surprised if it manifested in super
>>>>> long GC. When you say twice as many sstables, how many is that?
>>>>>
>>>>> In cfstats, does anything stand out? Is max row size on those nodes
>>>>> larger than on other nodes?
>>>>>
>>>>> What you don’t show in your JVM options is the new gen size – if you
>>>>> do have unusually large partitions on those two nodes (especially likely 
>>>>> if
>>>>> you have rf=2 – if you have rf=3, then there’s probably a third node
>>>>> misbehaving you haven’t found yet), then raising new gen size can help
>>>>> handle the garbage created by reading large partitions without having to
>>>>> tolerate the promotion. Estimates for the amount of garbage vary, but it
>>>>> could be “gigabytes” of garbage on a very wide partition (see
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in
>>>>> progress to help mitigate that type of pain).
>>>>>
>>>>> - Jeff
>>>>>
>>>>> From: Anishek Agarwal
>>>>> Reply-To: "user@cassandra.apache.org"
>>>>> Date: Tuesday, March 1, 2016 at 11:12 PM
>>>>> To: "user@cassandra.apache.org"
>>>>> Subject: Lot of GC on two nodes out of 7
>>>>>
>>>>> Hello,
>>>>>
>>>>> we have a Cassandra cluster of 7 nodes, all of them with the same JVM
>>>>> GC configuration; all our writes / reads use the TokenAware policy
>>>>> wrapping a DCAware policy. All nodes are part of the same datacenter.
>>>>>
>>>>> We are seeing that two nodes have high GC collection times. They
>>>>> mostly seem to spend time in GC, about 300-600 ms. This also seems to
>>>>> result in higher CPU utilisation on these machines. The other 5 nodes
>>>>> don't have this problem.
>>>>>
>>>>> There is no additional repair activity going on in the cluster, and we
>>>>> are not sure why this is happening.
>>>>> We checked cfhistograms on the two CFs we have in the cluster and the
>>>>> number of reads seems to be almost the same.
>>>>>
>>>>> We also used cfstats to see the number of sstables on each node, and
>>>>> one of the nodes with the above problem has twice the number of sstables
>>>>> of the other nodes. This still does not explain why two nodes have high
>>>>> GC overheads. Our GC config is as below:
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>>>>
>>>>> # earlier value 131072 = 32768 * 4
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>>>>>
>>>>> #new
>>>>>
>>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>>>>>
>>>>> We are using Cassandra 2.0.17. If anyone has any suggestion as to
>>>>> what else we can look for to understand why this is happening, please do
>>>>> reply.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>> anishek
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
