One more observation …

When we compare read latencies between the non-prod cluster (where nodes were 
removed) and the prod cluster, the node load, as measured by the size of the 
/data dir, is similar, yet read latencies are about 5 times higher in the 
downsized non-prod cluster.

The only difference we see is that prod reads from 4 sstables whereas non-prod 
reads from 5, as shown by cfhistograms. 

Non-prod /data size (one row per node)
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1    885G  454G  432G  52% /data
/dev/nvme0n1    885G  439G  446G  50% /data
/dev/nvme0n1    885G  368G  518G  42% /data
/dev/nvme0n1    885G  431G  455G  49% /data
/dev/nvme0n1    885G  463G  423G  53% /data
/dev/nvme0n1    885G  406G  479G  46% /data
/dev/nvme0n1    885G  419G  466G  48% /data

Prod /data size (one row per node)
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1    885G  352G  534G  40% /data
/dev/nvme0n1    885G  423G  462G  48% /data
/dev/nvme0n1    885G  431G  454G  49% /data
/dev/nvme0n1    885G  442G  443G  50% /data
/dev/nvme0n1    885G  454G  431G  52% /data
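
As a quick sanity check on "node load is similar", the Used column from the two df listings above can be averaged. This is only a sketch: the values are copied by hand from the output, and the variable names are made up for illustration.

```python
from statistics import mean

# "Used" column in GB, copied by hand from the df output above.
non_prod_used_gb = [454, 439, 368, 431, 463, 406, 419]
prod_used_gb = [352, 423, 431, 442, 454]

non_prod_mean = mean(non_prod_used_gb)
prod_mean = mean(prod_used_gb)

print(f"non-prod mean used: {non_prod_mean:.1f} GB")
print(f"prod mean used:     {prod_mean:.1f} GB")
print(f"non-prod / prod:    {non_prod_mean / prod_mean:.2f}")
```

On the nodes listed, the two averages are within a few GB of each other, so data volume per node alone does not look like the differentiator.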

Cfhistograms: comparing prod to non-prod

Non-prod (08:21:38)
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             24.60           4055.27             11864                 4
75%             2.00             35.43          14530.76             17084                 4
95%             4.00            126.93          89970.66             35425                 4
98%             5.00            219.34         155469.30             73457                 4
99%             5.00            219.34         186563.16            105778                 4
Min             0.00              5.72             17.09                87                 3
Max             7.00          20924.30        1386179.89          14530764                 4

Prod (07:41:42)
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             24.60           2346.80             11864                 4
75%             2.00             29.52           4866.32             17084                 4
95%             3.00             73.46          14530.76             29521                 4
98%             4.00            182.79          25109.16             61214                 4
99%             4.00            182.79          36157.19             88148                 4
Min             0.00              9.89             20.50                87                 0
Max             5.00            219.34         155469.30          12108970                 4
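
To put a number on "5 times slower", the read latencies from the two cfhistograms samples above can be divided percentile by percentile. This sketch assumes the 08:21:38 sample is non-prod and the 07:41:42 sample is prod, which is what the SSTables column suggests; the dictionary names are made up for illustration.

```python
# Read latency in microseconds per percentile, copied by hand from the
# cfhistograms output above.
# Assumption: the 08:21:38 sample is non-prod, the 07:41:42 sample is prod.
non_prod_read_us = {
    "50%": 4055.27,
    "75%": 14530.76,
    "95%": 89970.66,
    "98%": 155469.30,
    "99%": 186563.16,
}
prod_read_us = {
    "50%": 2346.80,
    "75%": 4866.32,
    "95%": 14530.76,
    "98%": 25109.16,
    "99%": 36157.19,
}

# Ratio of non-prod to prod read latency at each percentile.
ratios = {p: non_prod_read_us[p] / prod_read_us[p] for p in prod_read_us}

for percentile, ratio in ratios.items():
    print(f"{percentile}: non-prod is {ratio:.1f}x prod")
```

On these numbers the gap is under 2x at the median and roughly 5x to 6x from the 95th percentile up, so the slowdown is mostly in the tail.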

Thank you

From: Fd Habash
Sent: Thursday, February 22, 2018 9:00 AM
Subject: RE: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read 
Latency After Shrinking Cluster

“data was allowed to fully rebalance/repair/drain before the next node was 
taken off?”
Judging by the log messages, the decommission was healthy. For example:
Announcing that I have left the ring for 30000ms
INFO  [RMI TCP Connection(4)-] 2016-01-07 06:00:52,662 – DECOMMISSIONED

I do not believe repairs were run after each node removal. I’ll double-check. 

I’m not sure what you mean by ‘rebalance’. How do you check whether a node is 
balanced? By load/size of the data dir? 

As for the drain, there was no need to drain and I believe it is not something 
you do as part of decomm’ing a node. 

“did you take 1 off per rack/AZ?”
We removed 3 nodes, one from each AZ, in sequence.

These are some of the cfhistograms metrics. Read latencies are high after the 
removal of the nodes.
You can see reads of 186ms at the 99th percentile from 5 sstables. These are 
awfully high numbers given that these metrics measure reads at the C* storage 
layer.

Does this mean removing the nodes undersized the cluster? 

key_space_01/cf_01 histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             24.60           4055.27             11864
75%             2.00             35.43          14530.76             17084
95%             4.00            126.93          89970.66             35425
98%             5.00            219.34         155469.30             73457
99%             5.00            219.34         186563.16            105778
Min             0.00              5.72             17.09                87
Max             7.00          20924.30        1386179.89          14530764

key_space_01/cf_01 histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             29.52           4055.27             11864
75%             2.00             42.51          10090.81             17084
95%             4.00            152.32          52066.35             35425
98%             4.00            219.34          89970.66             73457
99%             5.00            219.34         155469.30             88148
Min             0.00              9.89             24.60                87
Max             6.00           1955.67         557074.61          14530764

Thank you

From: Carl Mueller
Sent: Wednesday, February 21, 2018 4:33 PM
Subject: Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read 
Latency After Shrinking Cluster

Hm, nodetool decommission performs the streamout of the replicated data, and you 
said that was apparently without error...

But if you dropped three nodes from a single five-node AZ/rack with RF3, then 
we are missing a replica unless NetworkTopologyStrategy fails over to another 
AZ. But that would also entail cross-AZ streaming, queries, and repair.

On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <> wrote:
sorry for the idiot questions... 

data was allowed to fully rebalance/repair/drain before the next node was taken 
off?

did you take 1 off per rack/AZ?

On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <> wrote:
One node at a time 

On Feb 21, 2018 10:23 AM, "Carl Mueller" <> wrote:
What is your replication factor? 
Single datacenter, three availability zones, is that right?
You removed one node at a time or three at once?

On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <> wrote:
We had a 15-node cluster across three zones, and cluster repairs using 
‘nodetool repair -pr’ took about 3 hours to finish. Recently, we shrank the 
cluster to 12 nodes. Since then, the same repair job has taken up to 12 hours 
to finish, and most times it never does. 
More importantly, at some point during the repair cycle, we see read latencies 
jumping to 1-2 seconds and applications immediately notice the impact.
stream_throughput_outbound_megabits_per_sec is set to 200 and 
compaction_throughput_mb_per_sec to 64. The /data dir on the nodes is around 
500GB at 44% usage. 
When shrinking the cluster, the ‘nodetool decommission’ runs were uneventful. 
They completed successfully with no issues.
What could make repairs have this much impact following cluster downsizing? 
Taking three nodes out does not seem compatible with such a drastic effect on 
repair and read latency. 
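
For a rough sense of scale, the expected change in per-node data can be estimated from the node counts alone. This is a back-of-the-envelope sketch, assuming RF=3 (not stated in this message; taken from the RF3 mentioned in a reply) and evenly distributed token ownership. The function name is made up for illustration.

```python
def per_node_share(replication_factor: int, node_count: int) -> float:
    """Fraction of the full dataset each node holds, assuming even ownership."""
    return replication_factor / node_count


RF = 3  # assumption: not confirmed in the original message

share_before = per_node_share(RF, 15)  # 15-node cluster
share_after = per_node_share(RF, 12)   # after removing 3 nodes
growth = share_after / share_before - 1

print(f"share before: {share_before:.0%}, after: {share_after:.0%}")
print(f"per-node data grows by {growth:.0%}")
```

Under these assumptions each node ends up with about a quarter more data, which on its own would not account for repairs going from 3 hours to 12.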
Any expert insights will be appreciated. 
Thank you
