Follow-up question on the performance issue with counter writes: is there a parameter or condition that limits the allocation rate for 'CounterMutationStage'? I see 13-18 MB/s for 4.1.4 vs. 20-25 MB/s for 4.0.5.

The back-end infrastructure is the same for both clusters, as are the test cases and data model.
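For reference, a rough way to watch that stage directly is nodetool tpstats (the credential placeholders below match the ones used later in this thread and are assumptions):

    nodetool -u <username> -pw <pwd> -p <port> tpstats | grep -i CounterMutationStage

The only knob I'm aware of that directly caps counter-write concurrency is concurrent_counter_writes in cassandra.yaml, which sizes that thread pool; whether it explains the allocation-rate difference between 4.0.5 and 4.1.4 is exactly what I'm trying to understand.
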
    On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad 
<j...@jonhaddad.com> wrote:  
 
Hi,

Unfortunately, the numbers you're posting have no meaning without context. The speculative retries could be the cause of a problem, or you could simply be executing enough queries, with fairly high variance in latency, that they are triggered often. It's unclear how many queries per second you're executing, and there's no historical information to suggest whether what you're seeing now is an anomaly or business as usual.

If you want to test your theory that speculative retries are causing your performance issue, you could try changing speculative_retry to a fixed value instead of a percentile, such as 50ms. It's easy enough to try, and you can get an answer to your question almost immediately.
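
As a concrete sketch (the keyspace and table names are placeholders), that's a per-table CQL change:

    ALTER TABLE <keyspace>.<table> WITH speculative_retry = '50ms';

You can switch it back to the percentile-based default afterwards and compare the retry counts and read latencies before and after.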

The problem with this is that you're essentially guessing based on very limited information: the output of a nodetool command you've run "every few seconds". I prefer a more data-driven approach. Get a CPU flame graph and figure out where your time is spent:

https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/

The flame graph will reveal where the time goes, and you can focus on improving that, rather than looking at a random statistic that you've picked.
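
If it helps, this is roughly how I capture one with async-profiler (a sketch; the output path and duration are arbitrary, and the launcher script name varies between async-profiler releases):

    # 60-second CPU profile of the Cassandra JVM, written as an HTML flame graph
    ./profiler.sh -e cpu -d 60 -f /tmp/cassandra-cpu.html $(pgrep -f CassandraDaemon)

The post above walks through the setup in more detail.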

I just gave a talk at SCALE on distributed systems performance troubleshooting. You'll be better off following a methodical process than guessing at potential root causes, because the odds of correctly guessing the root cause in a system this complex are close to zero. My talk is here:

https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on nodetool output with grep. If your cluster is under 6 nodes, you can take advantage of AxonOps's free tier: https://axonops.com/

Good dashboards are essential for these types of problems.

Jon


On Sat, Mar 30, 2024 at 2:33 AM ranju goel <goel.ra...@gmail.com> wrote:

Hi All,

While debugging the cluster for the performance dip seen with 4.1.4, I found a high Speculative retries value in nodetool tablestats during read operations.

I ran the tablestats command below, checked its output every few seconds, and noticed that the retries keep rising. There is also one open ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) that looks similar to this:

/usr/share/cassandra/bin/nodetool -u <username> -pw <pwd> -p <port> tablestats <keyspace> | grep -i 'Speculative retries'

                Speculative retries: 11633
                ..
                ..
                Speculative retries: 13727
                Speculative retries: 14256
                Speculative retries: 14855
                Speculative retries: 14858
                Speculative retries: 14859
                Speculative retries: 14873
                Speculative retries: 14875
                Speculative retries: 14890
                Speculative retries: 14893
                Speculative retries: 14896
                Speculative retries: 14901
                Speculative retries: 14905
                Speculative retries: 14946
                Speculative retries: 14948
                Speculative retries: 14957

I suspect this could be the cause of the performance dip. Please chime in if anyone knows more about it.

Regards
On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user 
<user@cassandra.apache.org> wrote:

We are seeing similar performance issues with counter writes. To reproduce:

cassandra-stress counter_write n=100000 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user=<user> password=<pw> -name <cluster_name> 


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

    On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
<goel.ra...@gmail.com> wrote:  
 
Hi All,

I was going through this mail chain (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was wondering whether this could cause a performance degradation in 4.1 without changing compactionThroughput.

We are seeing a performance dip in reads and writes after upgrading from 4.0 to 4.1.
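
For what it's worth, the compaction throughput cap can be checked and changed at runtime with nodetool, so the theory from that thread should be testable without a restart (the 64 MB/s below is only an illustrative value, not a recommendation):

    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 64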

Regards
Ranju

  
