Re: Troubleshooting random node latency spikes

2017-01-23 Thread Brooke Jensen
Hi Ted.

How long are the latency spikes when they occur?  Have you investigated
compactions (nodetool compactionstats) during the spike?
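For example, something along these lines on the affected node while a spike
is in progress (just a sketch, assuming shell access to the node):

    # poll compactions every 5 seconds during the spike
    watch -n 5 nodetool compactionstats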

Are you also seeing large latency spikes in the p95 (95th percentile)
metrics? p99 catches outliers, which aren't always cause for alarm.

Are the nodes showing any other signs of stress? CPU, GC, etc? Is there
anything pending in nodetool tpstats?
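For example (subcommand names as in 2.2; gcstats availability may vary by
version):

    nodetool tpstats   # look for non-zero Pending/Blocked counts per stage
    nodetool gcstats   # GC elapsed time and max pause since the last call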

Regarding the read repairs, have you tested writing at a higher consistency
level to see if that changes the number of read repairs occurring?
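For example, from cqlsh (keyspace and table names here are just
placeholders):

    CONSISTENCY LOCAL_QUORUM;
    INSERT INTO my_ks.my_table (id, val) VALUES (1, 'x');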


*Brooke Jensen*
VP Technical Operations & Customer Services
www.instaclustr.com | support.instaclustr.com


RE: Troubleshooting random node latency spikes

2017-01-17 Thread SEAN_R_DURITY
Is this Java 8 with the G1 garbage collector or CMS? With Java 7 and CMS, 
garbage collection can cause delays like you are seeing. I haven’t seen that 
problem with G1, but garbage collection is where I would start looking.
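If you want to check, one option (a sketch; the log path is just an example)
is to enable GC logging in cassandra-env.sh and look for long pauses during
the spikes:

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"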


Sean Durity


Troubleshooting random node latency spikes

2017-01-05 Thread Ted Pearson
Greetings!
I'm working on setting up a new Cassandra cluster with a write-heavy workload
(50% writes), and I've run into a strange spiky latency problem. My application
metrics showed random latency spikes, which I tracked back to spikes on
individual Cassandra nodes. ClientRequest.Latency.Read/Write.p99 occasionally
jumps on one node at a time to several seconds, instead of its normal value of
around 1,000 microseconds (1 ms). I also noticed that
ReadRepair.RepairedBackground.m1_rate goes from zero to a non-zero rate (around
1-2/sec) during the spike on that node. I'm lost as to why these spikes are
happening and hope someone can give me ideas.
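For reference, I'm reading these over JMX; if I have the bean names right,
they are:

    org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency
    org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
    org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBackground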
I attempted to test whether the ReadRepair metric is causally linked to the
latency spikes. But even after I changed dclocal_read_repair_chance to 0 on my
tables, and even though the metrics showed no ReadRepair.Attempted, the
ReadRepair.RepairedBackground metric still went up during latency spikes. Am I
misunderstanding what this metric tracks? I don't understand why it went up
when I had turned off read repair.
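For the record, the change I made was along these lines (table name is a
placeholder):

    ALTER TABLE my_ks.my_table WITH dclocal_read_repair_chance = 0.0;
    -- read_repair_chance is a separate option that applies across all DCs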
I'm currently running Cassandra 2.2.6 in a dual-datacenter setup. It's patched
to allow metrics to be recency-biased instead of tracking latency over the
entire lifetime of the Java process. I'm using STCS. There is a large amount of
data per node, about 500GB currently. I expect each row to be less than 10KB.
It's currently running on way overpowered hardware: 512GB of RAM, RAID 0 on
NVMe, and 44 cores across 2 sockets. All of my queries (reads and writes) are
LOCAL_ONE, and I'm using a replication factor of 3.
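Concretely, the keyspace is defined along these lines (keyspace and datacenter
names are placeholders):

    CREATE KEYSPACE my_ks WITH replication =
        {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};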

Thanks,
Ted