Re: Troubleshooting random node latency spikes
Hi Ted,

How long do the latency spikes last when they occur? Have you looked at compactions (nodetool compactionstats) during a spike? Are you also seeing large spikes in the p95 (95th percentile) metrics? p99 catches outliers, which aren't necessarily cause for alarm on their own. Are the nodes showing any other signs of stress (CPU, GC, etc.)? Is anything pending in nodetool tpstats?

Regarding the read repairs: have you tried writing at a higher consistency level to see whether that changes the number of read repairs occurring?

Brooke Jensen
VP Technical Operations & Customer Services
www.instaclustr.com | support.instaclustr.com
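Brooke's p95-vs-p99 point can be made concrete with a small hypothetical sample (a sketch with invented numbers, not data from the thread): a couple of multi-second outliers move p99 by three orders of magnitude while p95 stays at the baseline.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at position ceil(p * n / 100) in sorted order."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# 100 request latencies in microseconds: 98 "normal" requests near the
# ~1000 us baseline Ted reported, plus two 3-second outliers.
latencies = [1000] * 98 + [3_000_000] * 2

print(percentile(latencies, 95))  # 1000 -- p95 is blind to the two outliers
print(percentile(latencies, 99))  # 3000000 -- p99 is dominated by them
```

This is why a p99 spike alone is ambiguous: it could be one slow request in a hundred, which is why Brooke asks whether p95 moves too.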
RE: Troubleshooting random node latency spikes
Is this Java 8 with the G1 garbage collector, or CMS? With Java 7 and CMS, garbage collection can cause delays like the ones you are seeing. I haven't seen that problem with G1, but garbage collection is where I would start looking.

Sean Durity
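One way to check Sean's theory is to turn on GC logging and look for long stop-the-world pauses that line up with the latency spikes. A sketch of Java 8 HotSpot flags, added via JVM_OPTS in cassandra-env.sh (the log path is a placeholder; adjust for your install):

```shell
# Sketch: Java 8 GC-logging flags appended to JVM_OPTS (cassandra-env.sh).
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
# Logs every safepoint pause, not just GC -- useful for spotting stalls.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

# To try G1 instead of CMS on Java 8, something like:
# JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=500"
```

With this in place, grep the GC log for pause times in the multi-second range during a spike window.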
Thanks, Ted
Troubleshooting random node latency spikes
Greetings!

I'm working on setting up a new Cassandra cluster with a write-heavy workload (50% writes), and I've run into a strange spiky-latency problem. My application metrics showed random latency spikes, which I tracked back to spikes on individual Cassandra nodes: ClientRequest.Latency.Read/Write.p99 occasionally jumps to several seconds on one node at a time, instead of its normal value of around 1000 microseconds. I also noticed that ReadRepair.RepairedBackground.m1_rate goes from zero to a non-zero value (around 1-2/sec) during the spike on that node. I'm lost as to why these spikes are happening and hope someone can give me ideas.

I tried to test whether the ReadRepair metric is causally linked to the latency spikes, but even after I changed dclocal_read_repair_chance to 0 on my tables, and even though the metrics then showed no ReadRepair.Attempted, the ReadRepair.RepairedBackground metric still went up during latency spikes. Am I misunderstanding what this metric tracks? I don't understand why it went up if I turned off read repair.

I'm currently running 2.2.6 in a dual-datacenter setup, patched so that metrics are recency-biased instead of tracking latency over the entire lifetime of the Java process. I'm using STCS. There is a large amount of data per node, about 500 GB currently, and I expect each row to be under 10 KB. It's running on heavily overpowered hardware: 512 GB of RAM, RAID 0 on NVMe drives, and 44 cores across 2 sockets. All of my queries (reads and writes) use LOCAL_ONE, with a replication factor of 3.

Thanks,
Ted
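One detail worth double-checking in the experiment above: in 2.2 a table carries two separate read-repair probabilities, and zeroing only the DC-local one leaves the cross-datacenter chance in effect, which could still drive background repairs. A sketch of disabling both in cqlsh (the keyspace/table name is a placeholder, not one from this thread):

```sql
-- "ks.my_table" is a placeholder, not a table from the thread.
ALTER TABLE ks.my_table
    WITH dclocal_read_repair_chance = 0
     AND read_repair_chance = 0;

-- DESCRIBE TABLE ks.my_table;   -- shows the resulting table options
```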