[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827750#comment-13827750 ]

Quentin Conner commented on CASSANDRA-6127:
---

Yes, both use case 1 and use case 2 (detailed in an earlier comment above) were cured by patch #3. Zero flaps were recorded in multiple trials in both use cases. Patch #3 cures the flaps, but does not address the CPU usage symptom. This was tested against the cassandra-1.2 branch. I am conducting the same test against use case 2 today, but using the current cassandra-2.0 branch of source.

vnodes don't scale to hundreds of nodes
---
Key: CASSANDRA-6127
URL: https://issues.apache.org/jira/browse/CASSANDRA-6127
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: Any cluster that has vnodes and consists of hundreds of physical nodes.
Reporter: Tupshin Harper
Assignee: Jonathan Ellis
Attachments: 2013-11-05_18-04-03_no_compression_cpu_time.png, 2013-11-05_18-09-38_compression_on_cpu_time.png, 6000vnodes.patch, AdjustableGossipPeriod.patch, cpu-vs-token-graph.png, delayEstimatorUntilStatisticallyValid.patch, flaps-vs-tokens.png

There are a lot of gossip-related issues related to very wide clusters that also have vnodes enabled. Let's use this ticket as a master in case there are sub-tickets. The most obvious symptom I've seen is with 1000 nodes in EC2 on m1.xlarge instances, each node configured with 32 vnodes. Without vnodes, the cluster spins up fine and is ready to handle requests within 30 minutes or less. With vnodes, nodes report constant up/down flapping messages with no external load on the cluster. After a couple of hours, they were still flapping, had very high CPU load, and the cluster never looked like it was going to stabilize or be useful for traffic.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826949#comment-13826949 ]

Jonathan Ellis commented on CASSANDRA-6127:
---

bq. Untested patch #3. Delays output from FailureDetector until statistically valid number of samples have been obtained.

Did we ever find a scenario where we can demonstrate this patch making a difference? Because I think it's a good idea in theory.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820669#comment-13820669 ]

Jonathan Ellis commented on CASSANDRA-6127:
---

bq. ISTM that FD processing Gossip updates synchronously is a fundamental problem. Any hiccup in processing will cause FD false positives.

I've pulled a fix for this out to CASSANDRA-6338.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816037#comment-13816037 ]

Quentin Conner commented on CASSANDRA-6127:
---

Good morning. We saw the same CPU usage profile with cassandra-1.2 8e7d7285cdeac4f2527c933280d595bbddd26935 (which included the patch to not flush the peers CF). CPU time was spent looking up EndpointState or in the phi calculation. No surprises were found: no race conditions, no deadlocks, no mutex/monitor contention.

I do not know if flapping happens in 1.2 head without vnodes. I will find out today, if I can get the nodes (having trouble this morning allocating from EC2). Will keep trying (Fridays seem better), but it could slip into the weekend...
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816044#comment-13816044 ]

Quentin Conner commented on CASSANDRA-6127:
---

Tupshin, can you further quantify the CPU usage you observed, in terms of user CPU and kernel CPU? Also, can you confirm the number of nodes and vnodes for those observations?

I've seen about 25% user CPU @ 256 nodes and 60% @ 512 nodes. Kernel CPU was under 5% for both in my trials.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815286#comment-13815286 ]

Jonathan Ellis commented on CASSANDRA-6127:
---

ISTM that FD processing Gossip updates synchronously is a fundamental problem. Any hiccup in processing will cause FD false positives. (And even if our own code is perfect, GC pauses can still do this to us.)

Wouldn't it be better if we:
- time heartbeats based on when they arrive instead of when Gossip processes them
- teach FD to recognize that its information is only good up to the most recently processed message -- the absence of messages after that doesn't mean everyone is down unless the Gossip stage is empty
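The two bullets above can be sketched as follows. This is a minimal, illustrative Python sketch, not Cassandra's actual Java implementation: the real FailureDetector uses a full phi-accrual estimate, and the names (`ArrivalWindow`, `should_convict`, the threshold of 8.0) are assumptions chosen to mirror the discussion.

```python
from collections import deque

class ArrivalWindow:
    """Sliding window of heartbeat inter-arrival intervals for one endpoint."""
    def __init__(self, size=1000):
        self.intervals = deque(maxlen=size)
        self.last_arrival = None

    def add(self, arrival_ts):
        # First bullet: arrival_ts is stamped when the message comes off the
        # wire, not when the (possibly backlogged) Gossip stage finally
        # processes it, so processing hiccups don't inflate the intervals.
        if self.last_arrival is not None:
            self.intervals.append(arrival_ts - self.last_arrival)
        self.last_arrival = arrival_ts

    def phi(self, now):
        # Simplified phi: time since last heartbeat divided by mean interval.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        return (now - self.last_arrival) / mean

def should_convict(window, now, gossip_stage_empty, threshold=8.0):
    # Second bullet: a high phi only means "down" once the Gossip stage has
    # caught up; while a backlog exists, our information is merely stale.
    return gossip_stage_empty and window.phi(now) > threshold
```

With heartbeats at t=0..3 and silence afterwards, phi grows with elapsed time, but `should_convict` stays False as long as unprocessed gossip messages remain queued.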
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815297#comment-13815297 ]

Tupshin Harper commented on CASSANDRA-6127:
---

+1. Strongly agree with Jonathan's analysis and proposal.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815379#comment-13815379 ]

Brandon Williams commented on CASSANDRA-6127:
---

At this point, I think we should:
* see if the flapping happens with vnodes (maybe Quentin already knows from his last test)
* see if the flapping happens without vnodes but with the same number of nodes

Because if sum() in ArrivalWindow is burning the most CPU in the Gossiper task (note: not bottlenecking -- each call was at most ~3ms, there were just lots of them), then the problem is no longer tied to vnodes (if it ever was, since sum is per-node, not per-token) and we should probably open a new ticket ("can't start a cluster of size >= X all at once", or similar) and discuss there. We know that clusters much larger than any discussed on this ticket exist, but I don't think any of them have all rebooted at once.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813928#comment-13813928 ]

Quentin Conner commented on CASSANDRA-6127:
---

Good CPU profile results were obtained last night with the 1.2.9 code line. Switching over to the cassandra-1.2 HEAD this morning for up-to-date analysis.

The CPU profile of 1.2.9 showed the bottleneck was computing the sum of the ArrivalWindow deque members (inter-arrival times of gossip messages).
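The bottleneck described here -- re-summing the whole deque on every phi calculation -- has a standard remedy: maintain the sum incrementally as samples enter and leave the window, making the mean O(1) per call. A minimal Python sketch of that idea (the class name and sizes are illustrative, not Cassandra's actual code):

```python
from collections import deque

class RunningSumWindow:
    """Bounded window that maintains its sum incrementally, so mean() costs
    O(1) instead of re-summing the entire deque on every phi calculation."""
    def __init__(self, size=1000):
        self.size = size
        self.values = deque()
        self.total = 0.0

    def add(self, v):
        # Add the new sample and evict the oldest one if the window is full,
        # adjusting the running total on both ends.
        self.values.append(v)
        self.total += v
        if len(self.values) > self.size:
            self.total -= self.values.popleft()

    def mean(self):
        return self.total / len(self.values) if self.values else 0.0
```

With gossip arriving roughly once per second per node, a 1000-sample window across hundreds of endpoints makes the difference between O(n) and O(1) per phi call significant in aggregate, even when each individual sum is only a few milliseconds.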
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813134#comment-13813134 ]

Matt Stump commented on CASSANDRA-6127:
---

As another datapoint/use case: create a 32-node ring with vnodes, decommission one of the nodes, and observe the logs. Every node in the ring will be marked as down by the gossiper, then immediately re-added as up/available.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813147#comment-13813147 ]

Brandon Williams commented on CASSANDRA-6127:
---

bq. Every node in the ring will be marked as down by the gossiper

In which node's view? (or all of them?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813160#comment-13813160 ]

Matt Stump commented on CASSANDRA-6127:
---

We're observing the logs of a random sample of nodes, and on all observed nodes the entire ring is marked as down, so I assume it's happening on all nodes.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813184#comment-13813184 ]

Jonathan Ellis commented on CASSANDRA-6127:
---

How heavy is the read/write load?
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813198#comment-13813198 ]

Matt Stump commented on CASSANDRA-6127:
---

Zero to minimal load: 177 writes/second, 0 reads against the entire ring. m2.4xlarge instances.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813309#comment-13813309 ]

Brandon Williams commented on CASSANDRA-6127:
---

With CASSANDRA-6244 and CASSANDRA-6297 in 1.2 head, I think we need to re-verify that this is still a problem.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811823#comment-13811823 ]

Quentin Conner commented on CASSANDRA-6127:
---

Monday (11/4) I will start capturing CPU profiles with a 256 or 512 node cluster. The plan is to capture with internode compression and without.

I was able to get a semi-reproduction this week in a 256-node cluster -- one node had twice the CPU utilization of the others (20% user versus 10% user). But I had too much logging enabled, which skewed the results.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808260#comment-13808260 ]

Quentin Conner commented on CASSANDRA-6127:
---

Brandon,

You said patch #3 will make it take much longer for a rebooted node to know who's actually up or down, exacerbating CASSANDRA-4288. I've given this some thought and want to see if I understand your concern.

Patch #3 serves to send a zero value for phi, for newly-discovered nodes, until an accurate calculation of variance is complete. This takes 40 seconds and applies to new nodes only. However (and this is what I'm looking for you to confirm): if a new node comes online but is stopped again within 40 seconds of start-up, the FD will not convict it until the end of that 40 seconds. I suspect this occurs less frequently than adding a node to a cluster, but it probably depends on your use case (dev vs. prod).

In my view, we can't escape the math and the need to amass 40 samples. That is why the bug exists today. I agree we should look at tying thrift to a healthy startup as a compensating measure.

Instead of a fixed amount of time (gossip rounds), perhaps we should consider adding a hold-down timer based on a statistical measure? This hold-down timer could be implemented for newly discovered nodes to suppress interaction until Gossip stabilizes. Just as we have a high-water mark for phi to denote failure, we could set a low-water mark and call it a trust threshold. We wouldn't enable thrift communications to the new node until its phi value is below this low-water mark. So the condition for recognizing a new node for thrift purposes could be two-fold:
1. a valid computation of variance (40 samples obtained in the 1000-sample window)
2. an accurate phi value that is indeed below the low-water mark
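The two-fold condition above can be sketched as a small gate in front of the failure detector. This is an illustrative Python sketch, not the actual patch: `MIN_SAMPLES` of 40 comes from the comment, the convict threshold of 8.0 mirrors Cassandra's default phi_convict_threshold, and `PHI_TRUST` is a hypothetical value for the proposed low-water mark.

```python
class TrustGatedEstimator:
    """Gate FD output and thrift eligibility on statistical validity of phi."""
    MIN_SAMPLES = 40     # variance is only meaningful after this many samples
    PHI_CONVICT = 8.0    # existing high-water mark: convict as down above this
    PHI_TRUST = 0.5      # hypothetical low-water mark ("trust threshold")

    def __init__(self):
        self.sample_count = 0

    def add_sample(self):
        # One inter-arrival sample recorded per gossip heartbeat.
        self.sample_count += 1

    def estimate_valid(self):
        return self.sample_count >= self.MIN_SAMPLES

    def reported_phi(self, raw_phi):
        # Patch #3 behavior: report phi as zero until the estimate is valid,
        # so the FD never convicts on a statistically meaningless value.
        return raw_phi if self.estimate_valid() else 0.0

    def trusted_for_thrift(self, raw_phi):
        # Proposed two-fold condition: (1) valid variance computation, and
        # (2) phi actually below the low-water trust threshold.
        return self.estimate_valid() and raw_phi < self.PHI_TRUST
```

The trade-off Brandon raised falls out directly: during the first 40 samples, `reported_phi` is pinned at zero, so a node that dies in that interval cannot be convicted until the window fills.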
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808328#comment-13808328 ]

Brandon Williams commented on CASSANDRA-6127:
---

Let's move that discussion to CASSANDRA-4288, since that change is orthogonal to the actual problem we have here, regardless of whether it fixes it or just papers over the problem.

What we need to do next on this ticket is either correlate a thread dump to what is burning up CPU, or attach a debugger and see where the time is being spent.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808335#comment-13808335 ] Jonathan Ellis commented on CASSANDRA-6127: --- bq. it would be better to limit that in the config instead of failing at an assert later on. Split that out to CASSANDRA-6267.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806703#comment-13806703 ] Chris Burroughs commented on CASSANDRA-6127: bq. I'd just set a max of 1024. No one could ever need more than that. (Famous last words) Isn't that equivalent to saying no one will have a heterogeneous cluster with more than a 1024/256 = 4 performance delta between physical nodes? SSD vs spinny could account for more than that.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806796#comment-13806796 ] Jonathan Ellis commented on CASSANDRA-6127: --- I'm okay with that limitation. Intuitively it's reasonable that C* can't compensate for really ridiculous performance differences. (Of course, you could also reduce the weak nodes below 256.)
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805267#comment-13805267 ] Chris Burroughs commented on CASSANDRA-6127: It would be helpful to dump the interval times for a node that is flapping (dumpInterArrivalTimes on the FD) so we can see how long the heartbeats are taking. A per-endpoint histogram of heartbeat arrival latency seems a worthwhile o.a.c.Metric to have all the time. [~qconner] On the topic of "wait until there is enough data before doing stuff", you might also be interested in the heuristic report from CASSANDRA-4288
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805838#comment-13805838 ] Quentin Conner commented on CASSANDRA-6127: --- I grabbed some sample log files from 10 nodes of 256 in a run today. [flap-intervals.tar.gz|http://qconner.s3.amazonaws.com/flap-intervals.tar.gz] Convictions are happening with only 1 to 5 intervals recorded. Patch #3 is looking like the winner but we should do the math by hand to be sure (volunteers?). Also, I just tested [Patch #3|https://issues.apache.org/jira/secure/attachment/12610117/delayEstimatorUntilStatisticallyValid.patch] and found 0 flaps for the same setup as yesterday (256 nodes, phi=8, normal 1000 ms gossip period).
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805880#comment-13805880 ] Jonathan Ellis commented on CASSANDRA-6127: --- Patch 1 will break things since later on we write the length of the string as two bytes. I think we're fine with 1700 vnodes per machine TBH, although it would be better to limit that in the config instead of failing at an assert later on.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805890#comment-13805890 ] Tupshin Harper commented on CASSANDRA-6127: --- I'd just set a max of 1024. No one could ever need more than that. (Famous last words)
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805928#comment-13805928 ] Brandon Williams commented on CASSANDRA-6127: - Patch #3 will make it take much longer for a rebooted node to know who's actually up or down, exacerbating CASSANDRA-4288. I'd still like to know *why* things are taking longer with vnodes, and I'm especially hesitant to make any adjustments to the gossiper or FD since we know they work fine with single tokens, and also because they *have no knowledge about tokens*, it's just another opaque state to them. I suspect something in StorageService is blocking the gossiper long enough to cause this, perhaps CASSANDRA-6244 or something similar.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805977#comment-13805977 ] Jonathan Ellis commented on CASSANDRA-6127: --- Couldn't we tie the thrift/native server startup to "I have enough gossip data now"?
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805981#comment-13805981 ] Brandon Williams commented on CASSANDRA-6127: - That might confuse autodiscovery clients, at least without further changes.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804361#comment-13804361 ] Quentin Conner commented on CASSANDRA-6127: --- *Background and Reproduction* The symptom is evident from the presence of "is now DOWN" messages in the Cassandra system.log file. The recording of a node DOWN is often followed by a node UP a few seconds later. Users have coined this phenomenon a "gossip flap", and the occurrence of gossip flaps has both a machine and a human consequence. Humans react strongly to the (temporary) marking of a node down; automated monitoring may trigger SNMP traps, etc. A busy node that doesn't transmit heartbeat gossip messages on time will be marked as down though it may still be performing useful work. Machine reactions include other C* nodes buffering row mutations and storing hints on disk when another node is marked down. I have not explored the machine reactions, but imagine the endpointSnitch could also be affected from the client frame of reference. One piece of good news is that I was able to reproduce two different use cases that elicit the "is now DOWN" message in Log4J log files.

Use Case #1 is as follows:
- provision 256 or 512 nodes in EC2
- install Cassandra 1.2.9
- take defaults except specify num_tokens=256 in c*.yaml
- start one node at a time

Use Case #2 is as follows:
- provision 32 nodes in EC2
- install Cassandra 1.2.9
- take defaults in c*.yaml
- configure rack
- start one node at a time
- when all nodes are up, create about 1GB of data, e.g. tools/bin/cassandra-stress -c 20 -l 3 -n 100
- provision a 33rd (extra) node in EC2
- install Cassandra 1.2.9
- take defaults except specify num_tokens=256
- start the node (auto_bootstrap=true)
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804365#comment-13804365 ] Quentin Conner commented on CASSANDRA-6127: --- *Feature Suggestion* The current Gossip failure detector is characterized by a sliding window of elapsed time, a heartbeat message period and a PHI threshold used to make the continuous random variable (lower-case phi) into a dichotomous (binary) random variable. That PHI (upper-case) threshold is called phi_convict_threshold. I don't have a better mathematical theory or derivation at this writing, but I do have an easy workaround for your consideration. While phi_convict_threshold is adjustable, the period (or frequency) of Gossip messages is not. Adjusting the gossip period to integrate over a longer time baseline reduced false positives from the Gossip failure detector. The side effect is an increase in the elapsed time to detect a legitimately-failed node. Depending on user workload characteristics, and the related sources of latency (CPU, disk and network activity or transient delays) cited above, a System Architect could present a reasonable use case for controlling the Gossip message period. The goal would be to set a detection window that accommodates common occurrences for a given deployment scenario. Not all data centers are created equal. Patches and results from implementation will follow in subsequent posts.

*Potential Next Steps*
- Explore concern about sensitivity to gossip period. Do the vnode gossip messages exceed capacity for peers to ingest?
- Explore concern about phi estimates from un-filled (new) deques.
- Explore concern about assuming a Gaussian PDF. Networks (not computers) generally characterize expected arrival time by a Poisson distribution, not Gaussian.
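The interaction between the gossip period and phi_convict_threshold can be sketched numerically. Below is a minimal, hypothetical illustration (not Cassandra's actual FailureDetector class) assuming the exponential inter-arrival simplification, under which phi grows linearly with the silence since the last heartbeat divided by the mean observed interval:

```java
// Hypothetical sketch of a phi accrual calculation under an exponential
// inter-arrival assumption: phi = -log10(P(still silent at t))
//                               = (timeSinceLastHeartbeat / meanInterval) / ln(10).
public class PhiSketch {
    // msSinceLast: milliseconds since the last heartbeat was received.
    // meanIntervalMs: mean inter-arrival time from the sliding window.
    static double phi(double msSinceLast, double meanIntervalMs) {
        return msSinceLast / meanIntervalMs / Math.log(10.0);
    }

    public static void main(String[] args) {
        // Doubling the gossip period (and hence the mean interval) halves phi
        // for the same silence, so a fixed phi_convict_threshold tolerates
        // proportionally more jitter before convicting a peer.
        System.out.println(phi(9000, 1000)); // ~3.9
        System.out.println(phi(9000, 2000)); // ~2.0
    }
}
```

This is the arithmetic behind the workaround: lengthening the period stretches the detection window, trading slower detection of genuinely dead nodes for fewer false convictions.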
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804362#comment-13804362 ] Quentin Conner commented on CASSANDRA-6127: --- *Analysis* My first experiments aimed to quantify the length of Gossip messages and determine what factors drive the message length. I found the size of certain gossip messages increases proportionally with the number of vnodes (num_tokens in c*.yaml). I recorded message size over the num_tokens and number-of-nodes domains (64, 128, 256, 512, ...) for tokens and (32, 64, 128, 256, 512) for nodes. I also made non-rigorous observations of user and kernel CPU (Ubuntu 10.04 LTS). My hunch is that both vnode count and node count have a mild effect on user CPU resource usage.

What is the rough estimate of bytes sent for certain Gossip messages, and why does this matter? The Phi Accrual Failure Detector (Hayashibara, et al.) assumes fixed-length heartbeat messages while Cassandra uses variable-length messages. I observed a correlation between larger messages, higher vnode counts and false positive detections by the Gossip FailureDetector. These observations, IMHO, are not explained by the research paper. I formed a hypothesis that the false positives are due to jitter in the interval values. I wondered if perhaps integrating over a longer baseline would reduce the jitter.

I have a second theory to follow up on. A newly added node will not have a long history of Gossip heartbeat inter-arrival times. At least 40 samples are needed to compute mean and variance with any statistical significance. It's possible the phi estimation algorithm is simply invalid for newly created nodes, and that is why we see them flap shortly after creation.

In any case, the message of interest is the GossipDigestAck2 (GDA2) because it is the largest of the Gossip messages. GDA2 contains the set of EndpointStateMaps (node metadata) for newly-discovered nodes, i.e. those nodes just added to an existing cluster. When each node becomes aware of a joining node, it gossips it to three randomly-chosen other nodes. The GDA2 message is tailored to contain the delta of new node metadata the receiving node is unaware of. For a single node, the upper limit on GDA message size is roughly 3 * N * k * V, where N is the number of nodes in the cluster, V is the number of tokens (vnodes) per node, and k is a constant value, approximately 64 bytes, that represents a serialized token plus some other endpoint metadata.

If one is running hundreds of nodes in a cluster, the Gossip message traffic created when a node joins can be significant and increases with the number of nodes. I believe this to be the first-order effect, and it probably violates one of the assumptions of PHI Accrual Failure Detection: that heartbeat messages are small enough not to consume a relevant amount of compute or communication resources. The variable transmission time (due to variable-length messages) is a clear violation of assumptions, if I've read the source code correctly.

On a related topic, there is a hard-coded limitation on the number of vnodes due to the serialization of the GDA messages. No more than 1720 vnodes can be configured without creating a greater-than-32K serialized String vnode message. A patch is provided below for future use should this become an issue.

In clusters with hundreds of nodes, GDA2 messages can be 200 KB, or 2 MB if many nodes join simultaneously. This is not an issue if the computer experiences no latency from competing workloads. In the real world, nodes are added because the cluster load has grown in terms of retained data, or in terms of a high transaction arrival rate. This means node resources may be fully utilized when adding new nodes is typically attempted.

It occurred to me that we have another use case to accommodate. It is common to experience transient failure modes, even in modern data centers with disciplined maintenance practices. Ethernet cables get moved, switches and routers rebooted. BGP route errors and other temporary interruptions may occur with the network fabric in real-world scenarios. People make mistakes, plans change, and preventative maintenance often causes short-lived interruptions with network, CPU and disk subsystems.
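The size bound quoted in this comment lends itself to a quick back-of-envelope calculation. The following sketch simply encodes the 3 * N * k * V formula as stated; the class and method names are hypothetical, and the 64-byte k is the comment's rough assumption, not a measured constant:

```java
// Back-of-envelope estimate of the GDA2 traffic bound described above:
// 3 * N * k * V, with N nodes in the cluster, V vnodes per node, and
// k ~ 64 bytes of serialized token plus endpoint metadata (an assumption).
public class GossipSizeEstimate {
    static long gda2UpperBoundBytes(long nodes, long vnodesPerNode, long bytesPerToken) {
        return 3L * nodes * bytesPerToken * vnodesPerNode;
    }

    public static void main(String[] args) {
        // A single node's state at 256 vnodes is ~ 64 * 256 = 16 KB before
        // the 3 * N fan-out factor; across a 256-node cluster the bound
        // lands in the megabyte range.
        long bytes = gda2UpperBoundBytes(256, 256, 64);
        System.out.println(bytes / (1024 * 1024) + " MB upper bound"); // 12 MB upper bound
    }
}
```

The point of the estimate is the scaling, not the exact constant: the bound grows with both the node count and the vnode count, which is why joins in large vnode clusters generate disproportionate gossip traffic.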
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804459#comment-13804459 ] Quentin Conner commented on CASSANDRA-6127: --- First results with workaround patch #2. No load, no data; only the system keyspace and Gossip on a 256-node m1.medium cluster in EC2. Nodes started in rapid succession.

*phi=8, variable gossip_period*
- 1154 flaps for 1 sec
- 685 flaps for 2 sec
- 146 flaps for 3 sec
- 88 flaps for 4 sec
- 70 flaps for 5 sec
- 100 flaps for 10 sec

*phi=12*
- 1289 flaps for 1 sec
- 77 flaps for 2 sec
- 6 flaps for 3 sec
- 1 flap for 4 sec
- 3 flaps for 5 sec
- 1 flap for 6 sec
- 0 flaps for 8 sec
- 1 flap for 10 sec
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804545#comment-13804545 ] Brandon Williams commented on CASSANDRA-6127: - It would be helpful to dump the interval times for a node that is flapping (dumpInterArrivalTimes on the FD) so we can see how long the heartbeats are taking. If some are excessively long, we need to get thread dumps/debugger timings from the gossiper to see if something is blocking it or taking a long time before changing any fundamentals (gossip interval, FD formula) that we already know work in principle without vnodes. Increasing the payload size to 32k shouldn't cause these problems, since that is only sent during initial state synchronization and isn't all that large to begin with.
[jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804770#comment-13804770 ] Brandon Williams commented on CASSANDRA-6127: - Can you see if adding -Dcassandra.unsafesystem=true allows the cluster to stabilize at some point?
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13783372#comment-13783372 ] Jonathan Ellis commented on CASSANDRA-6127: --- bq. After a couple of hours, they were still flapping, had very high cpu load. To clarify, this is a bit of a mashup of multiple observations: when there was zero traffic on the cluster, we were seeing flapping without very high cpu; on smaller tests, we saw much higher cpu than expected when under load.
[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13783481#comment-13783481 ] Darla Baker commented on CASSANDRA-6127: Per Jonathan's request, I'm adding an update here regarding eBay's experience on https://support.datastax.com/tickets/6928, which was the result of the first stage of executing the plan from https://support.datastax.com/requests/6636. He had an existing 32-node DSE 3.1.0 cluster in their PHX data center. The plan was to add a second, 50-node data center to the cluster in SLC with vnodes enabled. They began by bringing all the nodes up with auto bootstrapping turned off, to prevent any data streaming until they were ready to make the other changes needed to bring the data center fully online. Almost immediately after the SLC nodes came up, the nodes in PHX began reporting as down, and he began receiving SMS messages and calls from application engineers reporting that the application which uses that cluster was down. As we were in triage mode, the most expedient course of action was to shut down the SLC nodes and remove them from gossip. Upon trying to execute the nodetool removenode command we hit CASSANDRA-5857, although up to that point we had thought nodetool decommission was responsible for the issue. In any case, we started executing the workaround described in that ticket. When we parted, the process was going slowly, but he reported it was working: the nodes were disappearing from the ring, and the application engineers were reporting that the application was back online. At some point during the weekend, Alex reached out to Jeremy, who was on call and was finally able to get the nodes removed from gossip, fully stabilize the 32-node PHX data center, and fully decommission the SLC data center. Alex attached some logs to the ticket during the event. We were seeing node flapping and NPEs during the event.
Ticket https://support.datastax.com/tickets/6917 contains some additional details on the test cases. Ticket https://support.datastax.com/tickets/6939 contains the alternate plan that eBay is considering in light of the difficulties encountered with bringing SLC online.