[ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804362#comment-13804362 ]

Quentin Conner commented on CASSANDRA-6127:
-------------------------------------------

*Analysis*

My first experiments aimed to quantify the length of Gossip messages and 
determine which factors drive the message length.  I found that the size of 
certain gossip messages increases proportionally with the number of vnodes 
(num_tokens in cassandra.yaml).  I recorded message size across the num_tokens 
domain (64, 128, 256, 512, ...) and the node-count domain (32, 64, 128, 256, 
512).  I also made non-rigorous observations of user and kernel CPU (Ubuntu 
10.04 LTS).  My hunch is that both vnode count and node count have a mild 
effect on user CPU resource usage.

What is the rough estimate of bytes sent for certain Gossip messages, and why 
does it matter?  The Phi Accrual Failure Detector (Hayashibara, et al.) 
assumes fixed-length heartbeat messages, while Cassandra uses variable-length 
messages.  I observed a correlation between larger messages, higher vnode 
counts and false-positive detections by the Gossip FailureDetector.  These 
observations, IMHO, are not explained by the research paper.  I formed a 
hypothesis that the false positives are due to jitter in the heartbeat 
interarrival intervals, and wondered whether integrating over a longer 
baseline would reduce the jitter.
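
To make the jitter/baseline idea concrete, here is a rough sketch (not 
Cassandra's actual FailureDetector code) of a phi calculation over a bounded 
window of heartbeat interarrival intervals, using the exponential-distribution 
simplification phi = elapsed / mean / ln(10).  The window size plays the role 
of the "baseline" mentioned above; all names below are illustrative only.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: a phi accrual estimate over a bounded window of
// interarrival intervals.  A longer window smooths jitter in the intervals,
// but also reacts more slowly to genuine changes in heartbeat timing.
public class PhiSketch
{
    private final int windowSize;                           // intervals to integrate over
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    public PhiSketch(int windowSize)
    {
        this.windowSize = windowSize;
    }

    public void recordHeartbeat(long nowMillis)
    {
        if (lastHeartbeatMillis > 0)
        {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > windowSize)
                intervalsMillis.removeFirst();               // drop the oldest interval
        }
        lastHeartbeatMillis = nowMillis;
    }

    public double phi(long nowMillis)
    {
        // With only a handful of samples the mean is a poor estimate and phi
        // is unreliable -- the newly-added-node concern raised below.
        if (intervalsMillis.isEmpty())
            return 0.0;
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / mean / Math.log(10.0);
    }
}
{code}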

I have a second theory to follow up on.  A newly added node will not have a 
long history of Gossip heartbeat interarrival times.  At least 40 samples are 
needed to compute the mean and variance with any statistical significance.  
It's possible the phi estimation algorithm is simply invalid for newly created 
nodes, and that is why we see them flap shortly after creation.

In any case, the message of interest is the GossipDigestAck2 (GDA2), because 
it is the largest of the Gossip messages.  GDA2 contains the set of 
EndpointState maps (node metadata) for newly-discovered nodes, i.e. those 
nodes just added to an existing cluster.  When a node becomes aware of a 
joining node, it gossips that state to three randomly-chosen other nodes.  The 
GDA2 message is tailored to contain only the delta of new node metadata that 
the receiving node is unaware of.

For a single node, the upper limit on GDA message size is roughly 3 * N * k * V, 
where:
N is the number of nodes in the cluster,
V is the number of tokens (vnodes) per node (num_tokens),
k is a constant, approximately 64 bytes, representing a serialized token plus 
some other endpoint metadata.
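
As a back-of-the-envelope illustration of that formula (the fan-out of 3 and 
k = 64 bytes are the rough figures above, not measured constants):

{code:java}
// Back-of-the-envelope estimate of GDA2 gossip volume using the formula above.
// The fan-out of 3 and k = 64 bytes/token are approximations, not measurements.
public class GossipSizeEstimate
{
    static final long K_BYTES_PER_TOKEN = 64;   // serialized token + endpoint metadata
    static final long GOSSIP_FANOUT = 3;        // each round gossips to ~3 peers

    static long gda2UpperBoundBytes(long nodes, long vnodesPerNode)
    {
        return GOSSIP_FANOUT * nodes * vnodesPerNode * K_BYTES_PER_TOKEN;
    }

    public static void main(String[] args)
    {
        // The cluster from this ticket: 1000 nodes, 32 vnodes each -> ~6.1 MB.
        System.out.printf("~%d bytes%n", gda2UpperBoundBytes(1000, 32));
        // 300 nodes at 256 vnodes each -> ~14.7 MB.
        System.out.printf("~%d bytes%n", gda2UpperBoundBytes(300, 256));
    }
}
{code}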

If one is running hundreds of nodes in a cluster, the Gossip message traffic 
created when a node joins can be significant, and it increases with the number 
of nodes.  I believe this to be the first-order effect, and it probably 
violates one of the assumptions of Phi Accrual Failure Detection: that 
heartbeat messages are small enough not to consume a relevant amount of 
compute or communication resources.  The variable transmission time (due to 
variable-length messages) is a clear violation of the assumptions, if I've 
read the source code correctly.

On a related topic, there is a hard-coded limit on the number of vnodes due to 
the serialization of the GDA messages: no more than 1720 vnodes can be 
configured without the serialized vnode String exceeding 32K.  A patch is 
provided below for future use should this become an issue.
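
For a rough sense of where the ~1720 figure comes from, assuming on the order 
of 19 bytes per serialized token (an assumed average; the exact per-token size 
depends on the partitioner, delimiters and per-endpoint overhead):

{code:java}
// Approximate origin of the ~1720 vnode limit: a 32K serialized String holds
// 32 * 1024 = 32768 bytes; at an assumed ~19 bytes per serialized token that
// allows on the order of 1724 tokens before any additional overhead.
public class VnodeLimit
{
    public static void main(String[] args)
    {
        int limitBytes = 32 * 1024;       // 32K String limit noted above
        int bytesPerToken = 19;           // assumed average serialized token size
        System.out.println(limitBytes / bytesPerToken);   // prints 1724
    }
}
{code}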

In clusters with hundreds of nodes, GDA2 messages can reach 200 KB, or 2 MB if 
many nodes join simultaneously.  This is not an issue if the machine 
experiences no latency from competing workloads.  In the real world, nodes are 
added because the cluster load has grown, either in terms of retained data or 
in terms of a high transaction arrival rate.  This means node resources may 
already be fully utilized when adding new nodes is attempted.

It occurred to me that we have another use case to accommodate.  It is common 
to experience transient failure modes, even in modern data centers with 
disciplined maintenance practices.  Ethernet cables get moved; switches and 
routers get rebooted.  BGP route errors and other temporary interruptions may 
occur in the network fabric in real-world scenarios.  People make mistakes, 
plans change, and preventative maintenance often causes short-lived 
interruptions to the network, CPU and disk subsystems.


> vnodes don't scale to hundreds of nodes
> ---------------------------------------
>
>                 Key: CASSANDRA-6127
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6127
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Any cluster that has vnodes and consists of hundreds of 
> physical nodes.
>            Reporter: Tupshin Harper
>            Assignee: Jonathan Ellis
>
> There are a lot of gossip-related issues related to very wide clusters that 
> also have vnodes enabled. Let's use this ticket as a master in case there are 
> sub-tickets.
> The most obvious symptom I've seen is with 1000 nodes in EC2 with m1.xlarge 
> instances. Each node configured with 32 vnodes.
> Without vnodes, cluster spins up fine and is ready to handle requests within 
> 30 minutes or less. 
> With vnodes, nodes are reporting constant up/down flapping messages with no 
> external load on the cluster. After a couple of hours, they were still 
> flapping, had very high cpu load, and the cluster never looked like it was 
> going to stabilize or be useful for traffic.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
