[ https://issues.apache.org/jira/browse/CASSANDRA-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-3831:
--------------------------------------
    Fix Version/s:     (was: 1.2)
                       1.1

> scaling to large clusters in GossipStage impossible due to calculatePendingRanges
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3831
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3831
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Critical
>             Fix For: 1.1
>
>         Attachments: CASSANDRA-3831-memoization-not-for-inclusion.txt, CASSANDRA-3831-trunk-group-add-dc-tokens.txt
>
>
> (most observations below are from 0.8, but I just now tested on trunk and I can trigger this problem *just* by bootstrapping a ~180 node cluster concurrently, presumably due to the number of nodes that are simultaneously in bootstrap state)
> It turns out that:
> * (1) calculatePendingRanges is not just expensive, it's computationally complex - cubic or worse
> * (2) it gets called *NOT* just once per node being bootstrapped/leaving etc., but is called repeatedly *while* nodes are in these states
> As a result, clusters start exploding when you reach 100-300 nodes. The GossipStage will get backed up because a single calculatePendingRanges takes seconds, and depending on what the average heartbeat interval is in relation to this, this can lead to *massive* cluster-wide flapping (a toy illustration of this mechanism follows the repro steps below).
> This all started because we hit this in production; several nodes would start flapping several other nodes as down, with many nodes seeing the entire cluster, or a large portion of it, as down. Logging in to some of these nodes you would see that they would be constantly flapping up/down for minutes at a time until one became lucky and stabilized.
> In the end we had to perform an emergency full-cluster restart with gossip patched to force-forget certain nodes in bootstrapping state.
> I can't go into all details here from the post-mortem (just the write-up would take a day), but in short:
> * We graphed the number of hosts in the cluster that had more than 5 nodes marked Down (in a cluster that should have 0 down) on a minutely timeline.
> * We also graphed the number of hosts in the cluster that had GossipStage backed up.
> * The two graphs correlated *extremely* well.
> * jstack sampling showed it being CPU bound, doing mostly sorting under calculatePendingRanges.
> * We were never able to exactly reproduce it with normal RING_DELAY and gossip intervals, even on a 184 node cluster (the production cluster is around 180).
> * Dropping RING_DELAY and in particular dropping the gossip interval to 10 ms instead of 1000 ms, we were able to observe all of the behavior we saw in production.
> So our steps to reproduce are:
> * Launch a 184 node cluster with the gossip interval at 10 ms and RING_DELAY at 1 second.
> * Do something like: {{while [ 1 ] ; do date ; echo decom ; nodetool decommission ; date ; echo done leaving decommed for a while ; sleep 3 ; date ; echo done restarting ; sudo rm -rf /data/disk1/commitlog/* ; sudo rm -rf /data/diskarray/tables/* ; sudo monit restart cassandra ; date ; echo restarted waiting for a while ; sleep 40 ; done}} (or just do a manual decom/bootstrap once, it triggers every time)
> * Watch all nodes flap massively and not recover at all, or only after a *long* time.
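>
> To make that mechanism concrete, here is a minimal, self-contained toy (this is *not* Cassandra code; all class and method names here are made up for illustration): a single-threaded stage processes heartbeat tasks that arrive every 100 ms, and one injected multi-second task - standing in for calculatePendingRanges - delays every heartbeat queued behind it, which is exactly the kind of gap the failure detector then convicts on.
> {code}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class GossipStageBackup
> {
>     public static void main(String[] args) throws Exception
>     {
>         // Single thread, like GossipStage: tasks are processed strictly in order.
>         ExecutorService gossipStage = Executors.newSingleThreadExecutor();
>         final long start = System.nanoTime();
>
>         for (int i = 0; i < 40; i++)
>         {
>             final int n = i;
>             // One expensive task on the stage, like calculatePendingRanges
>             // being triggered by a gossiped bootstrap/leave state change.
>             if (n == 10)
>                 gossipStage.execute(GossipStageBackup::slowPendingRangeCalculation);
>             // Heartbeats keep arriving on schedule but are processed late:
>             // everything queued behind the slow task slips by ~2 seconds.
>             gossipStage.execute(() ->
>                 System.out.printf("heartbeat %d processed at %d ms%n",
>                                   n, (System.nanoTime() - start) / 1_000_000));
>             Thread.sleep(100);
>         }
>         gossipStage.shutdown();
>     }
>
>     // Stand-in for a calculatePendingRanges call that takes seconds.
>     static void slowPendingRangeCalculation()
>     {
>         try { Thread.sleep(2000); } catch (InterruptedException e) { }
>     }
> }
> {code}
> Running it, heartbeats 0-9 are processed within a few ms of arrival, while heartbeats 10 onward are reported up to ~2 seconds late until the queue drains - the same shape as a GossipStage backup stretching a ~25 ms average heartbeat interval past the phi threshold.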
> I observed the flapping using a Python script that every 5 seconds (randomly spread out) asked for unreachable nodes from *all* nodes in the cluster, and printed any nodes and their counts when they had more than 5 unreachables. The cluster can be observed instantly going into massive flapping when leaving/bootstrap is initiated. The script needs Cassandra running with Jolokia enabled for HTTP/JSON access to JMX. I can provide the script if needed, after cleanup.
> The phi conviction, based on logging I added, was legitimate. Using the 10 ms interval, the average heartbeat interval ends up being something like 25 ms. As a result, a single ~2 second delay in the gossip stage is huge in comparison to those 25 ms, and so we go past the phi conviction threshold. This is much more sensitive than in production, but it's the *same* effect, even if it triggers less easily for real.
> The best workaround currently internally is to memoize calculatePendingRanges so that we don't re-calculate if the token metadata and the lists of moving, bootstrapping and leaving nodes are all the same as on the prior calculation (a rough sketch follows after the quoted comment below). It's not entirely clear at this point whether there is a clean fix to avoid executing calculatePendingRanges more than once per unique node in this state. It should be noted though that even if that is fixed, it is not acceptable to spend several seconds doing these calculations on a ~200 node cluster, and it needs to be made fundamentally more efficient.
> Here is a dump of thoughts by me in an internal JIRA ticket (not exhaustive, I just went as far as to show that there is an issue; there might be worse things I missed, but worse than cubic is bad enough that I stopped):
> (Comment uses 0.8 source.)
> {quote}
> Okay, so let's break down the computational complexity here.
> Suppose ring size is {{n}} and the number of bootstrapping/leaving tokens is {{m}}. One of two places that take time (by measurement) is this part of calculatePendingRanges():
> {code}
> // At this stage pendingRanges has been updated according to leave operations. We can
> // now continue the calculation by checking bootstrapping nodes.
> // For each of the bootstrapping nodes, simply add and remove them one by one to
> // allLeftMetadata and check in between what their ranges would be.
> for (Map.Entry<Token, InetAddress> entry : bootstrapTokens.entrySet())
> {
>     InetAddress endpoint = entry.getValue();
>     allLeftMetadata.updateNormalToken(entry.getKey(), endpoint);
>     for (Range range : strategy.getAddressRanges(allLeftMetadata).get(endpoint))
>         pendingRanges.put(range, endpoint);
>     allLeftMetadata.removeEndpoint(endpoint);
> }
> {code}
> I'll ignore stuff that's log(n) or better.
> The outer loop is {{O(m)}}. The inner loop is {{O(n)}}, making the aggregate so far {{O(nm)}}.
> We have a call in there to updateNormalToken(), which implies a sort, which is {{O(n log(n))}}. So now we're at {{O(n log(n) m)}}.
> Next up we call {{getAddressRanges()}}, which immediately does another {{O(n log(n))}} sort; we're still at {{O(n log(n) m)}}. It then iterates (linear) and:
> * calls {{getPrimaryRangeFor()}} for each
> * calls {{calculateNaturalEndpoints}} for each
> The former ends up sorting again, so now we're at {{O(n log(n) n log(n) m)}} (worse than quadratic).
> {{NTS.calculateNaturalEndpoints}} starts by collecting token metadata for nodes in the DC, by using {{updateNormalToken}}, which *implies sorting*. Whoa, whoa. Now we're at {{O(n log(n) n log(n) n log(n) m)}}.
> I might have missed things that are even worse, but this is bad enough to warrant this ticket. To put it into perspective, 168^3 is about 4.7 million.
> {quote}
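>
> For reference, the memoization workaround mentioned above looks roughly like the following. This is a hypothetical sketch, *not* the attached CASSANDRA-3831-memoization-not-for-inclusion.txt patch; the class name, the use of String as a stand-in for Token, and the method signatures are all made up for illustration. The idea is just to snapshot the inputs and skip the expensive recomputation when gossip delivered no relevant change:
> {code}
> import java.net.InetAddress;
> import java.util.*;
>
> public class PendingRangeMemo
> {
>     // Copies of the inputs that the last calculation was based on.
>     private List<Object> lastInputs = null;
>
>     public synchronized void maybeCalculatePendingRanges(SortedMap<String, InetAddress> normalTokens,
>                                                          Map<String, InetAddress> bootstrapTokens,
>                                                          Set<InetAddress> leavingEndpoints,
>                                                          Set<String> movingTokens)
>     {
>         // An O(n) snapshot-and-compare replaces a cubic-or-worse
>         // recalculation whenever the ring state is actually unchanged.
>         List<Object> inputs = Arrays.asList(new TreeMap<>(normalTokens),
>                                             new HashMap<>(bootstrapTokens),
>                                             new HashSet<>(leavingEndpoints),
>                                             new HashSet<>(movingTokens));
>         if (inputs.equals(lastInputs))
>             return; // same token metadata and moving/bootstrapping/leaving sets
>
>         calculatePendingRanges(normalTokens, bootstrapTokens, leavingEndpoints, movingTokens);
>         lastInputs = inputs;
>     }
>
>     private void calculatePendingRanges(SortedMap<String, InetAddress> normalTokens,
>                                         Map<String, InetAddress> bootstrapTokens,
>                                         Set<InetAddress> leavingEndpoints,
>                                         Set<String> movingTokens)
>     {
>         // ... the expensive computation analyzed above ...
>     }
> }
> {code}
> Note that this only removes the redundant repeated calls while nodes sit in a bootstrapping/leaving state; a single calculation is still far too slow on a ~200 node cluster, as noted above.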