Looking at the code more... it seems the issue is here, in AssignmentManager.processDeadServersAndRegionsInTransition():
    ...
    failoverCleanupDone();
    if (!failover) {
      // Fresh cluster startup.
      LOG.info("Clean cluster startup. Assigning user regions");
      assignAllUserRegions(allRegions);
    }
    ...

As soon as a single node has failed, failover is set to true, so no assignment is done here. Later the servers are recovered from the "crash" (which a clean restart also counts as), and then all unassigned regions (all of them at this point, since none were assigned earlier) are assigned round-robin.

On Tue, Mar 14, 2017 at 1:54 PM, Lars George <lars.geo...@gmail.com> wrote:
> Hi,
>
> I had this happen at multiple clusters recently, where after the
> restart the locality dropped from close to or exactly 100% down to
> single digits. The reason is that all regions were completely shuffled
> and reassigned to random servers. Upon reading the (yet again
> non-trivial) assignment code, I found that a single missing server
> will trigger a full "recovery" of all servers, which includes a drop
> of all previous assignments and a random new assignment.
>
> This is just terrible! In addition, I also assumed that - at least the
> StochasticLoadBalancer - checks which node holds most of a region's
> data, locality-wise, and picks that server. But that is _not_ the
> case! It just spreads everything seemingly randomly across the
> servers.
>
> To me this is a big regression (or a straight bug), given that a single
> server out of, for example, hundreds could trigger this and destroy
> the locality completely. Running a major compaction is not an option,
> for many reasons.
>
> This used to work better, so why the regression?
>
> Lars
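
For reference, here is a minimal, self-contained sketch of the startup flow described above. This is not the actual AssignmentManager code; the class, method names, server lists, and assignment map are made up purely for illustration. It just shows the effect: one missing server flips failover to true, the clean-startup path that would retain previous placements is skipped, and every region is later handed out round-robin, which is where the locality goes.

    import java.util.List;
    import java.util.Map;

    public class StartupAssignmentSketch {

      // True if any previously known server is missing at startup; a single
      // missing server flips the whole startup into "failover" mode.
      static boolean isFailover(List<String> knownServers, List<String> liveServers) {
        return !liveServers.containsAll(knownServers);
      }

      static void assignOnStartup(List<String> knownServers, List<String> liveServers,
          List<String> allRegions, Map<String, String> previousAssignments) {
        if (!isFailover(knownServers, liveServers)) {
          // Clean startup: regions go back to their previous hosts, keeping locality.
          for (String region : allRegions) {
            System.out.println(region + " -> " + previousAssignments.get(region) + " (retained)");
          }
        } else {
          // Failover path: previous placements are ignored and every region is
          // handed out round-robin across the live servers -- this is what
          // shuffles the regions and drops locality after a restart.
          int i = 0;
          for (String region : allRegions) {
            System.out.println(region + " -> "
                + liveServers.get(i++ % liveServers.size()) + " (round-robin)");
          }
        }
      }
    }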