Hi,

I have had this happen on multiple clusters recently: after a
restart, locality dropped from close to (or exactly) 100% down to
single digits. The reason is that all regions were completely shuffled
and reassigned to random servers. Upon reading the (yet again
non-trivial) assignment code, I found that a single missing server
will trigger a full "recovery" across all servers, which drops all
previous assignments and assigns regions randomly anew.
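What one would expect instead is retained assignment: keep every region on its previous server when that server is still alive, and only find new homes for the regions that were on the missing server. A minimal sketch of that expected behavior (all class and variable names here are hypothetical, not HBase's actual API):

```java
import java.util.*;

public class RetainAssignments {
    // Hypothetical sketch: keep each region on its previous server when that
    // server is still alive; only regions from missing servers get reassigned.
    static Map<String, String> retain(Map<String, String> previous,
                                      Set<String> liveServers) {
        Map<String, String> plan = new HashMap<>();
        List<String> live = new ArrayList<>(liveServers);
        Random rnd = new Random();
        for (Map.Entry<String, String> e : previous.entrySet()) {
            if (liveServers.contains(e.getValue())) {
                // server still alive: retain the assignment, locality preserved
                plan.put(e.getKey(), e.getValue());
            } else {
                // only orphaned regions need a new (here: random) server
                plan.put(e.getKey(), live.get(rnd.nextInt(live.size())));
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> prev = new HashMap<>();
        prev.put("region1", "serverA");
        prev.put("region2", "serverB");
        prev.put("region3", "serverC"); // serverC is gone after the restart
        Set<String> live = new HashSet<>(Arrays.asList("serverA", "serverB"));
        Map<String, String> plan = retain(prev, live);
        System.out.println(plan.get("region1")); // serverA (retained)
        System.out.println(plan.get("region2")); // serverB (retained)
    }
}
```

With this scheme, losing one server out of hundreds would disturb only that server's regions instead of shuffling everything.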

This is just terrible! In addition, I had assumed that at least the
StochasticLoadBalancer would check which node holds most of a
region's data, locality-wise, and pick that server. But that is _not_
the case! It just spreads everything seemingly at random across the
servers.
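The behavior I expected is simple enough to sketch: for each region, look at what fraction of its HDFS blocks each live server holds locally, and prefer the server with the highest fraction. This is an illustrative sketch only (the names and the input map are made up, not HBase's actual data structures):

```java
import java.util.*;

public class LocalityPick {
    // Hypothetical sketch: given the fraction of a region's HDFS blocks
    // that each live server holds locally, pick the server with the
    // highest fraction.
    static String bestServer(Map<String, Double> localityByServer) {
        String best = null;
        double bestFrac = -1.0;
        for (Map.Entry<String, Double> e : localityByServer.entrySet()) {
            if (e.getValue() > bestFrac) {
                bestFrac = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> locality = new HashMap<>();
        locality.put("serverA", 0.05);
        locality.put("serverB", 0.90); // holds most of the region's blocks
        locality.put("serverC", 0.05);
        System.out.println(bestServer(locality)); // serverB
    }
}
```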

To me this is a big regression (or an outright bug), given that a
single server out of, for example, hundreds could trigger this and
destroy locality completely. Running a major compaction to restore
locality is not a viable workaround, for many reasons.

This used to work better; why the regression?

Lars
