I created a ticket for that issue: https://issues.apache.org/jira/browse/HELIX-276. Now I'm not sure whether to qualify it as an improvement or a defect; it looks like it is both!
Thanks,
Matthieu

On Oct 22, 2013, at 08:42, Kanak Biscuitwala <[email protected]> wrote:

> I need to verify this, but I suspect a few things are going on, having just
> taken a quick look at the code:
>
> 1) I wrote some code a while back to rearrange node preference order if what
> was calculated did not sufficiently balance the number of replicas in state s
> across nodes. I suspect this code is causing the problem.
>
> 2) The algorithm's initial assignment ignores preferred placement altogether,
> and just places everything uniformly by a hash. This is because the algorithm
> treats all replicas as orphans on the first run. Subsequent rebalances
> improve the situation, as the algorithm never removes preferred replicas from
> their nodes. I think this should probably be changed so that the preferred
> replicas are placed first, especially if len(liveNodes) == len(allNodes).
>
> 3) If nodes are configured and launched at the same time, the preferred
> placement is not necessarily static, though the hashing scheme is probably
> flexible enough to allow for this.
>
> I'll investigate in the morning.
>
> Date: Mon, 21 Oct 2013 23:21:21 -0700
> Subject: Re: Favoring some transitions when rebalancing in full_auto mode
> From: [email protected]
> To: [email protected]
>
> Kanak, I thought this should be the default behavior. When the list of
> participants is generated for each partition, it comprises:
>
> - preferred participants, i.e. if all nodes were up, where this partition
>   would reside
> - non-preferred participants, i.e. when one of the preferred participants is
>   down, we select a non-preferred participant
>
> If the list we generate ensures that preferred participants are put ahead of
> non-preferred ones, the behavior Matthieu is expecting should happen by
> default, without additional changes.
>
> Am I missing something?
>
> On Fri, Oct 18, 2013 at 11:03 AM, Matthieu Morel <[email protected]> wrote:
> Thanks, Kanak, for the explanation.
>
> It will definitely be very useful to have a few more knobs for tuning the
> rebalancing algorithm. I'll post a ticket soon.
>
> On Oct 18, 2013, at 19:16, Kanak Biscuitwala <[email protected]> wrote:
>
> Currently, the FULL_AUTO algorithm does not take this into account. The
> algorithm optimizes for minimal movement and even distribution of states.
> What I see here is that there is a tie in terms of even distribution, and
> current presence of the replica would be a good tiebreaker. I can see why
> this would be useful, though. Please create an issue and we'll pick it up
> when we're able.
>
> On a somewhat related note, I noticed in your example code that you configure
> and launch your nodes at the same time. The FULL_AUTO rebalancer performs
> better when you configure your nodes ahead of time (even if you specify more
> than you will actually ever start). This is, of course, optional.
>
> Thanks for the advice. Currently we expect Helix to recompute states and
> partitions as nodes join the cluster, though indeed it's probably more
> efficient to compute some of the schedule ahead of time. I'll see how to
> apply your suggestion.
>
> Best regards,
> Matthieu
>
> Thanks,
> Kanak
>
> From: Matthieu Morel <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, October 18, 2013 10:03 AM
> To: "[email protected]" <[email protected]>
> Subject: Favoring some transitions when rebalancing in full_auto mode
>
> Hi,
>
> In FULL_AUTO mode, Helix computes both partitioning and states.
>
> In a leader-replica model, I observed that when rebalancing due to a failure
> of the leader node, Helix does not promote an existing replica to leader, but
> instead assigns a new leader (i.e. going from offline to replica to leader).
>
> For quick failover, we need to have the replica promoted to leader instead.
> Is there a way to do so in FULL_AUTO mode?
>
> Apparently with SEMI_AUTO that would be possible, but it would imply that we
> control the partitioning, and we'd prefer Helix to control that as well.
>
> I tried to play with the priorities in the definition of the state model,
> with no luck so far.
>
> (See below for an example of how rebalancing currently takes place.)
>
> Thanks!
>
> Matthieu
>
>
> Here we have a deployment with 3 nodes, 3 partitions and 2 desired states,
> Leader and Replica (plus Offline).
>
> // initial states
>
> "mapFields":{
>   "MY_RESOURCE_0":{
>     "instance_1":"REPLICA"
>     ,"instance_2":"LEADER"
>   }
>   ,"MY_RESOURCE_1":{
>     "instance_0":"REPLICA"
>     ,"instance_1":"LEADER"
>   }
>   ,"MY_RESOURCE_2":{
>     "instance_0":"LEADER"
>     ,"instance_2":"REPLICA"  // instance_2 is replica
>   }
> }
>
> // instance_0 dies
>
> "mapFields":{
>   "MY_RESOURCE_0":{
>     "instance_1":"REPLICA"
>     ,"instance_2":"LEADER"
>   }
>   ,"MY_RESOURCE_1":{
>     "instance_1":"LEADER"
>     ,"instance_2":"REPLICA"
>   }
>   ,"MY_RESOURCE_2":{
>     "instance_1":"LEADER"    // Helix preferred to assign leadership of
>                              // resource 2 to instance_1 rather than
>                              // promoting instance_2 from replica to leader
>     ,"instance_2":"REPLICA"  // instance_2 is still replica for resource 2
>   }
> }
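The tiebreaker discussed in the thread (using current presence of a replica to break ties in even distribution) can be sketched as follows. This is a simplified illustration only, not Helix's actual rebalancer; the function `choose_leader` and its inputs are hypothetical names invented for this sketch.

```python
# Simplified model of the tiebreaker discussed above. NOT Helix's actual
# algorithm: it only illustrates how preferring a node that already holds a
# replica yields a single REPLICA -> LEADER promotion on failover, instead
# of assigning a fresh leader (OFFLINE -> REPLICA -> LEADER) elsewhere.

def choose_leader(candidates, current_map):
    """Pick a leader among live candidate nodes.

    candidates:  live nodes that could host the leader, assumed equally
                 loaded (i.e. the even-distribution criterion is a tie)
    current_map: node -> current state for this partition
    """
    # Tiebreaker: prefer a node already holding a replica of the partition.
    holders = [n for n in candidates if current_map.get(n) == "REPLICA"]
    if holders:
        return holders[0]
    # Otherwise fall back to any candidate (full bring-up path).
    return candidates[0]

# MY_RESOURCE_2 from the example above: instance_0 (the leader) dies.
current = {"instance_2": "REPLICA"}   # surviving assignment
live = ["instance_1", "instance_2"]   # both tie on load

print(choose_leader(live, current))   # prints "instance_2": it is promoted
```

With this tiebreaker, the second `mapFields` snippet above would instead show `"instance_2":"LEADER"` for MY_RESOURCE_2, which is the fast-failover behavior Matthieu asks for.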
