Kishore,

Thanks for the helpful pointers, as usual. You are correct that the delayed
transition will also delay the normal bootstrap of a node, which is
unacceptable. Thanks for pointing this out.
The idea I had in mind was to extend the notion of the "REBALANCE_TIMER"
associated with each resource inside Helix to also support multiple timers.
Each timer would be associated with a node, and would rebalance the
partitions hosted on it to other nodes. Supporting this inside Helix would
be too intrusive a change. So, I could implement this outside of Helix. I
would need to implement something similar to ZKHelixAdmin.rebalance(), but
the rebalance() would be a targeted rebalance that only rebalances
partitions hosted on a particular node.

thanks,
- Puneet

On Sun, Mar 3, 2013 at 8:47 PM, kishore g <[email protected]> wrote:
> Hi Puneet,
>
> Your explanation is correct.
>
> Regarding the race condition, yes, it's possible that N1 finished its
> transition before receiving the cancellation. But then Helix will send an
> opposite transition, SLAVE to OFFLINE, to N1. That's the best we can do.
>
> Yes, the support for conflicting transitions needs to be built. Currently
> we only have the ability to manually cancel a transition. We need the
> support for canceling conflicting transitions. Let's file a JIRA and flesh
> out the design.
>
> By the way, let me know about the other ideas you had. It's good to have
> multiple options and discuss the pros and cons. For example, the problem
> with the delayed transition is that it might add some delay during cluster
> start-up.
>
> thanks,
> Kishore G
>
> On Sun, Mar 3, 2013 at 8:02 PM, Puneet Zaroo <[email protected]> wrote:
>>
>> Kishore,
>>
>> Over the weekend I had some other thoughts on how to implement this.
>> But thinking some more about it, the timed transition idea looks like
>> the one that requires the least intrusive changes to Helix. But please
>> let me step through it slowly to understand it better.
>>
>> Let's say node N0 goes down and the partitions on it are moved to N1.
>> Let's say N1 receives the callback for the OFFLINE->SLAVE transition...
>> but this transition has a configurable delay in it, and so does not
>> complete immediately.
>>
>> In the meantime, node N0 comes back up, so the idealState is
>> recalculated in the CustomCodeInvoker to move the partitions of N0
>> back to it. This will make Helix cancel all other conflicting
>> transitions. Does this cancellation get propagated to N1 (which is
>> inside the OFFLINE->SLAVE transition)? This seems a bit racy. What if
>> N1 had finished its transition just before receiving the cancellation?
>>
>> And if I understand correctly, the support for cancelling conflicting
>> transitions needs to be built.
>>
>> Thanks,
>> - Puneet
>>
>> On Fri, Mar 1, 2013 at 7:33 AM, kishore g <[email protected]> wrote:
>> > Hi Puneet,
>> >
>> > Your understanding of AUTO mode is correct: no partitions will ever be
>> > moved by the controller to a new node. And if a node comes back up, it
>> > will still host the partitions it had before going down.
>> >
>> > This is how it works:
>> > in AUTO_REBALANCE, Helix has full control, so it will create new
>> > replicas and assign states as needed.
>> >
>> > In AUTO mode, it will not create new replicas unless the idealstate is
>> > changed externally (this can happen when you add new boxes).
>> >
>> >>> Or will the partition move only happen when some constraints are
>> >>> being violated. E.g. if the minimum number of replicas specified is
>> >>> "2", then a partition will be assigned to a new node if there are
>> >>> just 2 replicas in the system and one of the nodes goes down.
>> >
>> > In AUTO mode, Helix will try to satisfy the constraints with the
>> > existing replicas. So if you had assigned 2 replicas but 1 is down, it
>> > will see what's the best it can do with that 1 replica. That's where
>> > the priority of states comes into the picture: you specify that master
>> > is more important than slave, so it will make that replica a master.
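To make the state-priority behaviour described above concrete, here is a minimal, Helix-free sketch of the idea: with fewer live replicas than configured, the highest-priority state (MASTER) is handed out first. The state names and the assignment logic are illustrative only, not Helix's bundled MasterSlave state model.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StatePriorityExample {
    // With fewer live replicas than configured, hand out the most
    // important state first (one MASTER), then SLAVE to the rest --
    // mirroring "master is more important than slave" above.
    static Map<String, String> assignStates(List<String> liveReplicas) {
        Map<String, String> assignment = new LinkedHashMap<>();
        for (int i = 0; i < liveReplicas.size(); i++) {
            assignment.put(liveReplicas.get(i), i == 0 ? "MASTER" : "SLAVE");
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 2 replicas were configured, but N0 is down: the lone surviving
        // replica on N1 gets promoted to MASTER.
        System.out.println(assignStates(List.of("N1")));
        System.out.println(assignStates(List.of("N1", "N2")));
    }
}
```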
>> >
>> > In AUTO_REBALANCE it would create that replica on another node. This
>> > mode is generally suited for stateless systems, where moving a
>> > partition might simply mean moving processing and not data.
>> >
>> > Thanks,
>> > Kishore G
>> >
>> > On Fri, Mar 1, 2013 at 6:33 AM, Puneet Zaroo <[email protected]>
>> > wrote:
>> >>
>> >> Kishore,
>> >> Thanks for the prompt reply once again.
>> >>
>> >> On Tue, Feb 26, 2013 at 3:39 PM, kishore g <[email protected]> wrote:
>> >> > Hi Puneet,
>> >> >
>> >> > I was about to reply to your previous email, but I think it's
>> >> > better to have a separate thread for each requirement.
>> >> >
>> >>
>> >> I agree.
>> >>
>> >> > We already have the ability to trigger a rebalance occasionally
>> >> > (your point 3). Take a look at the timer tasks in the controller.
>> >> > But I don't think that will be sufficient in your case.
>> >> >
>> >> > There is another way to solve this which is probably more elegant
>> >> > and easier to reason about. Basically, we can introduce a notion of
>> >> > a timed transition (we can discuss how to implement this). What
>> >> > this means is that when a node fails, Helix can request another
>> >> > node to create the replica, but with additional configuration that
>> >> > it should be scheduled after a timeout X; we already have a notion
>> >> > of cancellable transitions built in. So if the old node comes up
>> >> > within that time, Helix can cancel the existing transition and put
>> >> > the old node back into the SLAVE state.
>> >> >
>> >>
>> >> The timed transition idea does look promising. I will have to think a
>> >> bit more about it.
>> >> I had a few more mundane questions.
>> >> In the "AUTO" mode (as opposed to the AUTO_REBALANCE mode), the DDS
>> >> is responsible for object placement. But how does the DDS implement
>> >> object placement?
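A concrete (if naive) sketch of the targeted rebalance idea from the top of this thread: recompute placement only for the partitions hosted on the failed node and leave everything else untouched. The node/partition names are made up, and a real version would read and write the ideal state through something like ZKHelixAdmin rather than plain maps; the sketch also assumes there are enough live nodes to rehost each affected partition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TargetedRebalanceSketch {
    // Move only the partitions hosted on deadNode, spreading them
    // round-robin across the surviving nodes. Unlike a full rebalance,
    // every other assignment is left exactly as it was.
    static Map<String, List<String>> rebalanceAwayFrom(
            Map<String, List<String>> idealState,
            String deadNode, List<String> liveNodes) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, List<String>> e : idealState.entrySet()) {
            List<String> hosts = new ArrayList<>(e.getValue());
            for (int i = 0; i < hosts.size(); i++) {
                if (hosts.get(i).equals(deadNode)) {
                    // Pick a live node not already hosting this partition
                    // (assumes such a node exists).
                    String candidate;
                    do {
                        candidate = liveNodes.get(next++ % liveNodes.size());
                    } while (hosts.contains(candidate));
                    hosts.set(i, candidate);
                }
            }
            result.put(e.getKey(), hosts);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> ideal = new LinkedHashMap<>();
        ideal.put("p0", List.of("N0", "N1"));
        ideal.put("p1", List.of("N1", "N2"));
        // Only p0's replica on the dead node N0 moves; p1 is untouched.
        System.out.println(rebalanceAwayFrom(ideal, "N0", List.of("N1", "N2")));
    }
}
```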
>> >>
>> >> The StateModelDefinition.Builder class allows one to set the
>> >> "upperBound" and the "dynamicUpperBound". But how does one specify a
>> >> lower bound for a particular state?
>> >>
>> >> Can one safely say that in the "AUTO" mode no partitions will ever be
>> >> moved by the controller to a new node, except when the DDS so
>> >> desires?
>> >> If a node were to go down and come back up, it will still host the
>> >> partitions that it had before going down.
>> >> Or will the partition move only happen when some constraints are
>> >> being violated? E.g. if the minimum number of replicas specified is
>> >> "2", then a partition will be assigned to a new node if there are
>> >> just 2 replicas in the system and one of the nodes goes down.
>> >>
>> >> Thanks again for your replies and for open-sourcing a great tool.
>> >>
>> >> > This design does not require any additional work to handle failures
>> >> > of controllers or participants, or any modification to the state
>> >> > model. It's basically adding the notion of a timed transition that
>> >> > can be cancelled if needed.
>> >> >
>> >> > What do you think about the solution? Does it make sense?
>> >> >
>> >> > Regarding implementation, this solution can be implemented in the
>> >> > current state by simply adding an additional sleep in the
>> >> > transition (OFFLINE to SLAVE), and in the custom code invoker you
>> >> > can first send a cancel message to the existing transition and then
>> >> > set the ideal state. But it's possible for Helix to automatically
>> >> > cancel it. We need additional logic in Helix such that if there is
>> >> > a pending transition and we compute another transition that is the
>> >> > opposite of it, we can automatically detect that it's cancellable
>> >> > and cancel the existing transition.
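The sleep-plus-cancel scheme described above can be mimicked outside Helix with a scheduled task: queue the replica-creation work (the stand-in for OFFLINE to SLAVE) with a delay, and cancel it if the old node returns within the window. A rough sketch only; the class and method names are mine, and the delay value is arbitrary.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class DelayedTransitionSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pending;

    // Schedule the replica-creation work to run only after delayMillis,
    // so that it can still be cancelled before it starts.
    synchronized void scheduleOfflineToSlave(Runnable work, long delayMillis) {
        pending = scheduler.schedule(work, delayMillis, TimeUnit.MILLISECONDS);
    }

    // Called when the old node comes back within the window: cancel the
    // not-yet-started transition instead of issuing the opposite
    // SLAVE-to-OFFLINE transition later.
    synchronized boolean cancelIfPending() {
        return pending != null && pending.cancel(false);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        DelayedTransitionSketch sketch = new DelayedTransitionSketch();
        sketch.scheduleOfflineToSlave(
                () -> System.out.println("creating replica on N1"), 5_000);
        // Old node N0 returns before the 5s delay expires, so the
        // transition is cancelled and the replica is never created.
        System.out.println("cancelled: " + sketch.cancelIfPending());
        sketch.shutdown();
    }
}
```

Note that the race Puneet raises still exists at the boundary: if the work has already started, cancel(false) returns false and the opposite transition is the only remedy, which matches Kishore's "best we can do" answer.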
>> >> > That will make it more generic, and we can then simply have the
>> >> > transition delay set as a configuration.
>> >> >
>> >> > thanks,
>> >> > Kishore G
>> >> >
>> >> > On Tue, Feb 26, 2013 at 12:12 PM, Puneet Zaroo
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I wanted to know how to implement a specific state machine
>> >> >> requirement in Helix.
>> >> >> Let's say a partition is in state S2.
>> >> >>
>> >> >> 1. When the instance hosting it goes down, the partition moves to
>> >> >> state S3 (but stays on the same instance).
>> >> >> 2. If the instance comes back up before a timeout expires, the
>> >> >> partition moves to state S1 (stays on the same instance).
>> >> >> 3. If the instance does not come back up before the timeout
>> >> >> expires, the partition moves to state S0 (the initial state, on a
>> >> >> different instance picked by the controller).
>> >> >>
>> >> >> I have a few questions.
>> >> >>
>> >> >> 1. I believe that in order to implement Requirement 1, I have to
>> >> >> use the CUSTOM rebalancing feature (as otherwise the partitions
>> >> >> will get assigned to a new node).
>> >> >> The wiki page says the following about the CUSTOM mode:
>> >> >>
>> >> >> "Applications will have to implement an interface that Helix will
>> >> >> invoke when the cluster state changes. Within this callback, the
>> >> >> application can recompute the partition assignment mapping"
>> >> >>
>> >> >> Which interface does one have to implement? I am assuming the
>> >> >> callbacks are triggered inside the controller.
>> >> >>
>> >> >> 2. The transition from S2 -> S3 should not issue a callback on the
>> >> >> participant (instance) holding that partition. This is because the
>> >> >> participant is unavailable and so cannot execute the callback. Is
>> >> >> this doable?
>> >> >>
>> >> >> 3.
>> >> >> One way the time-out (Requirement 3) could be implemented is to
>> >> >> trigger the IdealState calculation occasionally, after a time-out,
>> >> >> and not only on liveness changes. Does that sound doable?
>> >> >>
>> >> >> thanks,
>> >> >> - Puneet
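For reference, the requirement in the first mail of this thread boils down to a small per-partition state machine driven by liveness events plus a periodic timeout check. A self-contained sketch follows; the state names S0-S3 are from the mail above, while the timestamp-based timeout handling is just one possible implementation.

```java
public class PartitionTimeoutFsm {
    enum State { S0, S1, S2, S3 }

    private State state = State.S2;   // the partition starts in S2
    private long downSince = -1;      // millis timestamp of the down event
    private final long timeoutMillis;

    PartitionTimeoutFsm(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Requirement 1: the hosting instance goes down -> S3, same instance.
    void onInstanceDown(long nowMillis) {
        if (state == State.S2) {
            state = State.S3;
            downSince = nowMillis;
        }
    }

    // Requirement 2: the instance returns before the timeout fired -> S1.
    void onInstanceUp(long nowMillis) {
        if (state == State.S3) {
            state = State.S1;
        }
    }

    // Requirement 3: periodic check (e.g. a timer-driven IdealState
    // recalculation): past the deadline, restart at S0 elsewhere.
    void tick(long nowMillis) {
        if (state == State.S3 && nowMillis - downSince >= timeoutMillis) {
            state = State.S0;
        }
    }

    State state() { return state; }

    public static void main(String[] args) {
        PartitionTimeoutFsm fsm = new PartitionTimeoutFsm(10_000);
        fsm.onInstanceDown(0);
        fsm.onInstanceUp(5_000);          // back within the timeout
        System.out.println(fsm.state());  // S1

        PartitionTimeoutFsm slow = new PartitionTimeoutFsm(10_000);
        slow.onInstanceDown(0);
        slow.tick(15_000);                // timeout expired first
        System.out.println(slow.state()); // S0
    }
}
```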
