Hi Puneet,

I was about to reply to your previous email, but I think it's better to
have a separate thread for each requirement.

We already have the ability to trigger a rebalance periodically (your
question 3); take a look at the timer tasks in the controller. But I
don't think that will be sufficient in your case.
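
If you do want to experiment with a periodic trigger, a minimal sketch
along these lines would also work outside the controller (it only uses
the public HelixAdmin.rebalance call; the ZK address, cluster/resource
names, replica count and interval below are just placeholders):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;

public class PeriodicRebalanceTrigger {
  public static void main(String[] args) {
    // Placeholder values -- substitute your own cluster setup.
    final String clusterName = "MYCLUSTER";
    final String resourceName = "MYRESOURCE";
    final int replicas = 3;
    final HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    // Recompute the ideal state every 60 seconds, not only on liveness
    // changes.
    timer.scheduleAtFixedRate(new Runnable() {
      public void run() {
        admin.rebalance(clusterName, resourceName, replicas);
      }
    }, 60, 60, TimeUnit.SECONDS);
  }
}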

There is another way to solve this which is probably easier to reason
about and more elegant. Basically, we can introduce the notion of a timed
transition (we can discuss how to implement this). What this means is
that when a node fails, Helix can request another node to create the
replica, but with additional configuration saying the transition should
only be executed after a timeout of X. We already have a notion of
cancellable transitions built in, so if the old node comes back up within
that time, Helix can cancel the existing transition and put the old node
back into the SLAVE state.

This design does not require any additional work to handle failures of
controllers or participants, nor any modification to the state model.
It's basically adding the notion of a timed transition that can be
cancelled if needed.

What do you think about this solution? Does it make sense?

Regarding implementation, this can be done with the current code base by
simply adding a sleep in the OFFLINE-to-SLAVE transition; in the custom
code invoker you can first send a cancel message for the existing
transition and then set the ideal state. But it's also possible for Helix
to cancel it automatically: we would need additional logic in Helix so
that if there is a pending transition and we compute another transition
that is the opposite of it, we can automatically detect that it is
cancellable and cancel the existing transition. That would make the
solution more generic, and we could then simply expose the transition
delay as a configuration setting.
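
To make the first option concrete, here is a rough sketch of what the
participant-side state model could look like (the class name is made up,
the delay is hard-coded instead of being read from configuration, and the
cancel itself would still be sent by the custom code invoker as described
above):

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.Transition;

// Hypothetical MASTER/SLAVE/OFFLINE state model that delays bringing up a
// replica on a new node, giving the old node a chance to come back first.
public class DelayedBringupStateModel extends StateModel {
  // In a real setup this would come from configuration.
  private static final long TRANSITION_DELAY_MS = 60 * 1000L;

  @Transition(from = "OFFLINE", to = "SLAVE")
  public void onBecomeSlaveFromOffline(Message message,
      NotificationContext context) throws InterruptedException {
    // Wait before actually creating the replica. If the old node comes
    // back within this window, the custom code invoker can cancel this
    // pending transition and set the ideal state so the old node goes
    // back to SLAVE instead.
    Thread.sleep(TRANSITION_DELAY_MS);
    // ... create/bootstrap the replica here ...
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void onBecomeMasterFromSlave(Message message,
      NotificationContext context) {
    // ... promote the replica ...
  }

  @Transition(from = "SLAVE", to = "OFFLINE")
  public void onBecomeOfflineFromSlave(Message message,
      NotificationContext context) {
    // ... drop the replica ...
  }
}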

thanks,
Kishore G


On Tue, Feb 26, 2013 at 12:12 PM, Puneet Zaroo <[email protected]> wrote:

> Hi,
>
> I wanted to know how to implement a specific state machine requirement in
> Helix.
> Let's say a partition is in the state S2.
>
> 1. When the instance hosting it goes down, the partition moves to state
> S3 (but stays on the same instance).
> 2. If the instance comes back up before a timeout expires, the
> partition moves to state S1 (stays on the same instance).
> 3. If the instance does not come back up before the timeout expiry,
> the partition moves to state S0 (the initial state, on a different
> instance picked up by the controller).
>
> I have a few questions.
>
> 1. I believe in order to implement Requirement 1, I have to use the
> CUSTOM rebalancing feature (as otherwise the partitions will get
> assigned to a new node).
> The wiki page says the following about the CUSTOM mode.
>
> "Applications will have to implement an interface that Helix will
> invoke when the cluster state changes. Within this callback, the
> application can recompute the partition assignment mapping"
>
> Which interface does one have to implement? I am assuming the
> callbacks are triggered inside the controller.
>
>  2. The transition from S2 -> S3 should not issue a callback on the
> participant (instance) holding that partition. This is because the
> participant is unavailable and so cannot execute the callback. Is this
> doable?
>
> 3. One way the time-out (Requirement 3) can be implemented is to
> occasionally trigger IdealState calculation after a time-out and not
> only on liveness changes. Does that sound doable?
>
> thanks,
> - Puneet
>
