Re: support error->drop transition in helix

Terence Yim Mon, 11 Feb 2013 20:58:49 -0800

I proposed the FATAL state to Kishore before. Let me write it down again
for discussion.


1. An extra state, "FATAL", is introduced. It is a system state, just like
the existing ERROR state, so it doesn't need to be explicitly defined in
the state model.
2. Just like the current implementation, whenever there is any error during
a participant state transition, move the participant into the ERROR state
and keep it there.
3. Also just like the current implementation, when a given resource is
deleted, trigger a state transition from CURRENT_STATE -> DROPPED (going
through any necessary intermediate transitions based on the state model).
4. For participants that have current state = ERROR, trigger an
ERROR->DROPPED transition (the StateModel can have a default callback that
does nothing in this transition, but that's open to further discussion).
5. If and only if an exception is thrown during the ERROR->DROPPED
transition, move the participant to the FATAL state.
6. When a participant gets into the FATAL state, there is no way for it to
get out without human intervention, meaning a human needs to inspect and
reset it manually (or through some tools).

With this, there would be changes in the Controller, but no change in the
participant if there is nothing to handle specially during the
ERROR->DROPPED transition. Also, all error handling would be done through
state transitions, which gives the participant a more consistent way of
handling different scenarios. This also guarantees that every call is
synchronous and thread safe.
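
To make the six steps concrete, here is a minimal sketch in Java. It is not
Helix code: ReplicaState, TransitionHandler, ReplicaSketch and manualReset()
are hypothetical names made up for illustration, and the intermediate
transitions mentioned in step 3 are collapsed into a single callback.

```java
// Hypothetical sketch of the proposal; none of these types are Helix APIs.
enum ReplicaState { OFFLINE, ONLINE, ERROR, DROPPED, FATAL }

interface TransitionHandler {
    // Application callback; throwing signals a failed transition.
    void onTransition(ReplicaState from, ReplicaState to) throws Exception;
}

final class ReplicaSketch {
    private ReplicaState state = ReplicaState.OFFLINE;
    private final TransitionHandler handler;

    ReplicaSketch(TransitionHandler handler) {
        this.handler = handler;
    }

    synchronized ReplicaState state() {
        return state;
    }

    // synchronized so that transitions for one replica run one at a time.
    synchronized ReplicaState transition(ReplicaState target) {
        if (state == ReplicaState.FATAL) {
            return state; // step 6: FATAL is only left through manualReset()
        }
        ReplicaState from = state;
        try {
            handler.onTransition(from, target);
            state = target;
        } catch (Exception e) {
            // Step 5: a failed ERROR->DROPPED transition ends in FATAL.
            // Step 2: any other failure ends (or stays) in ERROR.
            boolean failedDropFromError =
                from == ReplicaState.ERROR && target == ReplicaState.DROPPED;
            state = failedDropFromError ? ReplicaState.FATAL : ReplicaState.ERROR;
        }
        return state;
    }

    // Steps 3 and 4: resource deletion triggers CURRENT_STATE -> DROPPED,
    // including when the current state is ERROR.
    synchronized ReplicaState drop() {
        return transition(ReplicaState.DROPPED);
    }

    // Step 6: stands in for the human (or tool) that inspects and resets.
    synchronized ReplicaState manualReset() {
        if (state == ReplicaState.FATAL) {
            state = ReplicaState.OFFLINE;
        }
        return state;
    }
}
```

With a handler that always throws, transition(ONLINE) leaves the replica in
ERROR, a following drop() moves it to FATAL, and only manualReset() brings
it back. With a handler that fails only on OFFLINE->ONLINE, the same drop()
ends in DROPPED, which is the case where the participant can be reused.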

Terence

On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:

> In my proposal FATAL would be a final state, manual intervention required.
>
> 1) In our use case, the problem arises when a regular transition (say
> offline->online) fails and goes to the ERROR state. If the resource then
> gets removed, the participant remains in the ERROR state, so we can't
> reuse it: in order to reuse it we need to transition to DROPPED first.
> 2) The thing is, in our use case the drop comes from an API call which is
> not synchronized with the cluster management code that could issue the
> reset. Also, if we reset it, wouldn't the controller push the transitions
> to try to reach the ideal state again (likely triggering the same
> issue that led to ERROR)?
>
> Thanks
> Santi
>
>
> On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
>
> > If we are going to add a new FATAL state, we might potentially add FATAL
> > to all state models and all applications might have to implement
> > ERROR->FATAL and FATAL->initial_state transitions.
> >
> > On the other hand, I have a couple of questions:
> > 1) Why is the ERROR state inevitable in your use case?
> > 2) If a partition goes to the ERROR state, could we reset it, so that only
> > error partitions get an ERROR->initial_state transition, and then drop it?
> > If no error happens during ERROR->initial_state, the error is recoverable
> > and the resource will be dropped. Otherwise, if something goes wrong with
> > ERROR->initial_state, the participant remains in the ERROR state, the drop
> > fails, and the resource is not reusable?
> >
> > Thanks,
> > Jason
> >
> > On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
> >
> > > For our use case that's somewhat problematic. It's still better than the
> > > current inability to go from error to dropped, but the problem is that
> > > now, if something goes wrong when dropping, there's no way to know that
> > > from the participant states. And that's actually the only unrecoverable
> > > situation for our use case. Basically it means that the participant
> > > cannot be reused for another purpose. An alternative solution would be
> > > to have a FATAL state that is reached when a failure occurs when
> > > transitioning out of the ERROR state.
> > >
> > > Cheers,
> > > Santi
> > >
> > >
> > > On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am going to add support for the error->drop transition in Helix. The
> > > > basic idea is to remove the DROPPED state from the state model; instead
> > > > we add a drop() (or cleanup()) abstract method to StateModel.
> > > > Applications need to implement this abstract method to take care of the
> > > > drop logic. This requires no change on the controller side. On the
> > > > participant side, when the participant receives a state-transition
> > > > message with ToState=DROPPED, it will invoke the drop() method in the
> > > > state model. When drop() gets executed, the partition will be removed
> > > > from the current state regardless of any errors/exceptions during the
> > > > execution of drop(). This will prevent an infinite loop of calling
> > > > drop() in case of an error/exception in the execution of drop(). The
> > > > advantage of this design is that we can remove the DROPPED state totally
> > > > from all state model definitions, which keeps the state model simple.
> > > > The disadvantage is that in drop() the application needs to apply
> > > > different drop logic based on the current state (e.g. MASTER, SLAVE, or
> > > > ERROR, which will be the FromState in the message). Any suggestions?
> > > >
> > > > Thanks,
> > > >
> > > > Jason
> > > >
> > >
> >
>
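
For comparison, the drop() proposal quoted at the bottom of the thread can
be sketched roughly as follows. This is an illustration under assumptions,
not the Helix implementation: DroppableModelSketch and DropHandlerSketch are
hypothetical names, and the current state is modelled as a plain in-memory
map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical application-side model; not the real Helix StateModel class.
abstract class DroppableModelSketch {
    // fromState is the FromState carried by the message
    // (e.g. MASTER, SLAVE, or ERROR), so the application can branch on it.
    abstract void drop(String partition, String fromState) throws Exception;
}

// Hypothetical participant-side handling of a message with ToState=DROPPED.
final class DropHandlerSketch {
    private final Map<String, String> currentState = new ConcurrentHashMap<>();
    private final DroppableModelSketch model;

    DropHandlerSketch(DroppableModelSketch model) {
        this.model = model;
    }

    void setState(String partition, String state) {
        currentState.put(partition, state);
    }

    boolean hasPartition(String partition) {
        return currentState.containsKey(partition);
    }

    // Returns true if drop() completed without throwing. The partition is
    // removed from the current state either way, so drop() is never retried.
    boolean handleDrop(String partition) {
        String fromState = currentState.get(partition);
        try {
            model.drop(partition, fromState);
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            currentState.remove(partition);
        }
    }
}
```

Because the partition disappears from the current state even when drop()
throws, a failed cleanup leaves no trace in the participant states. In this
sketch the boolean return value is the only signal, and it stays local to
the participant, which is the gap the FATAL state is meant to close.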
