Re: support error->drop transition in helix

Santiago Perez Fri, 22 Mar 2013 12:06:07 -0700

Sounds good, couple of questions though:

1) What will happen when transitioning from a user-defined state to
DROPPED? Same as today?
2) Will there be a way in onError() to know which transition was taking
place, or is that up to the implementation? Are there any possible
directions to be given in the onError() callback?
3) What will the behavior be if any of these methods (other than drop())
fails? Simply ignored?


Thanks,
Santi


On Fri, Mar 22, 2013 at 3:29 PM, Zhen Zhang <[email protected]> wrote:

> Hi, I am fine with the FATAL state, but I think we should clearly separate
> Helix-defined states from user-defined states. Helix-defined states (i.e.
> ERROR, DROPPED, FATAL) need not be defined in the state model, and
> state-transition logic involving Helix-defined states should be common to
> all state models. In addition, Helix should provide default implementations
> for transitions involving Helix-defined states, so applications that don't
> care about them need not implement these transitions. Here is what I am
> thinking of:
>
> - Helix will invoke StateModel.onError() if the current state is any
> user-defined state and an error occurs in the transition.
>
> - Helix will invoke StateModel.drop() if the current state is ERROR and the
> target state is DROPPED. If drop() succeeds, ERROR will transition to the
> initial state and then to DROPPED; otherwise it goes to the FATAL state.
>
> - Helix will invoke StateModel.reset() if the current state is FATAL and we
> issue a reset command. If reset() succeeds, FATAL will transition to the
> initial state; otherwise it remains in the FATAL state. Also, reset() should
> be invoked only by admin commands, so if reset() fails, we don't call it
> indefinitely.
>
> Thanks,
> Jason
>
>
> On Fri, Mar 22, 2013 at 5:36 AM, Santiago Perez <[email protected]> wrote:
>
> > I personally prefer the FATAL state approach. What do you think Jason?
> >
> >
> > On Fri, Mar 22, 2013 at 4:50 AM, kishore g <[email protected]> wrote:
> >
> > > Hi Terence/Jason/Santi,
> > >
> > > Did we come to a conclusion on this? Terence's proposal looks good to
> > > me. If adding a FATAL state is more invasive, I suggest simply
> > > disabling the partition on that node and setting a reason for the
> > > disabling, for auditing/diagnosis. The advantage of this is that if the
> > > underlying error is rectified, one can enable the partition and the
> > > ERROR->DROP transition will be invoked. Disabling ensures that even if
> > > the node restarts, it will not host that partition again.
> > >
> > > thanks,
> > > Kishore G
> > >
> > >
> > > On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]> wrote:
> > >
> > > > I proposed the FATAL state to Kishore before. Let me write it down
> > > > again for discussion.
> > > >
> > > > 1. An extra state, "FATAL", is introduced. It is a system state, just
> > > > like the existing ERROR state, which doesn't need to be explicitly
> > > > defined in the state model.
> > > > 2. Just like the current implementation, whenever there is any error
> > > > during a participant state transition, transition the participant
> > > > into ERROR state and stay there.
> > > > 3. Also just like the current implementation, when a given resource is
> > > > deleted, trigger a state transition from CURRENT_STATE -> DROPPED
> > > > (going through the necessary state transitions based on the state
> > > > model).
> > > > 4. For participants that have current state = ERROR, trigger the
> > > > ERROR->DROPPED transition (there can be a default callback in the
> > > > StateModel that does nothing in this transition, but that's open to
> > > > further discussion).
> > > > 5. If and only if an exception is thrown during the ERROR->DROPPED
> > > > transition, transition the participant to FATAL state.
> > > > 6. When a participant gets into FATAL state, there is no way for it to
> > > > get out of it without human intervention, meaning a human needs to
> > > > inspect and reset it manually (or through some tools).
> > > >
> > > > With this, there would be changes in the Controller, but no change in
> > > > the participant if there is nothing to handle specially during the
> > > > ERROR->DROPPED transition. Also, all error handling would be done
> > > > through state transitions, which gives the participant a more
> > > > consistent way of handling different scenarios. This also guarantees
> > > > that every call is sync and thread safe.
> > > >
> > > > Terence
> > > >
> > > > On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:
> > > >
> > > > > In my proposal FATAL would be a final state, manual intervention
> > > > > required.
> > > > >
> > > > > 1) In our use case, the problem is that when a regular transition
> > > > > (say offline->online) fails and goes to the ERROR state, and the
> > > > > resource then gets removed, the participant remains in the ERROR
> > > > > state, so we can't reuse it: in order to reuse it we need to
> > > > > transition to DROPPED first.
> > > > > 2) The thing is, in our use case the drop comes from an API call
> > > > > which is not synchronized with the cluster management code which
> > > > > could issue the reset. Also, if we reset it, wouldn't the controller
> > > > > push the transitions to try to reach the ideal state again (likely
> > > > > triggering the same issue that led to ERROR)?
> > > > >
> > > > > Thanks
> > > > > Santi
> > > > >
> > > > >
> > > > > On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
> > > > >
> > > > > > If we are going to add a new FATAL state, we might potentially
> > > > > > have to add FATAL to all state models, and all applications might
> > > > > > have to implement ERROR->FATAL and FATAL->initial_state
> > > > > > transitions.
> > > > > >
> > > > > > On the other hand, I have a couple of questions:
> > > > > > 1) Why is the ERROR state inevitable in your use case?
> > > > > > 2) If a partition goes to the ERROR state, could we reset it, so
> > > > > > that only error partitions get an ERROR->initial_state transition,
> > > > > > and then drop it? If no error happens during ERROR->initial_state,
> > > > > > the error is recoverable and the resource will be dropped.
> > > > > > Otherwise, if something goes wrong with ERROR->initial_state, the
> > > > > > participant remains in the ERROR state, the drop fails, and the
> > > > > > resource is not reusable?
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
> > > > > >
> > > > > > > For our use case that's somewhat problematic. It's still better
> > > > > > > than the current inability to go from error to dropped, but the
> > > > > > > problem now is that if something goes wrong when dropping,
> > > > > > > there's no way to know that from the participant states. And
> > > > > > > that's actually the only unrecoverable situation for our use
> > > > > > > case. Basically it means that the participant cannot be reused
> > > > > > > for another purpose. An alternative solution would be to have a
> > > > > > > FATAL state that is reached when a failure occurs when
> > > > > > > transitioning out of the ERROR state.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Santi
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I am going to add support for the error->drop transition in Helix.
> > > > > > > > The basic idea is to remove the DROPPED state from the state model;
> > > > > > > > instead we add a drop() (or cleanup()) abstract method in
> > > > > > > > StateModel. Applications need to implement this abstract method to
> > > > > > > > take care of the drop logic. This requires no change on the
> > > > > > > > controller side. On the participant side, when the participant
> > > > > > > > receives a state-transition message with ToState=DROPPED, it will
> > > > > > > > invoke the drop() method in the state model. When drop() gets
> > > > > > > > executed, the partition will be removed from the current state
> > > > > > > > regardless of any errors/exceptions during the execution of drop().
> > > > > > > > This will prevent an infinite loop of calling drop() in case of an
> > > > > > > > error/exception in the execution of drop(). The advantage of this
> > > > > > > > design is that we can remove the DROPPED state entirely from all
> > > > > > > > state model definitions, which keeps the state model simple. The
> > > > > > > > disadvantage is that in drop() the application needs to apply
> > > > > > > > different drop logic based on the current state (e.g. MASTER,
> > > > > > > > SLAVE, or ERROR, which will be the FromState in the message). Any
> > > > > > > > suggestions?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jason
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
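
For reference, the onError()/drop()/reset() handling described in the
quoted proposal could be sketched roughly as below. This is only an
illustration under assumptions, not Helix code: ErrorDropSketch, Callbacks
and the three static helpers are hypothetical names, states are plain
strings, and passing the from/to states into onError() is an assumption
prompted by question 2 above, not something the thread has settled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed transitions; none of these names are Helix APIs. */
class ErrorDropSketch {
    static final String ERROR = "ERROR";
    static final String DROPPED = "DROPPED";
    static final String FATAL = "FATAL";

    /** Callbacks with no-op defaults, so applications that don't care need not implement them. */
    interface Callbacks {
        // Assumption: the failed transition is passed in; the thread leaves this open.
        default void onError(String fromState, String toState, Exception cause) {}
        default void drop() throws Exception {}
        default void reset() throws Exception {}
    }

    /** A transition from a user-defined state failed: notify the model, end up in ERROR. */
    static String onTransitionFailure(Callbacks model, String from, String to, Exception cause) {
        model.onError(from, to, cause);
        return ERROR;
    }

    /** ERROR -> DROPPED: returns the states visited; ends in FATAL if drop() throws. */
    static List<String> dropFromError(Callbacks model, String initialState) {
        List<String> path = new ArrayList<>();
        path.add(ERROR);
        try {
            model.drop();
            path.add(initialState); // drop() succeeded: ERROR -> initial state -> DROPPED
            path.add(DROPPED);
        } catch (Exception e) {
            path.add(FATAL);        // drop() failed: ERROR -> FATAL
        }
        return path;
    }

    /** Reset from FATAL; invoked once per admin command and never retried here. */
    static String resetFromFatal(Callbacks model, String initialState) {
        try {
            model.reset();
            return initialState;
        } catch (Exception e) {
            return FATAL;           // stay in FATAL until the next admin command
        }
    }

    public static void main(String[] args) {
        Callbacks healthy = new Callbacks() {};
        Callbacks broken = new Callbacks() {
            @Override public void drop() throws Exception { throw new Exception("cleanup failed"); }
        };
        System.out.println(dropFromError(healthy, "OFFLINE")); // [ERROR, OFFLINE, DROPPED]
        System.out.println(dropFromError(broken, "OFFLINE"));  // [ERROR, FATAL]
    }
}
```

In this sketch a participant whose model does not override anything still
gets the ERROR -> DROPPED path for free, and FATAL is only reachable when
drop() itself throws.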
