Re: support error->drop transition in helix

Santiago Perez Fri, 22 Mar 2013 21:50:31 -0700

Sounds great.

Two questions:


1) any chance this could be included in this month's release?
2) can I help in any way?

Thanks,
Santi


On Fri, Mar 22, 2013 at 9:01 PM, Zhen Zhang <[email protected]> wrote:

> Here is my thought:
>
> 1) yes. DROPPED logic will remain the same. We will first transit to user
> defined initial state and then to DROPPED state.
>
> 2) StateModel.onError() should provide the state-transition message that
> caused the error. It should look like StateModel.onError(Message message,
> NotificationContext context). We could either embed the error context into
> the notification context or provide an additional error context as an
> argument. We probably need to provide context for drop() and reset() also?
>
> 3) If onError() fails, we can still transit to ERROR, or we can go directly
> to FATAL. If drop() or reset() fails, we remain in FATAL.
>
> Any suggestions?
>
> Thanks,
> Jason
>
> On Fri, Mar 22, 2013 at 12:04 PM, Santiago Perez <[email protected]
> >wrote:
>
> > Sounds good, couple of questions though:
> >
> > 1) What will happen when transiting from a user defined state to DROPPED?
> > Same as today?
> > 2) Will there be a way in onError() to know what transition was taking
> > place? Or is that up to the implementation? Are there any possible
> > directions to be given in the onError() callback?
> > 3) What will the behavior be if any of these methods (other than drop())
> > fail? Simply ignored?
> >
> > Thanks,
> > Santi
> >
> >
> > On Fri, Mar 22, 2013 at 3:29 PM, Zhen Zhang <[email protected]> wrote:
> >
> > > Hi, I am fine with the FATAL state, but I think we should clearly
> > > separate Helix defined states from user defined states. Helix defined
> > > states (i.e. ERROR, DROPPED, FATAL) need not be defined in the state
> > > model, and state transition logic involving Helix defined states should
> > > be common to all state models. In addition, Helix should provide default
> > > implementations for transitions involving Helix defined states, so
> > > applications that don't care about them don't have to implement these
> > > transitions. Here is what I am thinking of:
> > >
> > > - Helix will invoke StateModel.onError() if the current state is any user
> > > defined state and an error occurs in the transition.
> > >
> > > - Helix will invoke StateModel.drop() if the current state is ERROR and
> > > the target state is DROPPED. If drop() succeeds, ERROR will transit to
> > > the initial state and then to DROPPED; otherwise to the FATAL state.
> > >
> > > - Helix will invoke StateModel.reset() if the current state is FATAL and
> > > we issue a reset command. If reset() succeeds, FATAL will transit to the
> > > initial state; otherwise it remains in the FATAL state. Also, reset()
> > > should be invoked only by admin commands, so that if reset() fails, we
> > > don't call it infinitely.
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > > On Fri, Mar 22, 2013 at 5:36 AM, Santiago Perez <[email protected]
> > > >wrote:
> > >
> > > > I personally prefer the FATAL state approach. What do you think, Jason?
> > > >
> > > >
> > > > On Fri, Mar 22, 2013 at 4:50 AM, kishore g <[email protected]>
> > wrote:
> > > >
> > > > > Hi Terence/Jason/Santi,
> > > > >
> > > > > Did we come to a conclusion on this? Terence's proposal looks good
> > > > > to me. If adding a FATAL state is more invasive, I suggest simply
> > > > > disabling the partition on that node and setting a reason for the
> > > > > disabling for auditing/diagnosis. The advantage of this is that if
> > > > > the underlying error is rectified, one can enable the partition and
> > > > > the ERROR->DROP transition will be invoked. Disabling ensures that
> > > > > even if the node restarts, it will not host that partition again.
> > > > >
> > > > > thanks,
> > > > > Kishore G
> > > > >
> > > > >
> > > > > On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]>
> > wrote:
> > > > >
> > > > > > I proposed the FATAL state to Kishore before. Let me write it down
> > > > > > again for discussion.
> > > > > >
> > > > > > 1. An extra state, "FATAL", is introduced. It is a system state,
> > > > > > just like the existing ERROR state, which doesn't need to be
> > > > > > explicitly defined in the state model.
> > > > > > 2. Just like the current implementation, whenever there is any
> > > > > > error during a participant state transition, transit the
> > > > > > participant into the ERROR state and stay there.
> > > > > > 3. Also just like the current implementation, when a given resource
> > > > > > is deleted, trigger a state transition from CURRENT_STATE ->
> > > > > > DROPPED (going through the necessary state transitions based on the
> > > > > > state model).
> > > > > > 4. For participants that have current state = ERROR, trigger the
> > > > > > ERROR->DROPPED transition (there can be a default callback in the
> > > > > > StateModel that does nothing in this transition, but that is up to
> > > > > > further discussion).
> > > > > > 5. If and only if an exception is thrown during the ERROR->DROPPED
> > > > > > transition, transit the participant to the FATAL state.
> > > > > > 6. When a participant gets into the FATAL state, there is no way
> > > > > > for it to get out without human intervention, meaning a human needs
> > > > > > to inspect and reset it manually (or through some tools).
> > > > > >
> > > > > > With this, there would be changes in the Controller, but no change
> > > > > > in the participant if there is nothing to be specially handled
> > > > > > during the ERROR->DROPPED transition. Also, all error handling
> > > > > > would be done through state transitions, which gives the
> > > > > > participant a more consistent way of handling different scenarios.
> > > > > > This also guarantees that every call is sync and thread safe.
> > > > > >
> > > > > > Terence
> > > > > >
> > > > > > On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <
> > > [email protected]
> > > > > > >wrote:
> > > > > >
> > > > > > > In my proposal FATAL would be a final state, manual intervention
> > > > > > > required.
> > > > > > >
> > > > > > > 1) In our use case, the problem arises when a regular transition
> > > > > > > (say offline->online) fails and goes to the ERROR state. If the
> > > > > > > resource then gets removed, the participant remains in the
> > > > > > > "ERROR" state, so we can't reuse it, because in order to reuse it
> > > > > > > we need to transit to dropped first.
> > > > > > > 2) The thing is, in our use case the drop comes from an API call
> > > > > > > which is not synchronized with the cluster management code which
> > > > > > > could issue the reset. Also, if we reset it, wouldn't the
> > > > > > > controller push the transitions trying to reach the ideal state
> > > > > > > again (likely triggering the same issue that led to ERROR)?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Santi
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <
> [email protected]
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > If we are going to add a new FATAL state, we might potentially
> > > > > > > > add FATAL to all state models, and all applications might have
> > > > > > > > to implement ERROR->FATAL and FATAL->initial_state transitions.
> > > > > > > >
> > > > > > > > On the other hand, I have a couple of questions:
> > > > > > > > 1) Why is the ERROR state inevitable in your use case?
> > > > > > > > 2) If a partition goes to the ERROR state, could we reset it,
> > > > > > > > so only error partitions will get an ERROR->initial_state
> > > > > > > > transition, and then drop it? If no error happens during
> > > > > > > > ERROR->initial_state, the error is recoverable, and the
> > > > > > > > resource will be dropped. Otherwise, if something goes wrong
> > > > > > > > with ERROR->initial_state, the participant remains in the ERROR
> > > > > > > > state, the drop failed, and the resource is not reusable?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > > On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <
> > > > > [email protected]
> > > > > > > > >wrote:
> > > > > > > >
> > > > > > > > > For our use case that's somewhat problematic. It's still
> > > > > > > > > better than the current inability to go from error to
> > > > > > > > > dropped, but the problem now is that if something goes wrong
> > > > > > > > > when dropping, there's no way to know that from the
> > > > > > > > > participant states. And that's actually the only
> > > > > > > > > unrecoverable situation for our use case. Basically it means
> > > > > > > > > that the participant cannot be reused for another purpose. An
> > > > > > > > > alternative solution would be to have a FATAL state that is
> > > > > > > > > reached when a failure occurs when transitioning out of the
> > > > > > > > > ERROR state.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Santi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <
> > [email protected]>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I am going to add support for the error->drop transition in
> > > > > > > > > > Helix. The basic idea is to remove the DROPPED state from
> > > > > > > > > > the state model; instead we add a drop() (or cleanup())
> > > > > > > > > > abstract method in StateModel. Applications need to
> > > > > > > > > > implement this abstract method to take care of the drop
> > > > > > > > > > logic. This requires no change on the controller side. On
> > > > > > > > > > the participant side, when the participant receives a
> > > > > > > > > > state-transition message with ToState=DROPPED, it will
> > > > > > > > > > invoke the drop() method in the state model. When drop()
> > > > > > > > > > gets executed, the partition will be removed from the
> > > > > > > > > > current state regardless of any errors/exceptions during
> > > > > > > > > > the execution of drop(). This will prevent an infinite loop
> > > > > > > > > > of calling drop() in case of an error/exception in the
> > > > > > > > > > execution of drop(). The advantage of this design is that
> > > > > > > > > > we can remove the DROPPED state totally from all state
> > > > > > > > > > model definitions, which keeps the state model simple. The
> > > > > > > > > > disadvantage is that in drop() the application needs to
> > > > > > > > > > apply different drop logic based on the current state (e.g.
> > > > > > > > > > MASTER, SLAVE, or ERROR, which will be the FromState in the
> > > > > > > > > > message). Any suggestions?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jason
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
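
For readers of the archive, the drop() and reset() rules proposed in the
thread above can be sketched in a few lines of Java. This is only an
illustration of the discussion, not Helix code: SketchState, SketchCallbacks
and TransitionSketch are hypothetical names, INITIAL stands in for whatever
initial state a user defined state model declares, and the real StateModel
signatures were still under discussion at the time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** States named in the thread; INITIAL stands in for a user defined initial state. */
enum SketchState { INITIAL, ERROR, DROPPED, FATAL }

/** Hypothetical callbacks modeled on the proposed drop() and reset(); not the Helix StateModel class. */
interface SketchCallbacks {
    void drop() throws Exception;
    void reset() throws Exception;
}

final class TransitionSketch {
    private TransitionSketch() {}

    /**
     * Drop requested while the partition is in ERROR.
     * Returns the states visited after ERROR: INITIAL then DROPPED if
     * drop() succeeds, or just FATAL if drop() throws.
     */
    static List<SketchState> dropFromError(SketchCallbacks callbacks) {
        try {
            callbacks.drop();
            return Arrays.asList(SketchState.INITIAL, SketchState.DROPPED);
        } catch (Exception e) {
            return Collections.singletonList(SketchState.FATAL);
        }
    }

    /**
     * Admin-issued reset while the partition is in FATAL.
     * Returns INITIAL if reset() succeeds; stays in FATAL if it throws.
     * There is no retry here, matching the point that reset() should only
     * run on an explicit admin command.
     */
    static SketchState resetFromFatal(SketchCallbacks callbacks) {
        try {
            callbacks.reset();
            return SketchState.INITIAL;
        } catch (Exception e) {
            return SketchState.FATAL;
        }
    }
}
```

Returning the visited states from dropFromError() keeps the intermediate hop
through the initial state visible, which is the detail the first question in
the thread asks about.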
