Re: support error->drop transition in helix

kishore g Fri, 22 Mar 2013 00:50:32 -0700

Hi Terence/Jason/Santi,

Did we come to a conclusion on this? Terence's proposal looks good to me. If
adding a FATAL state is too invasive, I suggest simply disabling the
partition on that node and setting a reason for the disablement for
auditing/diagnosis. The advantage is that once the underlying error is
rectified, one can enable the partition and the ERROR->DROP transition will
be invoked. Disabling ensures that even if the node restarts, it will not host
that partition again.
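
To make the disabling alternative concrete, here is a minimal sketch in plain
Java. It does not use the Helix API: PartitionRegistry, enableAndDrop and the
other names are hypothetical, and the re-disable on a failed ERROR->DROP is
only one reading of the suggestion above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, Helix-independent model of one node's partition bookkeeping.
class PartitionRegistry {
    enum State { OFFLINE, ONLINE, ERROR, DROPPED }

    private final Map<String, State> states = new HashMap<>();
    // Partition name -> reason it was disabled (kept for auditing/diagnosis).
    private final Map<String, String> disabledReasons = new HashMap<>();

    void setState(String partition, State state) { states.put(partition, state); }

    State getState(String partition) { return states.get(partition); }

    // Disable the partition on this node and record why.
    void disable(String partition, String reason) { disabledReasons.put(partition, reason); }

    String disabledReason(String partition) { return disabledReasons.get(partition); }

    // A restarting node may only host partitions that are not disabled.
    boolean mayHost(String partition) { return !disabledReasons.containsKey(partition); }

    // Once the underlying error is fixed: re-enable, then run ERROR->DROP.
    // dropCallback stands in for the application's transition handler.
    boolean enableAndDrop(String partition, Runnable dropCallback) {
        disabledReasons.remove(partition);
        if (states.get(partition) != State.ERROR) {
            return false; // nothing to drop from ERROR
        }
        try {
            dropCallback.run();
            states.put(partition, State.DROPPED);
            return true;
        } catch (RuntimeException e) {
            // Assumption: a failed drop disables the partition again, with a reason.
            disable(partition, "ERROR->DROP failed: " + e.getMessage());
            return false;
        }
    }
}
```

The point of the sketch is that no extra state is needed: the partition stays
in ERROR, and the disabled flag plus its reason carry the "needs a human"
information.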


thanks,
Kishore G


On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]> wrote:

> I proposed the FATAL state to Kishore before. Let me write it down again
> for discussion.
>
> 1. An extra state, "FATAL", is introduced. It is a system state, just like
> the existing ERROR state, which doesn't need to be explicitly defined in
> the state model.
> 2. Just like the current implementation, whenever there is any error during
> a participant state transition, move the participant into the ERROR state and
> stay there.
> 3. Also just like the current implementation, when a given resource is deleted,
> trigger a state transition from CURRENT_STATE -> DROPPED (going through the
> necessary state transitions based on the state model).
> 4. For participants that have current state = ERROR, trigger an ERROR->DROPPED
> transition (there can be a default callback in the StateModel that does nothing
> in this transition, but that's up for further discussion).
> 5. If and only if an exception is thrown during the ERROR->DROPPED
> transition, move the participant to the FATAL state.
> 6. When a participant gets into the FATAL state, there is no way for it to get
> out without human intervention, meaning a human needs to inspect and
> reset it manually (or through some tools).
>
> With this, there would be changes in the Controller, but no change in the
> participant if there is nothing to handle specially during the ERROR->DROPPED
> transition. Also, all error handling would be done through state transitions,
> which gives the participant a more consistent way of handling different
> scenarios. This also guarantees that all calls are sync and thread safe.
>
> Terence
>
> On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:
>
> > In my proposal FATAL would be a final state, manual intervention required.
> >
> > 1) In our use case, the problem is that when a regular transition (say
> > offline->online) fails, the participant goes to the ERROR state. If the
> > resource then gets removed, the participant remains in the "ERROR" state, so
> > we can't reuse it, because in order to reuse it we need to transition to
> > dropped first.
> > 2) The thing is, in our use case the drop comes from an API call which is
> > not synchronized with the cluster management code which could issue the
> > reset. Also, if we reset it, wouldn't the controller push the transitions
> > trying to reach the ideal state again (likely triggering the same
> > issue that led to ERROR)?
> >
> > Thanks
> > Santi
> >
> >
> > On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
> >
> > > If we are going to add a new FATAL state, we might potentially add FATAL to
> > > all state models and all applications might have to implement ERROR->FATAL
> > > and FATAL->initial_state transitions.
> > >
> > > On the other hand, I have a couple of questions:
> > > 1) Why is the ERROR state inevitable in your use case?
> > > 2) If a partition goes to the ERROR state, could we reset it, so that only
> > > error partitions get an ERROR->initial_state transition, and then drop it? If
> > > no error happens during ERROR->initial_state, the error is recoverable and
> > > the resource will be dropped. Otherwise, if something goes wrong with
> > > ERROR->initial_state, the participant remains in the ERROR state, the drop
> > > fails, and the resource is not reusable?
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
> > >
> > > > For our use case that's somewhat problematic. It's still better than the
> > > > current inability to go from error to dropped, but the problem now is that if
> > > > something goes wrong when dropping, there's no way to know that from the
> > > > participant states. And that's actually the only unrecoverable situation
> > > > for our use case. Basically it means that the participant cannot be reused
> > > > for another purpose. An alternative solution would be to have a FATAL state
> > > > that is reached when a failure occurs when transitioning out of the ERROR
> > > > state.
> > > >
> > > > Cheers,
> > > > Santi
> > > >
> > > >
> > > > On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am going to add support for the error->drop transition in Helix. The
> > > > > basic idea is to remove the DROPPED state from the state model; instead we
> > > > > add a drop() (or cleanup()) abstract method to StateModel. Applications need
> > > > > to implement this abstract method to take care of the drop logic. This
> > > > > requires no change on the controller side. On the participant side, when
> > > > > the participant receives a state-transition message with ToState=DROPPED,
> > > > > it will invoke the drop() method in the state model. When drop() gets
> > > > > executed, the partition will be removed from the current state regardless
> > > > > of any errors/exceptions during the execution of drop(). This will prevent
> > > > > an infinite loop of calling drop() in case of an error/exception in the
> > > > > execution of drop(). The advantage of this design is that we can remove
> > > > > the DROPPED state entirely from all state model definitions, which keeps the
> > > > > state model simple. The disadvantage is that in drop() the application needs
> > > > > to apply different drop logic based on the current state (e.g. MASTER, SLAVE,
> > > > > or ERROR, which will be the FromState in the message). Any suggestions?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jason
> > > > >
> > > >
> > >
> >
>
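
For reference, the six steps of Terence's FATAL proposal quoted above can be
sketched as a small state function. This is an illustration only, in plain
Java with no Helix dependency. ErrorDropFlow, Transition and manualReset are
hypothetical names, and the walk through intermediate states in step 3 is
collapsed into a single hop.

```java
// Hypothetical sketch of the proposed ERROR/FATAL handling; not Helix code.
class ErrorDropFlow {
    enum State { OFFLINE, ONLINE, ERROR, FATAL, DROPPED }

    // Stand-in for an application's state transition callback.
    interface Transition { void run() throws Exception; }

    // Step 2: any failure in a regular transition parks the partition in ERROR.
    static State transition(State target, Transition handler) {
        try {
            handler.run();
            return target;
        } catch (Exception e) {
            return State.ERROR;
        }
    }

    // Steps 3-5: what happens when the resource is deleted.
    static State drop(State current, Transition handler) {
        if (current == State.FATAL) {
            return State.FATAL; // step 6: only a human can move it out of FATAL
        }
        if (current != State.ERROR) {
            // Step 3: normal path, CURRENT_STATE -> DROPPED (single hop here).
            return transition(State.DROPPED, handler);
        }
        try {
            handler.run(); // step 4: ERROR->DROPPED, default handler may do nothing
            return State.DROPPED;
        } catch (Exception e) {
            return State.FATAL; // step 5: only a failed ERROR->DROPPED leads to FATAL
        }
    }

    // Step 6: manual reset by an operator or a tool.
    static State manualReset(State current, State initial) {
        return current == State.FATAL ? initial : current;
    }
}
```

With a no-op handler, drop(State.ERROR, () -> {}) yields DROPPED, which is the
case where the participant needs no change at all.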
