Hi Terence/Jason/Santi,

Did we come to a conclusion on this? Terence's proposal looks good to me. If adding a FATAL state is more invasive, I suggest simply disabling the partition on that node and setting a reason for the disabling, for auditing/diagnosis. The advantage is that once the underlying error is rectified, one can enable the partition again and the ERROR->DROP transition will be invoked. Disabling ensures that even if the node restarts, it will not host that partition again.
thanks,
Kishore G

On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]> wrote:

> I proposed the FATAL state to Kishore before. Let me write it down again
> for discussion.
>
> 1. An extra state, "FATAL", is introduced. It is a system state, just
>    like the existing ERROR state, which doesn't need to be explicitly
>    defined in the state model.
> 2. Just like the current implementation, whenever there is an error
>    during a participant state transition, transit the participant into
>    the ERROR state and let it stay there.
> 3. Also just like the current implementation, when a given resource is
>    deleted, trigger a state transition from CURRENT_STATE -> DROPPED
>    (going through the necessary state transitions based on the state
>    model).
> 4. For participants that have current state = ERROR, trigger an
>    ERROR->DROPPED transition (there can be a default callback in the
>    StateModel that does nothing in this transition, but that's up to
>    further discussion).
> 5. If and only if an exception is thrown during the ERROR->DROPPED
>    transition, transit the participant to the FATAL state.
> 6. When a participant gets into the FATAL state, there is no way for it
>    to get out without human intervention, meaning a human needs to
>    inspect and reset it manually (or through some tools).
>
> With this, there would be changes in the Controller, but no change in
> the participant if there is nothing to handle specially during the
> ERROR->DROPPED transition. Also, all error handling would be done
> through state transitions, which gives the participant a more
> consistent way of handling different scenarios. This also guarantees
> that every call is sync and thread safe.
>
> Terence
>
> On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:
>
> > In my proposal FATAL would be a final state, manual intervention
> > required.
> >
> > 1) In our use case, the problem arises when a regular transition (say
> > offline->online) fails and the participant goes to the ERROR state.
> > If the resource then gets removed, the participant remains in the
> > "ERROR" state, so we can't reuse it, because in order to reuse it we
> > need to transit to dropped first.
> > 2) The thing is, in our use case the drop comes from an API call
> > which is not synchronized with the cluster management code which
> > could issue the reset. Also, if we reset it, wouldn't the controller
> > push the transitions trying to reach the ideal state again (likely
> > triggering the same issue that led to ERROR)?
> >
> > Thanks
> > Santi
> >
> > On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
> >
> > > If we are going to add a new FATAL state, we might potentially add
> > > FATAL to all state models, and all applications might have to
> > > implement ERROR->FATAL and FATAL->initial_state transitions.
> > >
> > > On the other hand, I have a couple of questions:
> > > 1) Why is the ERROR state inevitable in your use case?
> > > 2) If a partition goes to the ERROR state, could we reset it, so
> > > that only error partitions get an ERROR->initial_state transition,
> > > and then drop it? If no error happens during ERROR->initial_state,
> > > the error is recoverable, and the resource will be dropped.
> > > Otherwise, if something goes wrong with ERROR->initial_state, the
> > > participant remains in the ERROR state, the drop failed, and the
> > > resource is not reusable?
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
> > >
> > > > For our use case that's somewhat problematic. It's still better
> > > > than the current inability to go from error to dropped, but the
> > > > problem now is that if something goes wrong when dropping,
> > > > there's no way to know that from the participant states. And
> > > > that's actually the only unrecoverable situation for our use
> > > > case. Basically it means that the participant cannot be reused
> > > > for another purpose. An alternative solution would be to have a
> > > > FATAL state that is reached when a failure occurs when
> > > > transitioning out of the ERROR state.
> > > >
> > > > Cheers,
> > > > Santi
> > > >
> > > > On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am going to add support for the error->drop transition in
> > > > > Helix. The basic idea is to remove the DROPPED state from the
> > > > > state model; instead we add a drop() (or cleanup()) abstract
> > > > > method in StateModel. Applications need to implement this
> > > > > abstract method to take care of the drop logic. This requires
> > > > > no change on the controller side. On the participant side,
> > > > > when the participant receives a state-transition message with
> > > > > ToState=DROPPED, it will invoke the drop() method in the state
> > > > > model. When drop() gets executed, the partition will be
> > > > > removed from the current state regardless of any
> > > > > errors/exceptions during the execution of drop(). This will
> > > > > prevent an infinite loop of calling drop() in case of an
> > > > > error/exception in the execution of drop(). The advantage of
> > > > > this design is that we can remove the DROPPED state totally
> > > > > from all state model definitions, which keeps the state model
> > > > > simple. The disadvantage is that in drop() the application
> > > > > needs to apply different drop logic based on the current state
> > > > > (e.g. MASTER, SLAVE, or ERROR, which will be the FromState in
> > > > > the message). Any suggestions?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jason
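
For reference, here is a minimal Java sketch of the ERROR/FATAL handling in Terence's steps 1-6. It is only an illustration of the proposed rules, not Helix code. Every name in it (ErrorDropSketch, State, Transition, transition, drop) is hypothetical, and it assumes the drop is modelled as a single attempt rather than the chain of intermediate transitions a real state model would go through.

```java
// Hypothetical sketch of the proposed ERROR -> DROPPED / FATAL rules.
// None of these types or methods are Helix APIs.
final class ErrorDropSketch {

    enum State { OFFLINE, ONLINE, ERROR, DROPPED, FATAL }

    /** Stand-in for application transition logic that may fail. */
    interface Transition {
        void run() throws Exception;
    }

    /** Step 2: any failure during a regular transition ends in ERROR. */
    static State transition(State target, Transition callback) {
        try {
            callback.run();
            return target;
        } catch (Exception e) {
            return State.ERROR;
        }
    }

    /**
     * Steps 3-6: attempt to drop a partition when its resource is deleted.
     * A failure starting from ERROR is the only path into FATAL (step 5).
     * A failure starting from any other state falls back to ERROR (step 2).
     * FATAL is never left automatically (step 6).
     */
    static State drop(State current, Transition dropCallback) {
        if (current == State.FATAL) {
            return State.FATAL; // needs manual inspection and reset
        }
        try {
            dropCallback.run();
            return State.DROPPED;
        } catch (Exception e) {
            return current == State.ERROR ? State.FATAL : State.ERROR;
        }
    }

    public static void main(String[] args) {
        // A failed OFFLINE -> ONLINE transition lands in ERROR.
        State s = transition(State.ONLINE, () -> {
            throw new IllegalStateException("transition failed");
        });
        System.out.println(s); // prints ERROR

        // A failed ERROR -> DROPPED transition lands in FATAL.
        s = drop(s, () -> {
            throw new IllegalStateException("cleanup failed");
        });
        System.out.println(s); // prints FATAL
    }
}
```

With a no-op drop callback (the default suggested in step 4), drop(State.ERROR, () -> {}) returns DROPPED, so participants with nothing special to clean up would not need to change.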
