Hunter L created HELIX-787:
------------------------------

             Summary: TASK: Fix stuck tasks after Participant connection loss
                 Key: HELIX-787
                 URL: https://issues.apache.org/jira/browse/HELIX-787
             Project: Apache Helix
          Issue Type: Improvement
            Reporter: Hunter L
            Assignee: Hunter L


When Helix Participants lose their ZK connection and enter a new ZK session, 
all task partitions on those Participants are reset to the INIT state. This is 
undesirable because these tasks are effectively dropped and should be 
scheduled on another instance. This is the Controller-side fix for the 
problem: when the Controller detects tasks whose assigned Participants are no 
longer live, it marks them as DROPPED in their parent JobContext so that 
AssignableInstance does not consider them active when it is refreshed in the 
next pipeline run. This allows the dropped tasks to be reassigned to other 
instances.
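
The Controller-side check described above can be sketched roughly as follows. 
This is an illustrative, self-contained sketch, not the actual Helix patch: 
the class DroppedTaskSketch, its State enum, and the plain maps standing in 
for JobContext and the live-instance list are all hypothetical. It also 
assumes that only partitions still in INIT or RUNNING are marked, which the 
description does not state explicitly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DroppedTaskSketch {
    // Hypothetical subset of task partition states, for illustration only.
    public enum State { INIT, RUNNING, COMPLETED, DROPPED }

    /**
     * Marks every active partition whose assigned Participant is not live as
     * DROPPED and returns the ids of the partitions that were changed.
     *
     * assigned: partition id -> Participant name (stand-in for JobContext)
     * states:   partition id -> current state    (stand-in for JobContext)
     * live:     names of Participants that are currently live
     */
    public static Set<Integer> markDropped(Map<Integer, String> assigned,
                                           Map<Integer, State> states,
                                           Set<String> live) {
        Set<Integer> dropped = new TreeSet<>();
        for (Map.Entry<Integer, String> entry : assigned.entrySet()) {
            int partition = entry.getKey();
            State state = states.get(partition);
            // Assumption: only INIT and RUNNING count as "active".
            boolean active = state == State.INIT || state == State.RUNNING;
            if (active && !live.contains(entry.getValue())) {
                states.put(partition, State.DROPPED);
                dropped.add(partition);
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        Map<Integer, String> assigned = new HashMap<>();
        Map<Integer, State> states = new HashMap<>();
        assigned.put(0, "participant_1"); states.put(0, State.INIT);
        assigned.put(1, "participant_1"); states.put(1, State.RUNNING);
        assigned.put(2, "participant_2"); states.put(2, State.RUNNING);
        assigned.put(3, "participant_1"); states.put(3, State.COMPLETED);

        // participant_1 has lost its session; only participant_2 is live.
        Set<String> live = new HashSet<>(Arrays.asList("participant_2"));

        System.out.println(markDropped(assigned, states, live)); // prints [0, 1]
    }
}
```

In this sketch, partitions 0 and 1 become eligible for reassignment, while 
the completed partition 3 and the partition on the live Participant are left 
untouched.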

Note that a Participant-side fix must follow so that reset() puts task 
partitions in the DROPPED state rather than the INIT state. This change does 
not by itself resolve the stuck INIT states on the original Participant. 
However, by letting these tasks be assigned to other instances, it lets jobs 
and workflows complete, at which point their CurrentStates are dropped 
altogether.

Changelist:
1. Mark task partitions whose assigned Participants are no longer live as 
DROPPED in JobContext



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
