Github user lei-xia commented on the pull request:

    
https://github.com/apache/helix/commit/bada911c7f246cf685c30323e118402cca89111d#commitcomment-26944027
  
    In helix-core/src/main/java/org/apache/helix/GroupCommit.java on line 142:
    This is for updating the partition's CurrentState in ZK.  For regular 
resources (not jobs), if this update fails, the controller will send the state 
transition request again; once the participant gets the request, it realizes 
its local CurrentState is inconsistent with the one on ZK, so it will try to 
update ZK again. So the only consequence is that the state transition latency 
observed by spectators gets longer.
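
    As an aside, a rough sketch of that reconcile-on-resend idea is below. It 
is not Helix's actual participant code; it uses the raw ZooKeeper client, and 
the znode path and string serialization are only placeholders for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CurrentStateReconciler {
  private final ZooKeeper _zk;

  public CurrentStateReconciler(ZooKeeper zk) {
    _zk = zk;
  }

  /**
   * Re-writes the CurrentState znode if the value in ZK does not match the
   * participant's local view. Returns true if a write was issued.
   */
  public boolean reconcile(String currentStatePath, String localState)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    byte[] remote = _zk.getData(currentStatePath, false, stat);
    String remoteState = new String(remote, StandardCharsets.UTF_8);
    if (remoteState.equals(localState)) {
      return false; // ZK already reflects the local state; nothing to do
    }
    // Conditional write on the version just read, so a concurrent update shows
    // up as a BadVersion failure instead of being silently overwritten.
    _zk.setData(currentStatePath, localState.getBytes(StandardCharsets.UTF_8),
        stat.getVersion());
    return true;
  }
}
```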
    
    However, since a job's completion is triggered by its state update to ZK, 
if the update fails the controller has no way to tell whether the job is still 
running or completed.  So the task will hang there from the controller's 
perspective even though it has actually completed on the participant.  We 
added a retry to reduce this chance; adding a task timeout would be another 
option to avoid a job hanging forever if this happens.
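
    To illustrate the task timeout option, a controller-side deadline check 
could look roughly like the sketch below.  The probe interface and the way the 
deadline is tracked are assumptions for illustration, not the task framework's 
actual API.

```java
import java.time.Duration;
import java.time.Instant;

public class TaskTimeoutCheck {
  /** Hypothetical probe; in Helix this information would come from CurrentState. */
  public interface TaskStateProbe {
    boolean isTerminal(String taskId);
  }

  private final TaskStateProbe _probe;
  private final Duration _timeout;

  public TaskTimeoutCheck(TaskStateProbe probe, Duration timeout) {
    _probe = probe;
    _timeout = timeout;
  }

  /** Returns true if the task exceeded its deadline without reaching a terminal state. */
  public boolean isTimedOut(String taskId, Instant scheduledAt) {
    return !_probe.isTerminal(taskId)
        && Duration.between(scheduledAt, Instant.now()).compareTo(_timeout) > 0;
  }
}
```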
    
    To improve the retry, adding exponential backoff could be another option.  
Buffering failed updates and retrying them in the background could be 
complicated, because we need to preserve the update order: if there is a new 
update to the same znode while a buffered update is still pending in the 
queue, we need to make sure the old one is written to ZK first.  This could be 
non-trivial to implement.
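
    A minimal sketch of the exponential backoff option is below, wrapping a 
generic write attempt.  The callback shape and the retry limits are 
assumptions for illustration; GroupCommit's existing retry is not written this 
way.

```java
import java.util.concurrent.Callable;

public final class BackoffRetry {
  private BackoffRetry() {
  }

  /**
   * Retries the given write until it succeeds or maxAttempts is reached,
   * doubling the sleep between attempts and capping it at maxDelayMs.
   */
  public static boolean runWithBackoff(Callable<Boolean> write, int maxAttempts,
      long initialDelayMs, long maxDelayMs) throws InterruptedException {
    long delayMs = initialDelayMs;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        if (Boolean.TRUE.equals(write.call())) {
          return true;
        }
      } catch (InterruptedException e) {
        throw e; // do not swallow interruption
      } catch (Exception e) {
        // Treat any other exception as a failed attempt and retry.
      }
      if (attempt < maxAttempts) {
        Thread.sleep(delayMs);
        delayMs = Math.min(delayMs * 2, maxDelayMs);
      }
    }
    return false;
  }
}
```

    The same shape could wrap the CurrentState update that is retried today; 
the ordering concern above only appears if failed writes are deferred to a 
background queue instead of being retried inline.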
    
    We assume ZK unavailability will be very brief, and most network outages 
are intermittent (recovered quickly).  If a network outage lasts longer than 
the ZK session timeout (say 30s), the participant will be disconnected from 
Helix and all of its CurrentStates will be reset anyway.  So rather than 
adding more code complexity to guarantee that our write to ZK is delivered, we 
may instead focus on making sure that even if the write fails, the participant 
always has another way to recover eventually, even if it takes a little more 
delay.

