Github user lei-xia commented on the pull request:
https://github.com/apache/helix/commit/bada911c7f246cf685c30323e118402cca89111d#commitcomment-26944027
In helix-core/src/main/java/org/apache/helix/GroupCommit.java on line 142:
This is for updating a partition's CurrentState to ZK. For regular
resources (not jobs), if this update fails, the controller will send the state
transition request again; once the participant gets the request, it realizes
its local CurrentState is inconsistent with the one on ZK, so it will try to
update ZK again. The only consequence is that the state transition latency
observed by spectators gets longer.
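To make that recovery path concrete, here is a minimal sketch of the idea; the
class and the readCurrentStateFromZk/writeCurrentStateToZk helpers are
hypothetical stand-ins for illustration, not the actual Helix participant code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CurrentStateRecoverySketch {
    private final Map<String, String> localCurrentStates = new ConcurrentHashMap<>();

    // Called when the controller re-sends a state transition request.
    public void onStateTransitionRequest(String partition) {
        String localState = localCurrentStates.get(partition);
        String zkState = readCurrentStateFromZk(partition);
        // If the earlier write was lost, the ZK copy lags behind the local
        // copy; the participant simply writes its local state again.
        if (localState != null && !localState.equals(zkState)) {
            writeCurrentStateToZk(partition, localState);
        }
    }

    private String readCurrentStateFromZk(String partition) {
        return null; // placeholder: would read the CurrentState znode
    }

    private void writeCurrentStateToZk(String partition, String state) {
        // placeholder: would write the CurrentState znode
    }
}
```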
However, since a job's completion is triggered by its state update to ZK,
if that update fails the controller has no way to tell whether the job is still
running or completed. So the task will hang there from the controller's
perspective even though it has actually completed on the participant. We added
a retry to reduce this chance; adding a task timeout would be another option to
avoid a job hanging forever if this happens.
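As a rough illustration of that retry (the attempt count and the Runnable
wrapper are assumptions for the sketch, not what GroupCommit actually does):

```java
public class BoundedRetrySketch {
    private static final int MAX_ATTEMPTS = 3; // assumed retry budget

    // Attempts the ZK write a few times; returns false if all attempts fail.
    public boolean writeWithRetry(Runnable zkWrite) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                zkWrite.run();
                return true;
            } catch (RuntimeException e) {
                // transient ZK failure: fall through and try again
            }
        }
        // If we land here the job may still look "running" to the controller;
        // a task timeout would be the backstop for that case.
        return false;
    }
}
```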
To improve the retry, adding exponential backoff could be another
option. Buffering failed updates and retrying them in the background could be
complicated, because we need to preserve the update order: if a new update to
the same znode arrives while a buffered update is still pending in the queue,
we need to make sure the older one is written to ZK first. This could be
non-trivial to implement.
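A minimal exponential-backoff sketch, assuming an illustrative base delay,
cap, and attempt budget (none of these values come from the Helix code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class ExponentialBackoffSketch {
    // Retries the given write, doubling the delay after each failure.
    public static <T> T callWithBackoff(Callable<T> zkWrite) throws Exception {
        long delayMs = 100;            // assumed base delay
        final long maxDelayMs = 5_000; // assumed cap
        final int maxAttempts = 6;     // assumed attempt budget
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return zkWrite.call();
            } catch (Exception e) {
                last = e;
                TimeUnit.MILLISECONDS.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw last;
    }
}
```

In practice a cap (and often some jitter) keeps retries from hammering ZK or
synchronizing across participants.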
We assume ZK unavailability will be very brief, and that most network
outages are intermittent (recovered quickly). If a network outage lasts longer
than the ZK session timeout (say 30s), the participant will be disconnected
from Helix and all of its CurrentStates will be reset anyway. So rather than
adding more code complexity to guarantee that our writes to ZK are delivered,
we may instead focus on making sure that even if a write fails, the participant
always has another way to recover eventually, even if that incurs a little
more delay.
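For reference, that session-expiry backstop can be observed through the plain
ZooKeeper Watcher API; the resetAllCurrentStates callback below is a
hypothetical placeholder for whatever reset the participant performs on
reconnect, not Helix's actual handler:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class SessionExpirySketch implements Watcher {
    private final Runnable resetAllCurrentStates; // hypothetical reset hook

    public SessionExpirySketch(Runnable resetAllCurrentStates) {
        this.resetAllCurrentStates = resetAllCurrentStates;
    }

    @Override
    public void process(WatchedEvent event) {
        // An outage longer than the session timeout surfaces here as Expired;
        // any lost CurrentState writes become moot because state is rebuilt.
        if (event.getState() == Event.KeeperState.Expired) {
            resetAllCurrentStates.run();
        }
    }
}
```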
---