mingmwang commented on code in PR #184:
URL: https://github.com/apache/arrow-ballista/pull/184#discussion_r962445327


##########
ballista/rust/scheduler/src/state/execution_graph.rs:
##########
@@ -581,6 +726,54 @@ impl ExecutionGraph {
         }
     }
 
+    /// Convert running stage to be unresolved
+    fn rollback_running_stage(&mut self, stage_id: usize) -> Result<bool> {

Review Comment:
   Yes. In this PR, if the scheduler decides to roll back, the entire stage is 
rolled back and rescheduled. Even tasks that have already completed are rolled 
back and rerun. This approach is simple and safe: because the stage's input 
stages need to be recomputed, we cannot be sure that the recomputed result is 
deterministic.
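
   A minimal sketch of the whole-stage rollback described above. The `Graph`, 
`Stage`, `StageState`, and `TaskState` types are hypothetical simplifications 
for illustration, not the actual types in `execution_graph.rs`; only the method 
name `rollback_running_stage` comes from the diff, and the error type here is 
a plain `String` rather than the crate's own `Result`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified task state (not the real Ballista type).
#[derive(Debug, Clone, PartialEq)]
enum TaskState {
    Pending,
    Running { executor_id: String },
    Completed { executor_id: String },
}

/// Hypothetical, simplified stage state.
#[derive(Debug, PartialEq)]
enum StageState {
    Unresolved,
    Running,
}

struct Stage {
    state: StageState,
    tasks: Vec<TaskState>,
}

struct Graph {
    stages: HashMap<usize, Stage>,
}

impl Graph {
    /// Convert a running stage back to unresolved.
    /// Returns Ok(true) if the stage was rolled back, Ok(false) if it was
    /// not running, and Err if the stage id is unknown.
    fn rollback_running_stage(&mut self, stage_id: usize) -> Result<bool, String> {
        let stage = self
            .stages
            .get_mut(&stage_id)
            .ok_or_else(|| format!("unknown stage {}", stage_id))?;
        if stage.state != StageState::Running {
            return Ok(false);
        }
        // Reset every task, including completed ones: the inputs will be
        // recomputed and may not produce identical results.
        for task in stage.tasks.iter_mut() {
            *task = TaskState::Pending;
        }
        stage.state = StageState::Unresolved;
        Ok(true)
    }
}
```

   Resetting completed tasks costs some recomputation, but it avoids mixing 
outputs computed from two different versions of the input.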
   
   Regarding when to trigger the executor-lost handling: as I noted in the PR 
description, there are four cases. I will implement the shuffle reader failure 
check in a separate PR that focuses on task failure handling.
   
   1) Executor graceful shutdown: the Executor calls an RPC to notify the 
Scheduler that it has stopped.
   2) The Scheduler expires dead Executors from which it has stopped receiving 
heartbeats.
   3) Shuffle Reader failure due to connectivity issues in reduce tasks (not 
implemented; will be covered in the following task-failure-related PRs).
   4) gRPC connection loss detection (not implemented).
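
   The four cases above could be modeled as an enum so that each trigger feeds 
the same executor-lost handling. This is only a sketch: the enum and method 
names are hypothetical, and marking the first two cases as handled is inferred 
from the "not implemented" notes on the other two.

```rust
/// Hypothetical enumeration of what can trigger executor-lost handling.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExecutorLostTrigger {
    /// 1) The Executor notified the Scheduler via RPC that it stopped.
    GracefulShutdown,
    /// 2) The Scheduler expired the Executor after missed heartbeats.
    HeartbeatExpired,
    /// 3) A reduce task failed to read shuffle data from the Executor.
    ShuffleReadFailure,
    /// 4) The gRPC connection to the Executor was detected as lost.
    GrpcConnectionLost,
}

impl ExecutorLostTrigger {
    /// True for the cases not marked "not implemented" in the list above
    /// (an assumption based on that list).
    fn is_implemented(self) -> bool {
        matches!(self, Self::GracefulShutdown | Self::HeartbeatExpired)
    }
}
```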
   
   I think we still need the Scheduler to actively expire dead executors. 
For example, if the map stage has heavy tasks and an executor is lost, no task 
updates come back from the lost executor; in that case we can only rely on the 
Scheduler's expiration logic.
    
   We can make the expiration time longer than the executor's heartbeat 
interval, so that the Scheduler considers an executor dead only after it has 
missed two or three consecutive heartbeats.
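
   A rough sketch of that expiration rule, assuming the timeout is set to the 
heartbeat interval multiplied by the number of heartbeats an executor may 
miss. `HeartbeatTracker` and all of its fields and methods are hypothetical 
names for illustration, not part of the scheduler's API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker of the last heartbeat seen per executor.
struct HeartbeatTracker {
    heartbeat_interval: Duration,
    missed_allowed: u32,
    last_seen: HashMap<String, Instant>,
}

impl HeartbeatTracker {
    fn new(heartbeat_interval: Duration, missed_allowed: u32) -> Self {
        Self {
            heartbeat_interval,
            missed_allowed,
            last_seen: HashMap::new(),
        }
    }

    /// Expiration time: several heartbeat intervals, not just one.
    fn expiration(&self) -> Duration {
        self.heartbeat_interval * self.missed_allowed
    }

    fn record_heartbeat(&mut self, executor_id: &str, now: Instant) {
        self.last_seen.insert(executor_id.to_string(), now);
    }

    /// Remove and return (sorted by id) the executors whose last heartbeat
    /// is older than the expiration time.
    fn expire_dead(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.expiration();
        let mut dead: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > timeout)
            .map(|(id, _)| id.clone())
            .collect();
        dead.sort();
        for id in &dead {
            self.last_seen.remove(id);
        }
        dead
    }
}
```

   With a 10-second interval and `missed_allowed = 3`, an executor is expired 
only once more than 30 seconds have passed since its last heartbeat, so a 
single delayed heartbeat does not trigger a stage rollback.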



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
