mingmwang commented on code in PR #184:
URL: https://github.com/apache/arrow-ballista/pull/184#discussion_r962445327
##########
ballista/rust/scheduler/src/state/execution_graph.rs:
##########
@@ -581,6 +726,54 @@ impl ExecutionGraph {
         }
     }
 
+    /// Convert running stage to be unresolved
+    fn rollback_running_stage(&mut self, stage_id: usize) -> Result<bool> {

Review Comment:
   Yes. In this PR, if the scheduler decides to roll back, the entire stage is rolled back and rescheduled. Even tasks that have already completed are rolled back and rescheduled. This approach is simple and safe: because the stage's input stages need to be recomputed, we cannot be sure whether the recomputed result is deterministic.

   Regarding when to trigger the executor-lost handling, as I noted in the PR's description, there are four possible cases. I will implement the shuffle reader failure check in a separate PR that focuses on task failure handling.

   1) Executor graceful shutdown: the Executor calls an RPC to notify the Scheduler that it has stopped.
   2) The Scheduler expires dead Executors from which it has stopped receiving heartbeats.
   3) Shuffle Reader failure due to a connectivity issue in reduce tasks (not implemented; will be covered in the following task-failure-related PRs).
   4) gRPC connection loss detection (not implemented).

   I think we still need the Scheduler to actively expire dead executors. For example, if the map stage has heavy tasks and the executor is lost, no task updates come back from the lost executor; in this case we can only rely on the Scheduler's expiration logic. We can set the expiration time longer than the executor's heartbeat interval, so that the Scheduler considers an executor dead only after it misses two or three consecutive heartbeats.
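
   To illustrate the expiration idea from case 2, here is a minimal sketch and not the code in this PR. `HeartbeatTracker`, `record_heartbeat`, `expire_dead`, and the `missed_allowed` parameter are hypothetical names chosen for the example; the real scheduler state may be organized quite differently. It assumes the expiration window is simply the heartbeat interval multiplied by the number of heartbeats an executor is allowed to miss.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical heartbeat tracker (illustration only, not Ballista's API).
struct HeartbeatTracker {
    /// An executor is considered dead after this much silence.
    expiration: Duration,
    /// Time of the most recent heartbeat, keyed by executor id.
    last_seen: HashMap<String, Instant>,
}

impl HeartbeatTracker {
    /// Assumption: expiration = heartbeat interval * allowed missed heartbeats,
    /// so a single delayed heartbeat does not mark an executor as dead.
    fn new(heartbeat_interval: Duration, missed_allowed: u32) -> Self {
        Self {
            expiration: heartbeat_interval * missed_allowed,
            last_seen: HashMap::new(),
        }
    }

    /// Records (or refreshes) the heartbeat time for an executor.
    fn record_heartbeat(&mut self, executor_id: &str, now: Instant) {
        self.last_seen.insert(executor_id.to_string(), now);
    }

    /// Removes and returns, in sorted order, the executors whose last
    /// heartbeat is older than the expiration window.
    fn expire_dead(&mut self, now: Instant) -> Vec<String> {
        let expiration = self.expiration;
        let mut dead: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > expiration)
            .map(|(id, _)| id.clone())
            .collect();
        dead.sort();
        for id in &dead {
            self.last_seen.remove(id);
        }
        dead
    }
}
```

   In a sketch like this, the scheduler would call `expire_dead` periodically and run the executor-lost handling (rolling back the affected stages) for each id it returns. Passing `now` explicitly rather than reading the clock inside the tracker keeps the expiration logic easy to test.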