milenkovicm commented on code in PR #1212:
URL:
https://github.com/apache/datafusion-ballista/pull/1212#discussion_r2020082454
##########
ballista/scheduler/src/scheduler_server/grpc.rs:
##########
@@ -124,14 +128,36 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> SchedulerGrpc
};
let mut tasks = vec![];
+ let mut prepare_failed_jobs = HashMap::<String, Vec<TaskDescription>>::new();
for (_, task) in schedulable_tasks {
- match self.state.task_manager.prepare_task_definition(task) {
+ let job_id = task.partition.job_id.clone();
+ if prepare_failed_jobs.contains_key(&job_id) {
+ prepare_failed_jobs.entry(job_id).or_default().push(task);
+ continue;
+ }
+ match self
+ .state
+ .task_manager
+ .prepare_task_definition(task.clone())
+ {
Ok(task_definition) => tasks.push(task_definition),
Err(e) => {
error!("Error preparing task definition: {:?}", e);
+ prepare_failed_jobs.entry(job_id).or_default().push(task);
}
}
}
+
Review Comment:
I wonder what the reason for this method is.
When we detect that preparing a task has failed, we cannot recover from it,
so the job should be cancelled.
Would self.cancel_job(job_id) trigger cancellation of all running tasks for
the given job and clean up the execution graph?
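
To make the alternative concrete, here is a minimal sketch of recording only
the failed job ids instead of collecting every failed task. It is not the
PR's code: `Task`, its `broken` flag, `prepare` and `prepare_all` are
hypothetical stand-ins for `TaskDescription` and `prepare_task_definition`,
and the sketch assumes that cancelling the whole job is enough. Whether
`cancel_job` really stops running tasks and cleans up the execution graph is
still the open question above.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a schedulable task; not Ballista's `TaskDescription`.
#[derive(Debug, Clone)]
struct Task {
    job_id: String,
    task_id: usize,
    /// Illustration only: marks a task whose preparation should fail.
    broken: bool,
}

/// Hypothetical stand-in for `prepare_task_definition`.
fn prepare(task: &Task) -> Result<String, String> {
    if task.broken {
        Err(format!("cannot prepare {}/{}", task.job_id, task.task_id))
    } else {
        Ok(format!("{}/{}", task.job_id, task.task_id))
    }
}

/// Returns the prepared definitions plus the ids of jobs that should be
/// cancelled as a whole because one of their tasks failed to prepare.
fn prepare_all(tasks: &[Task]) -> (Vec<String>, BTreeSet<String>) {
    let mut prepared: Vec<(String, String)> = Vec::new();
    let mut failed_jobs = BTreeSet::new();

    for task in tasks {
        // Skip further work for a job that is already known to be broken.
        if failed_jobs.contains(&task.job_id) {
            continue;
        }
        match prepare(task) {
            Ok(definition) => prepared.push((task.job_id.clone(), definition)),
            Err(_) => {
                failed_jobs.insert(task.job_id.clone());
            }
        }
    }

    // Drop definitions prepared earlier for a job that failed later on.
    prepared.retain(|(job_id, _)| !failed_jobs.contains(job_id));
    let definitions = prepared.into_iter().map(|(_, d)| d).collect();
    (definitions, failed_jobs)
}
```

The caller would then cancel each id in the returned set once, rather than
handling the failed tasks one by one.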