mingmwang commented on PR #184: URL: https://github.com/apache/arrow-ballista/pull/184#issuecomment-1236511171
> Had a chance to review more carefully and I'm not sure I understand how this will interact with task statuses that are arriving from the executors. I haven't traced all the way through the logic but from what I can tell, if a stage is rolled back due to an executor being lost and a task status comes in after the rollback (or during the rollback but is queued waiting for the write lock on the job) then it will error out since the stage is no longer running. But this will cause all task updates in that batch to fail. This is not an error we currently handle so it can cause some cascading issues (including causing other jobs to stall out entirely since task updates are effectively dropped completely). > Good catching. There is some gap here, I will take a look and fix it. In this PR, the executor lost handling and task statuses update will not interleave with each other, they should be no lock contentions. All the updates to the TaskManager can only be triggered through QueryStageSchedulerEvents. The events are queued and processed one by one. Either QueryStageSchedulerEvent::ExecutorLost happens first or QueryStageSchedulerEvent::TaskUpdating happens first. If executor lost event is happened first, the QueryStageSchedulerEvent::TaskUpdating from that executor should be ignored. I already have some check in the task_status rpc call, there is some gap in the TaskManager::update_task_status, I will take a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org