mingmwang commented on PR #184:
URL: https://github.com/apache/arrow-ballista/pull/184#issuecomment-1236511171

   > Had a chance to review more carefully and I'm not sure I understand how 
this will interact with task statuses that are arriving from the executors. I 
haven't traced all the way through the logic but from what I can tell, if a 
stage is rolled back due to an executor being lost and a task status comes in 
after the rollback (or during the rollback but is queued waiting for the write 
lock on the job) then it will error out since the stage is no longer running. 
But this will cause all task updates in that batch to fail. This is not an 
error we currently handle so it can cause some cascading issues (including 
causing other jobs to stall out entirely since task updates are effectively 
dropped completely).
   > 
   
   Good catching. There is some gap here, I will take a look and fix it.
   In this PR, the executor lost handling and task statuses update will not 
interleave with each other, they should be no lock
   contentions.  All the updates to the TaskManager can only be triggered 
through QueryStageSchedulerEvents. The events are queued and processed one by 
one.  Either QueryStageSchedulerEvent::ExecutorLost happens first or 
QueryStageSchedulerEvent::TaskUpdating happens first. 
   If executor lost event is happened first, the 
QueryStageSchedulerEvent::TaskUpdating from that executor should be ignored. I 
already have some check in the task_status rpc call, there is some gap in the 
TaskManager::update_task_status, I will take a look.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to