jiangxb1987 edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry 
and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568607140
 
 
   I think I can confirm this is a bug. It was introduced when we added the 
`sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from 
one TSM succeeds, the partition is marked as completed in all the TSMs 
targeting the same stage. Unfortunately, we did not also kill the running 
task attempts when marking the partition as completed, so when those running 
attempts later fail with ExecutorLost, the completed partition result is 
reverted (which is not necessary in this case).
   
   To me, the best solution here would be to kill all the running task 
attempts in the TSM inside the method `markPartitionCompleted`; this would 
resolve the issue without any side effects.
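   
   To illustrate the interaction, here is a minimal toy model in Scala. It is 
not Spark's real scheduler code: `ToyTaskSetManager`, `launch`, `executorLost`, 
the `killRunning` flag and `ToyDemo` are all hypothetical names made up for 
this sketch, and "kill" is modelled simply as dropping the attempt from the 
running set. It only shows why a still-running attempt can undo a partition 
that another TSM already completed, and how killing it at marking time would 
avoid that.

```scala
import scala.collection.mutable

// Toy stand-in for a TSM; NOT the real org.apache.spark.scheduler.TaskSetManager.
final class ToyTaskSetManager(val name: String, numPartitions: Int) {
  private val completed = Array.fill(numPartitions)(false)
  // partition index -> executors currently running an attempt for that partition
  private val running = mutable.Map.empty[Int, mutable.Set[String]]

  def launch(partition: Int, executor: String): Unit =
    running.getOrElseUpdate(partition, mutable.Set.empty[String]) += executor

  // Invoked when an attempt from another TSM of the same stage succeeded.
  // killRunning = true models the proposed fix (hypothetical flag).
  def markPartitionCompleted(partition: Int, killRunning: Boolean): Unit = {
    completed(partition) = true
    if (killRunning) running.remove(partition) // "kill" = forget the attempts
  }

  // Models the ExecutorLost path described above: an attempt that was running
  // on the lost executor fails and its partition is reverted to not completed.
  def executorLost(executor: String): Unit =
    for ((partition, execs) <- running.toSeq if execs.remove(executor)) {
      completed(partition) = false
    }

  def isCompleted(partition: Int): Boolean = completed(partition)
}

object ToyDemo {
  // Returns whether partition 0 is still completed after the executor is lost.
  def run(killRunning: Boolean): Boolean = {
    val tsm = new ToyTaskSetManager("stage-attempt-0", numPartitions = 1)
    tsm.launch(partition = 0, executor = "exec-1")
    // Another TSM for the same stage finished partition 0 first.
    tsm.markPartitionCompleted(partition = 0, killRunning = killRunning)
    tsm.executorLost("exec-1")
    tsm.isCompleted(0)
  }

  def main(args: Array[String]): Unit = {
    println(s"completed without kill: ${run(killRunning = false)}")
    println(s"completed with kill:    ${run(killRunning = true)}")
  }
}
```

   In this model, `ToyDemo.run(killRunning = false)` ends with the partition 
reverted, while `ToyDemo.run(killRunning = true)` keeps it completed, because 
no attempt is left to fail when the executor goes away.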
   
   Also cc @squito @cloud-fan @Ngone51 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

