seayoun edited a comment on issue #26975: [SPARK-26975][CORE] Stage retry and executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568625184

> I think I can confirm this is a bug, and it is caused by our adding the `sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt from one TSM succeeds, it marks the partition as completed for all the TSMs targeting the same Stage. Unfortunately, the missing part is that we didn't try to kill the running task attempts when we mark the partitions as completed, so when the running task attempts fail with ExecutorLost, the completed partition result is reverted (which is not necessary in this case).
>
> To me the best solution here would be to kill all the running task attempts for the completed partition in the TSM inside the method `markPartitionCompleted`; this would resolve the issue without any side effect.
>
> Also cc @squito @cloud-fan @Ngone51

Expecting your code review! @jiangxb1987 @squito @cloud-fan @Ngone51
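To make the failure mode in the quoted comment concrete, here is a minimal toy model. It is not Spark's scheduler code: `ToyTaskSetManager`, `Attempt`, `launch`, the `killRunningAttempts` flag, and `ToyDemo` are all hypothetical names invented for illustration, and the executor-loss handling is deliberately simplified to the one behavior under discussion.

```scala
import scala.collection.mutable

// Toy model only. Names echo the discussion but are NOT Spark's real classes.
final case class Attempt(taskId: Long, partition: Int, executorId: String)

class ToyTaskSetManager(numPartitions: Int) {
  val successful: Array[Boolean] = Array.fill(numPartitions)(false)
  val running: mutable.Map[Long, Attempt] = mutable.Map.empty
  val killed: mutable.Set[Long] = mutable.Set.empty

  def launch(taskId: Long, partition: Int, executorId: String): Unit =
    running(taskId) = Attempt(taskId, partition, executorId)

  // Called when an attempt from ANOTHER task set finished this partition.
  // killRunningAttempts is a hypothetical switch modelling the proposed fix.
  def markPartitionCompleted(partition: Int, killRunningAttempts: Boolean): Unit = {
    successful(partition) = true
    if (killRunningAttempts) {
      // Snapshot the ids first so the map is not mutated while iterating.
      val toKill = running.values.filter(_.partition == partition).map(_.taskId).toList
      toKill.foreach { id =>
        running.remove(id)
        killed += id
      }
    }
  }

  // Simplified assumption: an attempt lost with its executor reverts the
  // partition if that partition is currently marked successful.
  def executorLost(executorId: String): Unit = {
    val lost = running.values.filter(_.executorId == executorId).toList
    lost.foreach { a =>
      running.remove(a.taskId)
      if (successful(a.partition)) successful(a.partition) = false
    }
  }
}

object ToyDemo {
  // Returns true if partition 0 is still marked completed after the executor dies.
  def partitionSurvivesExecutorLoss(killRunningAttempts: Boolean): Boolean = {
    val tsm = new ToyTaskSetManager(numPartitions = 1)
    tsm.launch(taskId = 1L, partition = 0, executorId = "exec-1")
    tsm.markPartitionCompleted(0, killRunningAttempts)
    tsm.executorLost("exec-1")
    tsm.successful(0)
  }
}
```

In this sketch, calling `ToyDemo.partitionSurvivesExecutorLoss(killRunningAttempts = false)` leaves the attempt running, so the later executor loss flips the partition back to incomplete. With `killRunningAttempts = true` there is no running attempt left for the executor loss to act on, and the completed state is kept.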