GitHub user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19145

Did you enable RM or NM recovery? Can you please clarify? Normally, assuming there are 2 containers running on this NM, the RM will detect the NM failure after 10 minutes and relaunch the 2 lost containers on other NMs, so the total number of executors should stay the same. Things are different if NM recovery is enabled, because then the NM failure does not lead to container loss.
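
To make the question concrete, here is a minimal sketch (not part of Spark or YARN) that reads the text of a yarn-site.xml and reports the settings relevant to this discussion. The helper name `recovery_settings` and the sample XML are hypothetical; the property names and defaults are the ones I recall from yarn-default.xml, so please verify them against your Hadoop version.

```python
import xml.etree.ElementTree as ET

# Property names and defaults as recalled from yarn-default.xml.
# Assumption: verify these against the Hadoop version you run.
DEFAULTS = {
    "yarn.resourcemanager.recovery.enabled": "false",
    "yarn.nodemanager.recovery.enabled": "false",
    "yarn.nm.liveness-monitor.expiry-interval-ms": "600000",
}


def recovery_settings(yarn_site_xml):
    """Hypothetical helper: extract recovery-related settings from yarn-site.xml text."""
    root = ET.fromstring(yarn_site_xml)
    found = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in DEFAULTS:
            found[name] = value
    # Explicit settings override the defaults.
    merged = {**DEFAULTS, **found}
    return {
        "rm_recovery": merged["yarn.resourcemanager.recovery.enabled"].lower() == "true",
        "nm_recovery": merged["yarn.nodemanager.recovery.enabled"].lower() == "true",
        # Time before the RM treats a silent NM as lost, in minutes.
        "nm_expiry_minutes": int(merged["yarn.nm.liveness-monitor.expiry-interval-ms"]) / 60000,
    }


# Made-up sample configuration with only NM recovery switched on.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(recovery_settings(SAMPLE))
```

With the sample above, NM recovery is reported as enabled and the expiry interval falls back to the default, which matches the 10 minutes mentioned in the comment.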