liupc edited a comment on issue #23647: [SPARK-26712]Support multi directories for executor shuffle info recovery in yarn shuffle service URL: https://github.com/apache/spark/pull/23647#issuecomment-457995661 @vanzin @HyukjinKwon We once ran into a similar problem on Spark 2.0.1, before https://github.com/apache/spark/pull/14162 was introduced. A broken disk holding the recovery path caused the NM to start without `YarnShuffleService`, so executors scheduled on that node were unable to register with `YarnShuffleService`, which ultimately caused the application to fail.

Even though we now have https://github.com/apache/spark/pull/14162 and the application-level blacklist, I think this PR still makes sense for long-running applications (for instance, Spark ThriftServer or Spark Streaming applications). For these kinds of applications this case is not uncommon, since they run for a long time. And even if we assume Spark would recover with the application-level blacklist enabled, it would still waste resources, because shuffles would keep failing on that node; not to mention there is a chance the node is never blacklisted, which would cause job failures. Such resource waste or job failures are unacceptable for ThriftServer or streaming applications. I hope this explanation is convincing.