liupc commented on issue #23647: [SPARK-26712]Support multi directories for 
executor shuffle info recovery in yarn shuffle service
URL: https://github.com/apache/spark/pull/23647#issuecomment-457995661
 
 
   @vanzin @HyukjinKwon 
   We once ran into a similar problem on Spark 2.0.1, before 
https://github.com/apache/spark/pull/14162 was introduced. A broken disk 
holding the recovery path caused the NM to start without `YarnShuffleService`, 
so executors scheduled on that node were unable to register with 
`YarnShuffleService`, which ultimately caused the application to fail.
   Even though we now have https://github.com/apache/spark/pull/14162 and the 
application-level blacklist, I think this PR still makes sense for long-running 
applications (for instance, Spark ThriftServer or Spark Streaming applications).
   For these types of applications, this case is not uncommon, since they keep 
running for a long time.
   Even if we assume Spark would recover with the application-level blacklist 
enabled, it still wastes resources, because shuffle will always fail on that 
node, not to mention that the node may not be blacklisted at all, which can 
cause job failure.
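   For illustration only, here is a minimal sketch (not the actual patch) of the fallback idea this PR argues for: when setting up the shuffle recovery state, try each configured local directory and skip any whose disk is unusable. The directory paths and the helper `chooseUsableRecoveryDir` are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: pick the first usable directory from a list of
// candidate recovery dirs instead of relying on a single, possibly broken disk.
public class RecoveryDirFallback {

  // Returns the first candidate directory that exists (or can be created)
  // and is writable; throws if every candidate disk is unusable.
  static File chooseUsableRecoveryDir(List<String> candidates) throws IOException {
    for (String path : candidates) {
      File dir = new File(path);
      if ((dir.isDirectory() || dir.mkdirs()) && dir.canWrite()) {
        return dir;
      }
      // Disk broken or read-only: fall through to the next candidate.
    }
    throw new IOException("No usable recovery directory among " + candidates);
  }

  public static void main(String[] args) throws IOException {
    // Example with two hypothetical NM local dirs: if the first disk is broken,
    // the recovery state (e.g. the registered-executors DB) lands on the second.
    File recoveryDir = chooseUsableRecoveryDir(
        Arrays.asList("/data1/nm-recovery", "/data2/nm-recovery"));
    System.out.println("Using recovery dir: " + recoveryDir);
  }
}
```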
   I hope this explanation is convincing.
