liupc edited a comment on issue #23647: [SPARK-26712]Support multi directories 
for executor shuffle info recovery in yarn shuffle service
URL: https://github.com/apache/spark/pull/23647#issuecomment-457995661
 
 
   @vanzin @HyukjinKwon 
   We once ran into a similar problem on Spark 2.0.1, before 
https://github.com/apache/spark/pull/14162 was introduced. A broken disk under 
the recovery path caused the NM to start without `YarnShuffleService`, so 
executors scheduled on that node were unable to register with 
`YarnShuffleService`, which finally caused the application to fail.
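
   To illustrate the idea of falling back across several recovery directories, here is a minimal sketch. It is not the code in this PR: `first_usable_dir` and its argument are hypothetical names, and Python is used only for brevity (the real service is written in Java). It assumes a directory counts as usable if it can be created and a small probe file can be written into it.

```python
import os
import tempfile
from typing import Iterable, Optional


def first_usable_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate directory that can be created and written to.

    Hypothetical helper for illustration only; the name and behaviour are
    assumptions, not taken from YarnShuffleService or from this PR.
    """
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
            # Probe with a real write: a failed disk may still list the directory.
            with tempfile.TemporaryFile(dir=path) as probe:
                probe.write(b"ok")
                probe.flush()
            return path
        except OSError:
            # This location looks unusable; try the next candidate.
            continue
    return None
```

   With a single configured directory, one bad disk leaves nothing to fall back to (the function returns `None`); with several, the service could still pick a healthy one and start normally.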
   
   Even though we now have https://github.com/apache/spark/pull/14162 and the 
application-level blacklist, I think this PR still makes sense for long-running 
applications (for instance, Spark ThriftServer applications or Spark 
Streaming applications).
   For these types of applications, this case might not be uncommon, because 
they run for a long time. Even if we suppose Spark would recover 
with the application-level blacklist enabled, it would still waste resources, 
because shuffle will always fail on the node, not to mention that there is a 
chance that the node is not blacklisted and causes a job failure.
   
   I hope this explanation is convincing.
