In my experience running Spark applications (mostly Spark SQL), when there is a
complete node failure in my cluster, jobs that have shuffle blocks on that
node fail outright after 4 task retries. It seems that data lineage
did not recover them. What's more, our applications use multiple SQL statements for
Yes, the shuffle service was already started in each NodeManager. What I mean
by a node failing is that the machine itself is down: every service on that
machine, including the NodeManager process, is down. So in that case the
shuffle service is no longer helpful.
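For reference, a minimal sketch of the settings involved here, assuming Spark defaults (the application class and jar names are placeholders). The "4 task retries" above corresponds to the default `spark.task.maxFailures=4`; enabling the external shuffle service does not help in this scenario because the service runs inside the NodeManager that died with the machine:

```shell
# Sketch only: defaults shown explicitly for illustration.
# Losing the whole node still loses its shuffle files, since the
# external shuffle service dies together with the NodeManager.
spark-submit \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.task.maxFailures=4 \
  --conf spark.stage.maxConsecutiveAttempts=4 \
  --class com.example.MyApp myapp.jar
```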
In my use case, I run Spark in yarn-client mode with dynamicAllocation
enabled. When a node shuts down abnormally, my Spark application
fails because tasks fail to fetch shuffle blocks from that node 4 times.
Why does Spark not leverage Alluxio (a distributed in-memory filesystem) to write
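One partial mitigation I would try (a sketch under the same yarn-client setup; it helps with transient NodeManager restarts but cannot recover shuffle files lost with a dead machine) is raising the shuffle fetch retry settings. The class and jar names below are placeholders:

```shell
# Sketch: give fetchers more attempts and a longer wait before the
# FetchFailed escalates into task/stage failure. This only helps if
# the node (or its shuffle service) comes back; it does not restore
# shuffle data lost with a permanently dead machine.
spark-submit \
  --master yarn --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.shuffle.io.maxRetries=6 \
  --conf spark.shuffle.io.retryWait=10s \
  --class com.example.MyApp myapp.jar
```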