Hi everyone,

I have a question about Spark's shuffle mechanism and the fault-tolerance 
guarantees I should expect from it. Suppose I have a simple two-stage job, 
something like sc.textFile().mapToPair().reduceByKey().saveAsTextFile().
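
For concreteness, here is a minimal sketch of the kind of job I mean, written 
against the Java API; the paths and the (key, 1) pairing are just placeholders 
for my real logic:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class TwoStageJob {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("two-stage-shuffle-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stage 1 (map side of the shuffle): read the input and emit (key, 1) pairs.
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input");
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(line -> new Tuple2<>(line, 1));

        // reduceByKey introduces the shuffle boundary between the two stages.
        // Stage 2 (reduce side): fetch shuffle blocks, combine values per key, write output.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("hdfs:///path/to/output");

        sc.stop();
      }
    }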

My questions are:

  1.  Suppose I'm not using the external shuffle service. I run the job and 
the first stage succeeds. During the second stage, one of the executors is lost 
(in the simplest case, someone kills it with kill -9, and the job would 
otherwise have no problem completing). Should I expect the job to recover and 
complete successfully? My understanding is that the shuffle files lost with 
that executor can be re-computed, so the job should still be able to complete.
  2.  Suppose I'm using the external shuffle service (enabled roughly as in the 
sketch after this list). How does this change the answer to question #1?
  3.  Suppose I'm using the external shuffle service and running in standalone 
mode. The first stage succeeds. During the second stage, I kill both the 
executor and the worker that spawned that executor. Now that the shuffle files 
held by that worker's shuffle service daemon are gone, will the job be able to 
recompute the lost shuffle data? This is the scenario I run into most often: 
my tasks fail because they keep trying to reach the shuffle service instead of 
recomputing the lost shuffle files.
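
For reference, this is roughly how I'm enabling the shuffle service on the 
application side (spark.shuffle.service.enabled, per the configuration docs); 
in standalone mode I also set the same property on the workers so the Worker 
process hosts the shuffle service:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TwoStageJobWithShuffleService {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("two-stage-shuffle-example")
            // Executors register their shuffle output with the external shuffle
            // service, and reducers fetch the blocks from that service rather
            // than from the executor that wrote them.
            .set("spark.shuffle.service.enabled", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... the same textFile -> mapToPair -> reduceByKey -> saveAsTextFile
        // pipeline as in the sketch above ...

        sc.stop();
      }
    }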

Thanks,

-Matt Cheah
