Hi everyone,

I have a question about Spark's shuffle mechanism and the fault tolerance I should expect. Suppose I have a simple two-stage job, something like sc.textFile().mapToPair().reduceByKey().saveAsTextFile().
My questions:

1. Suppose I'm not using the external shuffle service. The first stage succeeds, and during the second stage one of the executors is lost (simplest case: someone runs kill -9 on it; the job would otherwise have no problem completing). Should I expect the job to recover and complete successfully? My understanding is that the shuffle files lost with that executor can be recomputed, so the job should still finish.

2. Suppose I am using the external shuffle service. How does that change the answer to #1?

3. Suppose I'm using the external shuffle service in standalone mode. The first stage succeeds, and during the second stage I kill both the executor and the worker that spawned it. Now that the shuffle files held by that worker's shuffle service daemon are gone, will the job recompute the lost shuffle data? This is the scenario I run into most often: my tasks fail because they keep trying to reach the dead shuffle service instead of recomputing the lost shuffle files.

Thanks,

-Matt Cheah
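P.S. For reference, I enable the shuffle service through the standard properties. This is a sketch of the relevant lines in my spark-defaults.conf (7337 is the default shuffle service port; the exact values here are illustrative):

```
spark.shuffle.service.enabled  true
spark.shuffle.service.port     7337
```

My understanding is that in standalone mode the shuffle service runs inside the Worker process itself, which is why killing the worker in scenario #3 takes the shuffle-file server down with it.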