Are there scenarios where developers have to be aware of how Spark's
fault tolerance works in order to implement correct programs?

It seems that if we want to maintain any sort of mutable state in each
worker across iterations, it can have unintended effects once a
machine goes down.

E.g.,

while (true) {
  rdd.map((row: Array[Double]) => {
    // mutate the last column in place -- this side effect is the concern
    row(numCols - 1) = computeSomething(row)
    row
  }).reduce(...)
}

If a machine fails at some point, I'd imagine that the intermediate state
stored in row(numCols - 1) will be lost when Spark recomputes the lost
partitions from lineage. And unless Spark reruns the whole thing from the
very first iteration, things will get out of sync.

I'd imagine that as long as we avoid mutating state inside worker
tasks, we should be OK, but once we start doing that, things could get
ugly unless we account for how Spark handles fault tolerance?
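For comparison, here is a rough sketch of how I'd write the same loop
without in-place mutation, so each iteration is a pure function of its
input (computeSomething and numCols are the same hypothetical names as
above, and reduce(...) is still left unspecified):

while (true) {
  rdd.map((row: Array[Double]) => {
    // build a new row instead of mutating the input in place,
    // so recomputing a lost partition reproduces the same result
    val updated = row.clone()
    updated(numCols - 1) = computeSomething(row)
    updated
  }).reduce(...)
}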
