If you invoke this, you will get at-least-once semantics on failure.
For instance, if a machine dies in the middle of executing the foreach
for a single partition, that partition will be re-executed on another
machine. The task could even complete fully on one machine, but the
machine dies immediately before reporting the result back to the
driver, in which case it is still re-run.
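
The retry scenario above can be simulated outside Spark (this is an illustrative sketch, not Spark internals): a non-idempotent side effect inside a re-executed task simply happens again.

```python
# Illustrative sketch: a lost task result forces a retry, and a
# non-idempotent side effect then fires twice per record.
calls = []

def side_effect(record):
    # Stand-in for an external side effect, e.g. an HTTP POST
    # or a counter increment.
    calls.append(record)

def run_task(partition):
    for record in partition:
        side_effect(record)

run_task([1, 2, 3])  # first attempt completes...
run_task([1, 2, 3])  # ...but the result is lost, so the task is retried
print(len(calls))    # 6: each record's side effect ran twice
```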

This means you need to make sure the side-effects are idempotent, or
use some transactional mechanism. Spark's own output operations, such
as saving to Hadoop, use such mechanisms; in the Hadoop case, the
OutputCommitter classes.
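
The commit pattern referred to here can be sketched roughly as follows (a hedged simplification; the real Hadoop OutputCommitter protocol is more involved): each task attempt writes to its own staging area, and only one attempt's output is ever promoted to the final location.

```python
# Hedged sketch of a commit protocol: attempt-private staging plus a
# single promotion step, so a retried task cannot double-write output.
committed = {}
staging = {}

def task_attempt(attempt_id, records):
    # Each attempt writes only to its own staging location.
    staging[attempt_id] = list(records)

def commit(attempt_id):
    # Promote exactly one attempt's output; later commits are no-ops.
    if "output" not in committed:
        committed["output"] = staging[attempt_id]

task_attempt("attempt_0", [10, 20, 30])  # attempt dies before committing
task_attempt("attempt_1", [10, 20, 30])  # retry on another machine
commit("attempt_1")
commit("attempt_0")                      # ignored: already committed
print(committed["output"])               # [10, 20, 30], written once
```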

- Patrick

On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
> Hi Spark group,
>
> We haven't been able to find clear descriptions of how Spark handles the
> resiliency of RDDs in relationship to executing actions with side-effects.
> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect
> for each element in the RDD. If a partition goes down -- the resiliency
> rebuilds the data, but did it keep track of how far it got in the
> partition's set of data, or will it start from the beginning again? So will
> it do at-least-once execution of foreach closures or at-most-once?
>
> thanks,
> Michal

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
