Hi Marco,
Yes, that answers my question. I just wanted to be sure, as the API gave
me write access to the immutable data, which means it's up to the
developer to know not to modify the input parameters of these APIs.
Thanks for the response.
Dave.
On 19/01/16 12:25, Marco wrote:
Hello,
RDDs are immutable by design. The reasons, to quote Sean Owen in this
answer (https://www.quora.com/Why-is-a-spark-RDD-immutable), are the
following:
- Immutability rules out a big set of potential problems due to
  updates from multiple threads at once. Immutable data is
  definitely safe to share across processes.
- They're not just immutable but a deterministic function of their
  input. This plus immutability also means the RDD's parts can be
  recreated at any time. This makes caching, sharing and replication
  easy.
- These are significant design wins, at the cost of having to copy
  data rather than mutate it in place. Generally, that's a decent
  tradeoff to make: gaining the fault tolerance and correctness with
  no developer effort is worth spending memory and CPU on, since the
  latter are cheap.
- A corollary: immutable data can as easily live in memory as on
  disk. This makes it reasonable to easily move operations that hit
  disk to instead use data in memory, and again, adding memory is
  much easier than adding I/O bandwidth.
- Of course, an RDD isn't really a collection of data, but just a
  recipe for making data from other data. It is not literally
  computed by materializing every RDD completely. That is, a lot of
  the "copy" can be optimized away too.
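The last point in the quoted answer, that an RDD is a recipe rather than a materialized collection, can be illustrated with a small sketch. This is plain Python, not Spark: `ToyRDD`, `from_list` and `collect` are hypothetical names made up for illustration, and the class only mimics the idea of lazy, deterministic lineage.

```python
from typing import Callable, Iterator, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: a recipe, not a container of data."""

    def __init__(self, compute: Callable[[], Iterator]):
        # Only the recipe is stored; nothing is computed here.
        self._compute = compute

    @staticmethod
    def from_list(items: List) -> "ToyRDD":
        # Freeze the input so the recipe stays a deterministic function of it.
        snapshot = tuple(items)
        return ToyRDD(lambda: iter(snapshot))

    def map(self, f: Callable) -> "ToyRDD":
        parent = self
        # Return a new recipe that wraps the parent; the parent is left untouched.
        return ToyRDD(lambda: (f(x) for x in parent._compute()))

    def collect(self) -> List:
        # Elements stream through the whole chain one at a time,
        # so no intermediate ToyRDD is ever materialized in full.
        return list(self._compute())


base = ToyRDD.from_list([1, 2, 3])
doubled = base.map(lambda x: x * 2)
shifted = doubled.map(lambda x: x + 1)

first = shifted.collect()
second = shifted.collect()  # rebuilt from the lineage, as after a lost partition
```

Both `collect()` calls return the same list because each one replays the same deterministic chain from the frozen input, and `base` is never altered by the `map` calls.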
I hope this answers your question.
Kind regards,
Marco
2016-01-19 13:14 GMT+01:00 ddav <dave.davo...@gmail.com>:
Hi,
Certain APIs (map, mapValues) give the developer access to the data
stored in RDDs.
Am I correct in saying that these APIs must never modify the data, but
must always return a new object with a copy of the data if the data
needs to be updated for the returned RDD?
Thanks,
Dave.
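A minimal sketch of the copy-versus-mutate question above, in plain Python rather than Spark. The list `cached_partition` and the functions `bump_copy` and `bump_in_place` are hypothetical; the list merely stands in for RDD data held in memory as live objects. Whether an in-place mutation is actually visible in a real Spark job depends on factors such as storage level and serialization, so this only illustrates why returning a new object is the safe default.

```python
import copy

# Hypothetical cached partition: a plain list standing in for RDD data in memory.
cached_partition = [{"id": 1, "score": 10}, {"id": 2, "score": 20}]
original = copy.deepcopy(cached_partition)  # reference copy, used only for comparison


def bump_copy(record):
    # Safe: builds a new dict and leaves the input record untouched.
    return {**record, "score": record["score"] + 1}


def bump_in_place(record):
    # Unsafe: mutates the input object that the parent data still references.
    record["score"] += 1
    return record


safe_result = [bump_copy(r) for r in cached_partition]
intact_after_copy = cached_partition == original  # parent data is unchanged

unsafe_result = [bump_in_place(r) for r in cached_partition]
intact_after_mutation = cached_partition == original  # parent data has been altered
```

With `bump_copy` the parent data can still be reused or recomputed and gives the same answer; with `bump_in_place` the "parent" and the "result" are the same objects, so anything else reading the parent would see the changed scores.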
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-immutablility-tp26007.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.