Hi. "All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset." See section "RDD Operations" in https://spark.apache.org/docs/1.2.0/programming-guide.html
Thus, neither myrdd nor myrdd2 exists until you call count. What is stored is only the recipe for how to create myrdd and myrdd2, so yes, this is safe. When you run myrdd2.count, both RDDs are computed, myrdd2 is counted, and the count is printed. After the operation, both RDDs are "destroyed" again. If you run myrdd2.count again, both myrdd and myrdd2 are computed again.

If your transformation is expensive, you may want to keep the data around; for that you must call .persist() or .cache() on the RDD you want to reuse.

Regards,
Gylfi.
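P.S. To make the recompute-versus-cache behaviour concrete, here is a toy model in plain Python. It is not Spark's API: LazyDataset, load_base and expensive are names made up for this sketch, and the caching here is far simpler than what .persist() really does. It only assumes that an RDD behaves like a stored recipe that is re-run on every action unless you ask to keep the result.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark): holds a recipe, not data."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the elements
        self._keep = False       # set by cache()
        self._cached = None      # filled on first computation if _keep is True

    def map(self, fn):
        # Transformation: only builds a new recipe on top of this one.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Only marks the dataset; nothing is computed here either.
        self._keep = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._keep:
            self._cached = data
        return data

    def count(self):
        # Action: this is what actually runs the recipe.
        return len(self._materialize())


# Counters that record how much work was really done.
reads = {"base": 0, "map": 0}

def load_base():
    reads["base"] += 1
    return [1, 2, 3, 4]

def expensive(x):
    reads["map"] += 1
    return x * 10

myrdd = LazyDataset(load_base)
myrdd2 = myrdd.map(expensive)
reads_before_action = dict(reads)  # nothing has run yet

first = myrdd2.count()
second = myrdd2.count()            # recomputes myrdd and myrdd2 from scratch
reads_uncached = dict(reads)

cached = myrdd.map(expensive).cache()
cached.count()                     # computes once and keeps the result
cached.count()                     # served from the kept result
reads_cached = dict(reads)

print(reads_before_action, reads_uncached, reads_cached)
```

The counters show the point of the post: holding references to myrdd and myrdd2 costs nothing, each count re-runs the whole chain, and marking the dataset with cache() makes the second count free.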