Could you provide an example? These operations are lazy, in the sense that
they don’t trigger Spark jobs:
scala> val a = sc.parallelize(1 to 10000, 1).mapPartitions { x => println("computed a!"); x }
a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions at <console>:18
scala> a.union(a)
res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at <console>:22
scala> a.map(x => (x, x)).groupByKey()
res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at groupByKey at <console>:22
scala> a.map(x => (x, x)).groupByKey().count()
computed a!
res6: Long = 10000
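The same lazy/eager split can be sketched without Spark at all, using Scala's LazyList as a stand-in: the "transformations" (map, ++) only describe a computation, and nothing runs until an "action" (here, sum) forces the values. This is an analogy, not Spark API; the counter and names are illustrative.

```scala
// Spark-free sketch of lazy transformations vs. an eager action,
// using LazyList (Scala 2.13+), which is lazy and memoizing.
object LazySketch {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // "Transformation": describes work, runs nothing yet.
    val a = LazyList.from(1).take(5).map { x => evaluated += 1; x }

    // Analogous to union: still purely descriptive.
    val combined = a ++ a
    println(s"after 'transformations': evaluated = $evaluated") // 0

    // "Action": forces evaluation. Memoization means each element
    // of `a` is computed only once even though it appears twice.
    val total = combined.sum
    println(s"after 'action': evaluated = $evaluated, sum = $total") // 5, 30
  }
}
```

As in the REPL transcript above, the side effect fires only when an action runs, which is why constructing the RDDs themselves prints nothing.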
On December 18, 2014 at 1:04:54 AM, Alessandro Baretta ([email protected]) wrote:
All,
I noticed that while some operations that return RDDs are very cheap, such
as map and flatMap, some are quite expensive, such as union and groupByKey.
I'm referring here to the cost of constructing the RDD Scala value, not the
cost of collecting the values contained in the RDD. This does not match my
understanding that RDD transformations only set up a computation without
actually running it. Oh, Spark developers, can you please provide some
clarity?
Alex