Alessandro was probably referring to transformations whose implementations
depend on actions. For example, sortByKey requires sampling the data to
determine the range-partition boundaries (effectively a histogram), so it
launches a job at construction time.
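To make the distinction concrete, here is a toy sketch (plain Scala, not Spark's actual implementation; the names Dataset, sortedByRange, and evaluations are made up for illustration). A map-style transformation only wraps the computation, while a range-sort-style transformation must sample the data up front, forcing an eager evaluation:

```scala
object LazyVsEager {
  var evaluations = 0 // counts how many times the source data is computed

  final case class Dataset(compute: () => Seq[Int]) {
    // Lazy: wraps the computation without running it, like RDD.map.
    def map(f: Int => Int): Dataset =
      Dataset(() => compute().map(f))

    // Eager step: picking a split point requires sampling the data now,
    // analogous to sortByKey sampling for range-partition boundaries.
    def sortedByRange: (Int, Dataset) = {
      val sample = compute() // runs the pipeline immediately
      val boundary = sample.sorted.apply(sample.size / 2)
      (boundary, Dataset(() => compute().sorted))
    }

    def collect(): Seq[Int] = compute()
  }

  def main(args: Array[String]): Unit = {
    val base = Dataset { () => evaluations += 1; Seq(3, 1, 2) }
    val mapped = base.map(_ * 10)
    assert(evaluations == 0, "map triggered no evaluation")

    val (boundary, sorted) = mapped.sortedByRange
    assert(evaluations == 1, "sampling forced one evaluation")
    assert(boundary == 20)
    assert(sorted.collect() == Seq(10, 20, 30))
    println("ok")
  }
}
```

In real Spark the eager work happens inside sortByKey's RangePartitioner, which runs a sampling job when the RDD is constructed, which is why building that RDD is noticeably more expensive than building a map or flatMap result.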

There is a ticket tracking this:
https://issues.apache.org/jira/browse/SPARK-2992

On Thu, Dec 18, 2014 at 11:52 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
> Could you provide an example?  These operations are lazy, in the sense
> that they don’t trigger Spark jobs:
>
>
> scala> val a = sc.parallelize(1 to 10000, 1).mapPartitions{ x =>
> println("computed a!"); x}
> a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions
> at <console>:18
>
> scala> a.union(a)
> res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at <console>:22
>
> scala> a.map(x => (x, x)).groupByKey()
> res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at
> groupByKey at <console>:22
>
> scala> a.map(x => (x, x)).groupByKey().count()
> computed a!
> res6: Long = 10000
>
>
> On December 18, 2014 at 1:04:54 AM, Alessandro Baretta (
> alexbare...@gmail.com) wrote:
>
> All,
>
> I noticed that while some operations that return RDDs are very cheap, such
> as map and flatMap, some are quite expensive, such as union and groupByKey.
> I'm referring here to the cost of constructing the RDD scala value, not the
> cost of collecting the values contained in the RDD. This does not match my
> understanding that RDD transformations only set up a computation without
> actually running it. Oh, Spark developers, can you please provide some
> clarity?
>
> Alex
>