The reason is that some operators get pipelined into a single stage. For example, rdd.map(XX).filter(YY) executes in a single stage, since no data movement is needed between these operations.
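As a rough illustration of the pipelining described above, here is a minimal sketch. The object name PipeliningSketch, the helper reducedKeys, the local[2] master and the app name are all made up for this example and do not come from the original code. The map and filter need no shuffle, so they should end up in one stage, while reduceByKey needs a shuffle and should start a new one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required on Spark versions before 1.3
import org.apache.spark.rdd.RDD

// Hypothetical object and method names, used only for this sketch.
object PipeliningSketch {

  def reducedKeys(sc: SparkContext): RDD[(Int, Int)] = {
    val pairs = sc.parallelize(0 until 100).map(x => (x % 10, x))

    // map and filter are narrow transformations: no data movement between
    // them, so Spark can run both in the same stage.
    val narrow = pairs
      .map { case (k, v) => (k, v * 2) }
      .filter { case (_, v) => v % 4 == 0 }

    // reduceByKey repartitions the data by key (a shuffle), so the work
    // after it belongs to a separate stage.
    narrow.reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("pipelining-sketch")
    val sc = new SparkContext(conf)
    try {
      val reduced = reducedKeys(sc)
      // Print the lineage of the final RDD.
      println(reduced.toDebugString)
      // Run an action so that the job and its stages show up in the web UI.
      reduced.count()
    } finally {
      sc.stop()
    }
  }
}
```

When run this way, the web UI would be expected to list a stage for the work before the shuffle and a stage for the work after it, with no separate entry for the filter.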
If you call toDebugString on the final RDD, it will give you some information about the exact lineage. In Spark 1.1 this will return information about stage boundaries as well.

On Wed, Aug 20, 2014 at 4:22 AM, Grzegorz Białek <grzegorz.bia...@codilime.com> wrote:

> Hi,
>
> I am wondering why in web UI some stages (like join, filter) are not
> visible. For example this code:
>
> val simple = sc.parallelize(Array.range(0,100))
> val simple2 = sc.parallelize(Array.range(0,100))
>
> val toJoin = simple.map(x => (x, x.toString + x.toString))
> val rdd = simple2
>   .map(x => (scala.util.Random.nextInt(100), x))
>   .join(toJoin)
>   .map { case (r, (x, s)) => (r, x)}
>   .reduceByKey(_ + _)
>   .sortByKey()
>   .cache()
> rdd.saveAsTextFile("output/1")
>
> val rdd2 = toJoin
>   .groupBy{ case (x, _) => x}
>   .filter{ case (x, _) => x < 10}
> rdd2.saveAsTextFile("output/2")
>
> println(rdd2.join(toJoin).count())
>
> in UI doesn't show join and filter stages and moreover it shows sortByKey
> and reduceByKey twice.
> Could anyone explain how it works?
>
> Thanks,
> Grzegorz