The reason is that some operators get pipelined into a single stage.
For example, rdd.map(XX).filter(YY) executes in a single stage, since
no data movement is needed between these operations.

If you call toDebugString on the final RDD, it will give you some
information about the exact lineage. In Spark 1.1 this will return
information about stage boundaries as well.
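
As a rough illustration, here is a minimal sketch you could paste into
spark-shell. It assumes sc is the SparkContext the shell already provides;
the names pairs, summed and lineage are only illustrative and do not come
from your job.

```scala
// Assumes a spark-shell session, where `sc` is already defined.

// map and filter are narrow transformations: each output partition depends
// on exactly one input partition, so they are pipelined into one stage.
val pairs = sc.parallelize(0 until 100)
  .map(x => (x % 10, x))
  .filter { case (_, v) => v % 2 == 0 }

// reduceByKey needs all values for a key in one place, which requires a
// shuffle and therefore starts a new stage.
val summed = pairs.reduceByKey(_ + _)

// toDebugString returns a textual description of the RDD lineage.
val lineage = summed.toDebugString
println(lineage)
```

In the printed lineage, the map and filter steps should appear together
below a ShuffledRDD entry, which is where the reduceByKey shuffle sits.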


On Wed, Aug 20, 2014 at 4:22 AM, Grzegorz Białek <
grzegorz.bia...@codilime.com> wrote:

> Hi,
>
> I am wondering why in web UI some stages (like join, filter) are not
> visible. For example this code:
>
> val simple = sc.parallelize(Array.range(0,100))
> val simple2 = sc.parallelize(Array.range(0,100))
>
>   val toJoin = simple.map(x => (x, x.toString + x.toString))
>   val rdd = simple2
>     .map(x => (scala.util.Random.nextInt(100), x))
>     .join(toJoin)
>     .map { case (r, (x, s)) => (r, x)}
>     .reduceByKey(_ + _)
>     .sortByKey()
>     .cache()
>   rdd.saveAsTextFile("output/1")
>
>   val rdd2 = toJoin
>     .groupBy{ case (x, _) => x}
>     .filter{ case (x, _) => x < 10}
>   rdd2.saveAsTextFile("output/2")
>
>   println(rdd2.join(toJoin).count())
>
> in UI doesn't show join and filter stages and moreover it shows sortByKey
> and reduceByKey twice.
> Could anyone explain how it works?
>
> Thanks,
> Grzegorz
>
