Dear Spark developers,
I am trying to understand how the Spark UI displays operations on a cached RDD.
For example, the following code caches an RDD:
>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count
The Jobs tab shows me that the RDD is evaluated:
: 1  count at <console>:24         2015/10/09 16:15:43  0.4 s  1/1
: 0  zipWithIndex at <console>:21  2015/10/09 16:15:38  0.6 s  1/1
And I can observe this RDD in the Storage tab of the Spark UI:
: ZippedWithIndexRDD Memory Deserialized 1x Replicated
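As a sanity check, the caching can also be verified from the shell itself; a minimal sketch (exact printed output elided):
>> rdd.getStorageLevel    // should report an in-memory, deserialized storage level
>> sc.getPersistentRDDs   // a Map from RDD id to the persisted ZippedWithIndexRDD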
Then I want to perform an operation on the cached RDD, so I run the following code:
>> val g = rdd.groupByKey()
>> g.count
The Jobs tab shows me a new Job:
: 2 count at <console>:26
Inside this Job there are two stages:
: 3  count at <console>:26         2015/10/09 16:16:18  0.2 s  5/5
: 2  zipWithIndex at <console>:21
This suggests that zipWithIndex is executed again, which does not seem
reasonable: the RDD is cached, and zipWithIndex was already executed
previously.
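One way to check whether the partitions are actually recomputed (as opposed to merely being displayed in the UI) would be to add a side effect before the cache; a rough sketch, assuming a local-mode shell where executor printlns reach the console:
>> val base = sc.parallelize(1 to 5, 5).map { x => println(s"computing $x"); x }
>> val rdd = base.zipWithIndex.cache  // zipWithIndex itself runs a job to compute partition offsets
>> rdd.count                          // materializes the cache, printing "computing ..."
>> rdd.groupByKey().count()           // should print nothing: the cached partitions are reused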
Could you explain why, when I perform a transformation followed by an action
on a cached RDD, the last transformation in the lineage of the cached RDD is
shown as executed in the Spark UI?
Best regards,
Alexander