Dear Spark developers,
I am trying to understand how the Spark UI displays operations on a cached RDD.
For example, the following code caches an RDD:
>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count
The Jobs tab shows me that the RDD is evaluated:
: 1  count at <console>:24         2015/10/09 16:15:43  0.4 s  1/1
: 0  zipWithIndex at <console>:21  2015/10/09 16:15:38  0.6 s  1/1
And I can observe this RDD in the Storage tab of the Spark UI:
: ZippedWithIndexRDD Memory Deserialized 1x Replicated
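As a sanity check, the caching can also be verified from the shell itself; a minimal sketch (exact printed output elided):
>> rdd.getStorageLevel    // should report an in-memory, deserialized storage level
>> sc.getPersistentRDDs   // a Map from RDD id to the persisted ZippedWithIndexRDD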
Then I want to perform an operation on the cached RDD, so I run the following code:
>> val g = rdd.groupByKey()
>> g.count
The Jobs tab shows me a new Job:
: 2 count at <console>:26
Inside this Job there are two stages:
: 3  count at <console>:26         2015/10/09 16:16:18  0.2 s  5/5
: 2  zipWithIndex at <console>:21
This suggests that zipWithIndex is executed again, which does not seem
reasonable: the RDD is cached, and zipWithIndex was already executed
previously.
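One way to check whether the partitions are actually recomputed (as opposed to merely being displayed in the UI) would be to add a side effect before the cache; a rough sketch, assuming a local-mode shell where executor printlns reach the console:
>> val base = sc.parallelize(1 to 5, 5).map { x => println(s"computing $x"); x }
>> val rdd = base.zipWithIndex.cache  // zipWithIndex itself runs a job to compute partition offsets
>> rdd.count                          // materializes the cache, printing "computing ..."
>> rdd.groupByKey().count()           // should print nothing: the cached partitions are reused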
Could you explain why, when I perform a transformation followed by an action
on a cached RDD, the last transformation in the lineage of the cached RDD is
shown as executed in the Spark UI?
Best regards,
Alexander