Here's a simple Spark example where I call RDD#count twice. The first call triggers 2 stages, but the second call needs only 1 stage. It looks like the output of the first (shuffle) stage is cached. Is that true? Is there any flag I can use to control whether the intermediate stage is cached?
val data = sc.parallelize(1 to 10, 2)
  .map(e => (e % 2, 2))
  .reduceByKey(_ + _, 2)
println(data.count())
println(data.count())
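
For comparison, here is a minimal sketch (assuming a local SparkContext; the app name and master setting are my own choices, not from the original snippet) of how I would keep the intermediate result explicitly with persist()/cache(), in case the stage skipping I'm seeing is implicit behavior I shouldn't rely on:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical setup for a self-contained run; any existing SparkContext works too.
val conf = new SparkConf().setAppName("count-twice").setMaster("local[2]")
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 10, 2)
  .map(e => (e % 2, 2))
  .reduceByKey(_ + _, 2)
  .persist(StorageLevel.MEMORY_ONLY) // same as .cache(); keeps partitions in memory

println(data.count()) // first action: runs both stages and materializes the cache
println(data.count()) // second action: reads the cached partitions directly

With explicit persist() the reuse is under my control (and undone with data.unpersist()), whereas whatever is skipping the first stage on the second count() happens without my asking for it.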