I think 
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence 
might shed some light on the behaviour you’re seeing.

Mark

From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermedate stage will be cached automatically ?

Here's a simple Spark example in which I call RDD#count twice. The first call 
runs 2 stages, but the second needs only 1 stage. It seems the output of the 
first stage is cached. Is that true? Is there a flag to control whether the 
intermediate stage is cached?

    val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
    println(data.count())
    println(data.count())
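For what it's worth, the speedup on the second count() most likely comes from Spark reusing the first stage's shuffle (map) output, which is kept around automatically and is separate from RDD persistence. If you want explicit control, the mechanism is cache()/persist(). A minimal sketch, assuming an existing SparkContext named `sc` as in the snippet above:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch only: assumes a running SparkContext `sc`.
val data = sc.parallelize(1 to 10, 2)
  .map(e => (e % 2, 2))
  .reduceByKey(_ + _, 2)
  .persist(StorageLevel.MEMORY_ONLY) // same as .cache()

println(data.count()) // first action computes the RDD and populates the cache
println(data.count()) // second action reads cached partitions, skipping recomputation

data.unpersist() // release the cached partitions when no longer needed
```

The storage level is a choice, not a requirement; see the persistence section of the programming guide linked above for the other levels (e.g. MEMORY_AND_DISK).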
