Hi,

I have the following code that uses checkpoint() to checkpoint the heavy operations. Within a single run it works well: the second heavyOpRDD.foreach(println) does not recompute from the beginning.

However, when I re-run the program, the whole RDD computing chain is recomputed from the beginning. I expected it to read from the checkpoint directory instead, since the data from the previous run is still there.

Do I misunderstand how checkpoint works, or is there some configuration that makes it work across runs? Thanks.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.StdIn

object CheckpointTest {

  // Simulates an expensive computation: each element takes ~2 seconds.
  def squareWithHeavyOp(x: Int): Int = {
    Thread.sleep(2000)
    println(s"squareWithHeavyOp $x")
    x * x
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CheckpointTest")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("file:///d:/checkpointDir")

    val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
    val heavyOpRDD = rdd.map(squareWithHeavyOp)

    // Mark the RDD for checkpointing; the data is written when the first action runs.
    heavyOpRDD.checkpoint()
    heavyOpRDD.foreach(println)

    println("Job 0 has been finished, press ENTER to do job 1")
    StdIn.readLine()

    // Within the same run this reads from the checkpoint instead of recomputing.
    heavyOpRDD.foreach(println)
  }
}
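
In case it clarifies what I'm after, here is a minimal sketch of the cross-run behavior I expected, written with saveAsObjectFile/objectFile instead of checkpoint. The savePath value, the object name ObjectFileTest, and the existence check are my own assumptions, not part of the code above: reuse saved results if a previous run left them on disk, otherwise compute and save them.

import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("ObjectFileTest")
    val sc = new SparkContext(conf)

    // Hypothetical save location; plays the role of the checkpoint dir above.
    val savePath = "file:///d:/objectFileDir"
    val fs = new Path(savePath).getFileSystem(sc.hadoopConfiguration)

    val heavyOpRDD =
      if (fs.exists(new Path(savePath))) {
        // A previous run already saved the results: load them instead of recomputing.
        sc.objectFile[Int](savePath)
      } else {
        // First run: do the heavy computation and save it for later runs.
        val computed = sc.parallelize(List(1, 2, 3, 4, 5))
          .map { x => Thread.sleep(2000); x * x }
          .cache() // avoid recomputing when foreach runs after the save
        computed.saveAsObjectFile(savePath)
        computed
      }

    heavyOpRDD.foreach(println)
    sc.stop()
  }
}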
[email protected]