Hi,

I have the following code that uses checkpoint() to checkpoint the heavy operations. Within a single run it works well: the second heavyOpRDD.foreach(println) does not recompute from the beginning.

However, when I re-run the program, the whole RDD computing chain is recomputed from the beginning. I expected it to read from the checkpoint directory instead, since the data from the previous run is still there.

Do I misunderstand how checkpoint works, or is there some configuration that makes it work across runs? Thanks.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.StdIn

object CheckpointTest {

  // Simulates an expensive computation: each element takes ~2 seconds.
  def squareWithHeavyOp(x: Int): Int = {
    Thread.sleep(2000)
    println(s"squareWithHeavyOp $x")
    x * x
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CheckpointTest")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("file:///d:/checkpointDir")

    val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
    val heavyOpRDD = rdd.map(squareWithHeavyOp)

    // Mark the RDD for checkpointing; the data is written when the first action runs.
    heavyOpRDD.checkpoint()
    heavyOpRDD.foreach(println)

    println("Job 0 has been finished, press ENTER to do job 1")
    StdIn.readLine()

    // Within the same run this reads from the checkpoint instead of recomputing.
    heavyOpRDD.foreach(println)
  }
}
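
In case it clarifies what I'm after, here is a minimal sketch of the cross-run behavior I expected, written with saveAsObjectFile/objectFile instead of checkpoint. The savePath value, the object name ObjectFileTest, and the existence check are my own assumptions, not part of the code above: reuse saved results if a previous run left them on disk, otherwise compute and save them.

import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("ObjectFileTest")
    val sc = new SparkContext(conf)

    // Hypothetical save location; plays the role of the checkpoint dir above.
    val savePath = "file:///d:/objectFileDir"
    val fs = new Path(savePath).getFileSystem(sc.hadoopConfiguration)

    val heavyOpRDD =
      if (fs.exists(new Path(savePath))) {
        // A previous run already saved the results: load them instead of recomputing.
        sc.objectFile[Int](savePath)
      } else {
        // First run: do the heavy computation and save it for later runs.
        val computed = sc.parallelize(List(1, 2, 3, 4, 5))
          .map { x => Thread.sleep(2000); x * x }
          .cache() // avoid recomputing when foreach runs after the save
        computed.saveAsObjectFile(savePath)
        computed
      }

    heavyOpRDD.foreach(println)
    sc.stop()
  }
}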
[email protected]