What’s your Spark version?

 

From: wyphao.2007 [mailto:wyphao.2...@163.com]
Sent: 26 November 2015 10:04
To: user
Cc: dev@spark.apache.org
Subject: Spark checkpoint problem

I am testing checkpointing to understand how it works. My code is as follows:

scala> val data = sc.parallelize(List("a", "b", "c"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:15

scala> sc.setCheckpointDir("/tmp/checkpoint")
15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be non-local if Spark is running on a cluster: /tmp/checkpoint1

scala> data.checkpoint

scala> val temp = data.map(item => (item, 1))
temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:17

scala> temp.checkpoint

scala> temp.count

But I found that only the temp RDD is checkpointed in the /tmp/checkpoint directory; the data RDD is not checkpointed! I looked at the doCheckpoint function in the org.apache.spark.rdd.RDD class:

 

  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

 

From the code above, only the last RDD (in my case, temp) will be checkpointed, because the recursion into dependencies only happens when checkpointData is not defined on the current RDD. My question: is this deliberately designed, or is it a bug?
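To make sure I am reading the recursion correctly, here is a minimal toy model of it in Python (ToyRDD, do_checkpoint, and the single-parent chain are my own simplifications, not Spark APIs). It reproduces what I observed: starting doCheckpoint from the final RDD stops as soon as an RDD with checkpoint data is found, so parents never get checkpointed.

```python
# Toy model (NOT Spark itself) of RDD.doCheckpoint above.
class ToyRDD:
    def __init__(self, name, parent=None, marked=False):
        self.name = name
        self.parent = parent            # single dependency, for simplicity
        self.marked = marked            # True once .checkpoint() was called
        self.do_checkpoint_called = False
        self.checkpointed = False       # True once actually written out

    def do_checkpoint(self):
        if not self.do_checkpoint_called:
            self.do_checkpoint_called = True
            if self.marked:
                # corresponds to checkpointData.isDefined:
                # write this RDD and STOP -- parents are never visited
                self.checkpointed = True
            elif self.parent is not None:
                # only recurse into dependencies when this RDD itself
                # was not marked for checkpointing
                self.parent.do_checkpoint()

# mimic the REPL session: both RDDs marked, action runs on `temp`
data = ToyRDD("data", marked=True)
temp = ToyRDD("temp", parent=data, marked=True)
temp.do_checkpoint()

print(temp.checkpointed)   # True
print(data.checkpointed)   # False: recursion stopped at temp
```

In this toy model, data would only be checkpointed if do_checkpoint were started from data itself, i.e. if an action ran directly on data.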

 

Thank you.
