Spark 1.5.2.

On 2015-11-26 13:19:39, "张志强(旺轩)" <zzq98...@alibaba-inc.com> wrote:


What's your Spark version?

From: wyphao.2007 [mailto:wyphao.2...@163.com]
Sent: November 26, 2015 10:04
To: user
Cc: dev@spark.apache.org
Subject: Spark checkpoint problem

I am testing checkpointing to understand how it works. My code is as follows:

 

scala> val data = sc.parallelize(List("a", "b", "c"))
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:15

scala> sc.setCheckpointDir("/tmp/checkpoint")
15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be non-local if Spark is running on a cluster: /tmp/checkpoint1

scala> data.checkpoint

scala> val temp = data.map(item => (item, 1))
temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:17

scala> temp.checkpoint

scala> temp.count
 

But I found that only the temp RDD is checkpointed in the /tmp/checkpoint directory; the data RDD is not checkpointed! I found the doCheckpoint function in the org.apache.spark.rdd.RDD class:

 

  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

 

From the code above, only the last RDD (in my case, temp) will be checkpointed. My question: is this deliberately designed, or is it a bug?
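The key point is that the recursion in doCheckpoint stops at the first RDD whose checkpointData is defined: the else branch that would visit the dependencies never runs, so upstream RDDs like data are never checkpointed. The traversal logic can be sketched in plain Scala (FakeRDD is a hypothetical stand-in for illustration, not Spark's API):

```scala
// Hypothetical stand-in for an RDD: a name, a checkpoint request flag,
// and a list of parent (dependency) RDDs.
class FakeRDD(val name: String, val parents: List[FakeRDD]) {
  var checkpointRequested = false   // analogous to checkpointData.isDefined
  var doCheckpointCalled = false
  var materialized = false          // did a checkpoint actually happen?

  // Mirrors the shape of RDD.doCheckpoint: checkpoint this RDD if
  // requested, otherwise recurse into the dependencies.
  def doCheckpoint(): Unit = {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointRequested) {
        materialized = true         // stop here; parents are NOT visited
      } else {
        parents.foreach(_.doCheckpoint())
      }
    }
  }
}

object Demo extends App {
  val data = new FakeRDD("data", Nil)
  val temp = new FakeRDD("temp", List(data))
  data.checkpointRequested = true
  temp.checkpointRequested = true
  temp.doCheckpoint()               // what an action triggers on the final RDD
  println(s"temp materialized: ${temp.materialized}") // true
  println(s"data materialized: ${data.materialized}") // false
}
```

Because temp's checkpoint request is seen first, data is never reached, which matches what you observed. This appears consistent with the code's intent of truncating lineage at the nearest checkpointed RDD (once temp is checkpointed, data's lineage is no longer needed for recovery), though whether that counts as by-design or a bug is exactly the question for the list.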

 

Thank you.