Re: RE: Spark checkpoint problem
I don't think this is a deliberate design. So if you want to explicitly checkpoint an RDD, you may need to run an action on that RDD itself before running an action on any RDD derived from it.

2015-11-26 13:23 GMT+08:00 wyphao.2007:

> Spark 1.5.2.
>
> At 2015-11-26 13:19:39, "张志强(旺轩)" wrote:
>
> What’s your Spark version?
>
> *From:* wyphao.2007 [mailto:wyphao.2...@163.com]
> *Sent:* 2015-11-26 10:04
> *To:* user
> *Cc:* dev@spark.apache.org
> *Subject:* Spark checkpoint problem
>
> I am testing checkpointing to understand how it works. My code is as follows:
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:15
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be non-local if Spark is running on a cluster: /tmp/checkpoint1
>
> scala> data.checkpoint
>
> scala> val temp = data.map(item => (item, 1))
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:17
>
> scala> temp.checkpoint
>
> scala> temp.count
>
> But I found that only the temp RDD is checkpointed in the /tmp/checkpoint directory; the data RDD is not checkpointed! I found the doCheckpoint function in the org.apache.spark.rdd.RDD class:
>
> private[spark] def doCheckpoint(): Unit = {
>   RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
>     if (!doCheckpointCalled) {
>       doCheckpointCalled = true
>       if (checkpointData.isDefined) {
>         checkpointData.get.checkpoint()
>       } else {
>         dependencies.foreach(_.rdd.doCheckpoint())
>       }
>     }
>   }
> }
>
> From the code above, only the last RDD (temp, in my case) will be checkpointed. My question: is this deliberately designed, or is it a bug?
>
> Thank you.

--
王海华
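
To make the behaviour discussed above easier to see, here is a minimal sketch that models only the branch structure of the quoted doCheckpoint() in plain Scala, with no Spark dependency. The names ModelRdd, runAction, and CheckpointModel are hypothetical and exist only for this illustration; the sketch assumes that an action such as count ends by calling doCheckpoint() on the RDD it ran on, and it ignores everything else Spark does.

```scala
// Simplified, hypothetical model of the traversal quoted above; these are not Spark's classes.
final class ModelRdd(val name: String, val parents: List[ModelRdd] = Nil) {
  private var checkpointRequested = false // stands in for checkpointData.isDefined
  private var doCheckpointCalled = false
  var checkpointed = false                // stands in for data written to the checkpoint dir

  // Like RDD.checkpoint: only marks the RDD, nothing is written yet.
  def checkpoint(): Unit = checkpointRequested = true

  // Same branches as the quoted doCheckpoint(): an RDD that is marked for
  // checkpointing is saved, and its parents are NOT visited.
  def doCheckpoint(): Unit =
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointRequested) checkpointed = true
      else parents.foreach(_.doCheckpoint())
    }

  // Assumption: stands in for an action such as count on this RDD.
  def runAction(): Unit = doCheckpoint()
}

object CheckpointModel {
  // Scenario from the original post: mark both RDDs, then run one action on the child.
  def singleAction(): (Boolean, Boolean) = {
    val data = new ModelRdd("data")
    val temp = new ModelRdd("temp", List(data))
    data.checkpoint(); temp.checkpoint()
    temp.runAction()
    (data.checkpointed, temp.checkpointed)
  }

  // Workaround suggested in the reply: run an action on the parent first.
  def actionPerRdd(): (Boolean, Boolean) = {
    val data = new ModelRdd("data")
    val temp = new ModelRdd("temp", List(data))
    data.checkpoint(); data.runAction()
    temp.checkpoint(); temp.runAction()
    (data.checkpointed, temp.checkpointed)
  }

  def main(args: Array[String]): Unit = {
    println(s"single action:  ${singleAction()}") // (false,true)
    println(s"action per RDD: ${actionPerRdd()}") // (true,true)
  }
}
```

In this model the traversal stops at the first RDD that is marked for checkpointing, which matches what was observed in /tmp/checkpoint. In the spark-shell session, the equivalent of the second scenario would be to call data.count right after data.checkpoint, before defining and counting temp.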
WebUI shows poor locality when task scheduling
Hi developers,

I sent this to the user mailing list but got no response.

When running an experimental KMeans job, the cached RDD is the original points data. I saw poor locality in the task details in the WebUI: almost half of the tasks read their input from Network instead of Memory.

A task with Network input takes almost the same time as a task with Hadoop (disk) input, and about twice as long as a task with Memory input.

e.g.
Task(Memory): 16s
Task(Network): 9s
Task(Hadoop): 9s

I see that fetching an RDD block of 30MB from a remote node takes 5 seconds in the executor logs, like below:

15/03/31 04:08:52 INFO CoarseGrainedExecutorBackend: Got assigned task 58
15/03/31 04:08:52 INFO Executor: Running task 15.0 in stage 1.0 (TID 58)
15/03/31 04:08:52 INFO HadoopRDD: Input split: hdfs://master:8000/kmeans/data-Kmeans-5.3g:2013265920+134217728
15/03/31 04:08:52 INFO BlockManager: Found block rdd_3_15 locally
15/03/31 04:08:58 INFO Executor: Finished task 15.0 in stage 1.0 (TID 58). 1920 bytes result sent to driver
15/03/31 04:08:58 INFO CoarseGrainedExecutorBackend: Got assigned task 60 -Task60
15/03/31 04:08:58 INFO Executor: Running task 17.0 in stage 1.0 (TID 60)
15/03/31 04:08:58 INFO HadoopRDD: Input split: hdfs://master:8000/kmeans/data-Kmeans-5.3g:2281701376+134217728
15/03/31 04:09:02 INFO BlockManager: Found block rdd_3_17 remotely
15/03/31 04:09:12 INFO Executor: Finished task 17.0 in stage 1.0 (TID 60). 1920 bytes result sent to driver

So:
1) Does that mean I should cache the RDD with MEMORY_AND_DISK instead of memory only?
2) Should I expand the network capacity or tune the scheduling locality parameter? I set spark.locality.wait up to 15000, but it does not seem to increase the percentage of Memory input.

Any suggestions would be appreciated.
--- Env info ---
Cluster: 4 workers, with 1 core and 2G executor memory
Spark version: 1.1.0
Network: 30MB/s

--- Submit shell ---
bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans --master spark://master:7077 --executor-memory 1g lib/spark-examples-1.1.0-hadoop2.3.0.jar hdfs://master:8000/kmeans/data-Kmeans-7g 8 1

Thanks very much, and please forgive my poor English.

--
Wang Haihua
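
As a small worked check of the timings in the executor log excerpt, the sketch below computes the elapsed seconds between the logged timestamps for task 58 (block found locally) and task 60 (block found remotely). The object name LogTimings and the helper secondsBetween are hypothetical and used only for illustration; the timestamps are copied from the log, and because the log has one-second resolution the results are approximate.

```scala
import java.time.{Duration, LocalTime}

object LogTimings {
  // Whole seconds between two HH:mm:ss timestamps copied from the executor log.
  def secondsBetween(start: String, end: String): Long =
    Duration.between(LocalTime.parse(start), LocalTime.parse(end)).getSeconds

  // Task 58 (TID 58): "Got assigned" to "Finished", block rdd_3_15 found locally.
  val localTask: Long = secondsBetween("04:08:52", "04:08:58")

  // Task 60 (TID 60): "Got assigned" to "Finished", block rdd_3_17 found remotely.
  val remoteTask: Long = secondsBetween("04:08:58", "04:09:12")

  // Task 60: gap between the "Input split" line and the "Found block ... remotely" line.
  val remoteLookup: Long = secondsBetween("04:08:58", "04:09:02")

  def main(args: Array[String]): Unit = {
    println(s"local task:    ${localTask}s")
    println(s"remote task:   ${remoteTask}s")
    println(s"remote lookup: ${remoteLookup}s")
  }
}
```

For this pair of tasks the log gives about 6 seconds for the local read and about 14 seconds for the remote one, of which roughly 4 seconds pass before the block is reported as found remotely.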