Hi, I noticed that checkpointing an RDD appears to run the action twice: the Spark UI shows 2 jobs being executed.
Example:

    val logFile = "/data/pagecounts"
    sc.setCheckpointDir("/checkpoints")
    val logData = sc.textFile(logFile, 2)
    val as = logData.filter(line => line.contains("a"))

Scenario #1:

    as.count()  // Only 1 job.

But if I change the code as follows:

Scenario #2:

    as.cache()
    as.checkpoint()
    as.count()

then 2 jobs are executed, as shown in the Spark UI, with durations of 0.9s and 0.4s.

Why are there 2 jobs in scenario #2?

In the Spark source code, the comment for RDD.checkpoint() says the following: "This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation."

In my example above, I call cache() before checkpoint(), so the RDD should be persisted in memory. Both calls also come before the count() action, so checkpoint() is called before any job has been executed on this RDD.
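
In case it helps with reproducing this outside the Spark UI, below is a minimal sketch of scenario #2 that counts job starts with a SparkListener. The object name CheckpointJobCount, the local[2] master, and the jobsStarted counter are my own placeholders and not part of the original run; the paths are the ones from the example above. I have not attached its output here, so treat it as a way to check the job count, not as a confirmed result.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Hypothetical driver: name and master setting are assumptions for illustration.
object CheckpointJobCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("checkpoint-job-count")
      .setMaster("local[2]") // assumption: local mode, just for reproduction
    val sc = new SparkContext(conf)

    // Count every job the scheduler starts.
    val jobsStarted = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        jobsStarted.incrementAndGet()
      }
    })

    // Same steps as scenario #2 in the post.
    sc.setCheckpointDir("/checkpoints")
    val logData = sc.textFile("/data/pagecounts", 2)
    val as = logData.filter(line => line.contains("a"))

    as.cache()
    as.checkpoint()
    as.count()

    // Listener events are delivered asynchronously, so stop the context
    // before reading the counter to let pending events be processed.
    sc.stop()
    println(s"Jobs started: ${jobsStarted.get()}")
  }
}
```

Commenting out the as.checkpoint() line gives scenario #1 with caching, which makes it easy to compare the two counts side by side.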