Hi, I started doing the AMP Camp 5 exercises <http://ampcamp.berkeley.edu/5/exercises/data-exploration-using-spark.html>. I tried the following 2 scenarios:

*Scenario #1*

    val pagecounts = sc.textFile("data/pagecounts")
    pagecounts.checkpoint
    pagecounts.count

*Scenario #2*

    val pagecounts = sc.textFile("data/pagecounts")
    pagecounts.checkpoint

The total time shown in the Spark shell Application UI was different for the two scenarios. /Scenario #1 took 0.5 seconds, while scenario #2 took only 0.2 s/.

*Questions:*

1. In scenario #1, the checkpoint command by itself does nothing; it's neither a transformation nor an action. It only says that once the RDD materializes after an action, it should be saved to disk. Am I missing something here?

2. I understand that scenario #1 takes more time because the RDD is checkpointed (written to disk). Is there a way to find out how much of the total time was spent on checkpointing? The Spark shell Application UI shows the following: scheduler delay, task deserialization time, GC time, result serialization time, and getting result time. But it doesn't show a breakdown for checkpointing.

3. Is there a way to access the above metrics (e.g. scheduler delay, GC time) and save them programmatically? I want to log some of these metrics for every action invoked on an RDD.

Please let me know if you need more information.

Thanks
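
In case it helps, here is a minimal sketch of what I have in mind for questions 1 and 3. It is only a sketch under assumptions, not something I know to be the recommended approach. It assumes that `SparkListener.onTaskEnd` is a suitable hook for per-task metrics and that `isCheckpointed` reflects whether the checkpoint has been written. The object name `CheckpointProbe`, the `local[*]` master, and the checkpoint directory `/tmp/spark-checkpoints` are hypothetical choices of mine. As far as I can tell, scheduler delay is not a field of `TaskMetrics`, so the sketch does not report it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical standalone probe; in spark-shell the existing `sc` could be used instead.
object CheckpointProbe {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-probe").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: any writable directory works; this path is hypothetical.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Print a few per-task metrics (reported in milliseconds) as each task finishes.
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) { // metrics can be missing, e.g. for failed tasks
          println(s"stage=${taskEnd.stageId} " +
            s"deserializeMs=${m.executorDeserializeTime} " +
            s"runMs=${m.executorRunTime} " +
            s"gcMs=${m.jvmGCTime} " +
            s"resultSerializationMs=${m.resultSerializationTime}")
        }
      }
    })

    val pagecounts = sc.textFile("data/pagecounts")

    // Only marks the RDD for checkpointing; no job runs here.
    pagecounts.checkpoint()
    println(s"before action: isCheckpointed=${pagecounts.isCheckpointed}")

    // The action materializes the RDD; the checkpoint should be written as part of this call.
    val n = pagecounts.count()
    println(s"after action: isCheckpointed=${pagecounts.isCheckpointed}, " +
      s"file=${pagecounts.getCheckpointFile}, count=$n")

    sc.stop()
  }
}
```

If my understanding in question 1 is right, I would expect the first `isCheckpointed` to be false and the second to be true. My impression is also that the checkpoint write runs as its own job after `count`, so its tasks might show up as separate task-end events under a different stage ID, which could be one way to separate checkpointing time from the rest. I have not confirmed that.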