Hi,

I started working through the AMP Camp 5 exercises
<http://ampcamp.berkeley.edu/5/exercises/data-exploration-using-spark.html>.
I tried the following 2 scenarios:

*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint

The total time shown in the Spark shell Application UI was different for
the two scenarios. Scenario #1 took 0.5 seconds, while scenario #2 took only
0.2 seconds.
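
To make my understanding of checkpoint concrete, here is a small sketch of how I would expect to see when the checkpoint is actually written. It assumes it is run in spark-shell (so `sc` already exists), and the checkpoint directory "/tmp/spark-checkpoints" is just a placeholder I made up; it is not part of the exercise.

```scala
// Run in spark-shell, where `sc` is the SparkContext the shell provides.
// "/tmp/spark-checkpoints" is a placeholder directory (an assumption).
sc.setCheckpointDir("/tmp/spark-checkpoints")

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint()                            // only marks the RDD; no job runs here

val markedOnly = pagecounts.isCheckpointed         // nothing has been written at this point
pagecounts.count()                                 // action: runs the job, then the checkpoint is written
val afterAction = pagecounts.isCheckpointed        // should flip once the job has finished
val checkpointFile = pagecounts.getCheckpointFile  // Option[String] with the checkpoint location
```

If my reading is right, `markedOnly` is false and `afterAction` is true, which would explain why the checkpoint cost only shows up once an action is invoked.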

*Questions:*
1. In scenario #1, the checkpoint command does nothing by itself; it's neither a
transformation nor an action. It only says that once the RDD materializes
after an action, go ahead and save it to disk. Am I missing something here?

2. I understand that scenario #1 takes more time because the RDD is
checkpointed (written to disk). Is there a way to find out how much of the
total time was spent on checkpointing?
The Spark shell Application UI shows the following: scheduler delay, task
deserialization time, GC time, result serialization time, and getting result
time. But it doesn't show a breakdown for checkpointing.

3. Is there a way to access the above metrics (e.g. scheduler delay, GC time)
and save them programmatically? I want to log some of these metrics for
every action invoked on an RDD.
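
For question 3, the kind of thing I have in mind is sketched below. It assumes that a SparkListener registered on the shell's `sc` is the right hook for this, which I am not sure about. The class name `TaskTimingListener` is my own, and I print with println only as a stand-in for real logging. Scheduler delay is left out because, as far as I can tell, the UI derives it from other values rather than reading it from a single field.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: prints a few per-task timings (in milliseconds)
// each time a task finishes.
class TaskTimingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    val metrics = taskEnd.taskMetrics   // can be null, e.g. for failed tasks
    if (info != null && metrics != null) {
      println(
        s"stage=${taskEnd.stageId} task=${info.taskId} " +
        s"duration=${info.duration} " +
        s"deserialize=${metrics.executorDeserializeTime} " +
        s"run=${metrics.executorRunTime} " +
        s"gc=${metrics.jvmGCTime} " +
        s"resultSerialization=${metrics.resultSerializationTime}")
    }
  }
}

// Register on the spark-shell context; later actions report through it.
sc.addSparkListener(new TaskTimingListener)
```

With this registered, running `pagecounts.count` should print one line per task, possibly slightly after the action returns since listener events are delivered asynchronously.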

Please let me know if you need more information.
Thanks
—
Jatin
