How can I access data on RDDs?

2015-10-05 Thread jatinganhotra
Consider the following 2 scenarios:

*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
sc.setCheckpointDir("/checkpoints")  // required before checkpoint(), or it throws
pagecounts.checkpoint()
pagecounts.count

*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count

The total time shown in the Spark shell Application UI was different for the
two scenarios: /Scenario #1 took 0.5 seconds, while scenario #2 took only
0.2 s/.

*Questions:*
1. I understand that scenario #1 takes more time because the RDD is
checkpointed (written to disk). Is there a way to isolate the time spent on
checkpointing from the total time?

The Spark shell Application UI shows the following - Scheduler delay, Task
deserialization time, GC time, Result serialization time, Getting result
time. But it doesn't show a breakdown for checkpointing.

2. Is there a way to access the above metrics, e.g. scheduler delay and GC
time, and save them programmatically? I want to log some of these metrics
for every action invoked on an RDD.
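The closest I have found so far is the SparkListener developer API. Below is
a minimal sketch (the class name TaskMetricsLogger is mine, and the
TaskMetrics fields are the ones I see in the 1.x sources, so treat them as
assumptions to verify; note the UI's scheduler delay is computed from other
values rather than exposed as a field):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: log selected per-task metrics as each task finishes.
class TaskMetricsLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics may be absent for failed tasks
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTime=${m.executorRunTime}ms gcTime=${m.jvmGCTime}ms " +
        s"deserialize=${m.executorDeserializeTime}ms " +
        s"resultSer=${m.resultSerializationTime}ms")
    }
  }
}

sc.addSparkListener(new TaskMetricsLogger)  // register before running actions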

3. How can I programmatically access the following information (a sketch
follows this list):
- Size of an RDD when persisted to disk on checkpointing?
- What percentage of an RDD is currently in memory?
- Overall time taken to compute an RDD?
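For the first two bullets, the closest I have found is sc.getRDDStorageInfo
(a developer API); note it reports persisted (cached) blocks only, not the
size of the checkpoint file on disk, so it answers the first bullet at best
partially. For the last bullet I am simply wrapping the action in wall-clock
timing. A sketch (the helper name describeStorage is mine):

// Sketch: report cached-storage numbers for one RDD.
def describeStorage(rddId: Int): Unit = {
  sc.getRDDStorageInfo.find(_.id == rddId).foreach { info =>
    val pctCached = 100.0 * info.numCachedPartitions / info.numPartitions
    println(f"rdd=${info.name} memSize=${info.memSize} bytes " +
      f"diskSize=${info.diskSize} bytes cached=$pctCached%.1f%% of partitions")
  }
}

// Rough wall-clock time for computing an RDD.
val t0 = System.nanoTime()
pagecounts.count
println(s"count took ${(System.nanoTime() - t0) / 1e6} ms")
describeStorage(pagecounts.id)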

Please let me know if you need more information.






Checkpointing RDD calls the job twice?

2015-10-17 Thread jatinganhotra
Hi,

I noticed that when you checkpoint a given RDD, the action ends up being
performed twice - I can see 2 jobs being executed in the Spark UI.

Example:
val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))

Scenario #1:
as.count()  // Only 1 job.

But, if I change the above code to below:

Scenario #2:
as.cache()
as.checkpoint()
as.count()

Here, 2 jobs are executed, as shown in the Spark UI, with durations of
0.9 s and 0.4 s.
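For reference, here is one way to confirm the job count outside the UI,
using the public SparkListener API - a sketch (the class name JobCounter is
mine):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Sketch: count the jobs the driver launches.
class JobCounter extends SparkListener {
  private val jobs = new AtomicInteger(0)
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"job #${jobs.incrementAndGet()} started (jobId=${jobStart.jobId})")
  }
}

sc.addSparkListener(new JobCounter)
as.count()  // prints two job-start lines in scenario #2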

Why are there 2 jobs in scenario #2? In the Spark source code, the comment
for RDD.checkpoint() says the following -
"This function must be called before any job has been executed on this RDD.
It is strongly recommended that this RDD is persisted in memory, otherwise
saving it on a file will require recomputation."

In my example above, I am calling cache() before checkpoint(), so the RDD
will be persisted in memory. Also, both of those calls come before the
count() action, so checkpoint() is called before any job has been executed.

What am I missing here? I have looked further into the source code to
understand what could be wrong, but found nothing.






How to debug Spark source using IntelliJ/Eclipse

2015-12-05 Thread jatinganhotra
Hi,

I am trying to understand Spark's internal code and wanted to debug the
Spark source, to add a new feature. I have tried the steps laid out on the
Spark wiki's IDE setup page, but they don't work.

I also found other posts on the dev mailing list, such as:

1. Spark-1-5-0-setting-up-debug-env, and
2. using-IntelliJ-to-debug-SPARK-1-1-Apps-with-mvn-sbt-for-beginners

But I ran into many issues with both links. I have tried both articles many
times, often restarting the whole process from scratch after deleting
everything and reinstalling, but I always hit some dependency issue.

It would be great if someone from the Spark developers group could point me
to the steps for setting up a Spark debug environment.
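In the meantime, one fallback that should work without building Spark inside
the IDE is plain JDWP remote debugging - a sketch, assuming a local Spark
distribution (the port 5005 is arbitrary):

./bin/spark-shell --driver-java-options \
  "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"

Then attach IntelliJ's "Remote" run configuration to localhost:5005 and set
breakpoints in the Spark sources.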


