How to debug Spark source using IntelliJ/Eclipse

2015-12-05 Thread jatinganhotra
Hi,

I am trying to understand Spark internals and wanted to debug the Spark
source in order to add a new feature. I have tried the steps outlined on the
Spark Wiki's IDE setup page, but they don't work.

I also found other posts in the dev mailing list, such as:

1. Spark-1-5-0-setting-up-debug-env

2. using-IntelliJ-to-debug-SPARK-1-1-Apps-with-mvn-sbt-for-beginners

However, I ran into problems with both of these links. I have tried both
articles several times, often restarting the whole process from scratch
(deleting everything and re-installing), but I always hit some dependency
issue.

It would be great if someone from the Spark users group could point me to
the steps for setting up a Spark debug environment.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-debug-Spark-source-using-IntelliJ-Eclipse-tp25577.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Checkpointing calls the job twice?

2015-10-17 Thread jatinganhotra
Hi,

I noticed that when you checkpoint an RDD, the action ends up being
performed twice: I can see 2 jobs being executed in the Spark UI.

Example:
val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))

Scenario #1:
as.count()  // Only 1 job.

But if I change the above code to the following:

Scenario #2:
as.cache()
as.checkpoint()
as.count()

Here, there are 2 jobs executed, as shown in the Spark UI, with durations
of 0.9s and 0.4s.

Why are there 2 jobs in scenario #2? In the Spark source code, the comment
for RDD.checkpoint() says the following:
"This function must be called before any job has been executed on this RDD.
It is strongly recommended that this RDD is persisted in memory, otherwise
saving it on a file will require recomputation."

In my example above, I call cache() before checkpoint(), so the RDD will be
persisted in memory. Both calls also happen before the count() action, so
checkpoint() is called before any job has executed.
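Here is a toy model of what I think might be going on (plain Scala, no
Spark; this is my sketch of the behavior, not Spark's actual internals): the
action runs as one job, and the checkpoint write is launched as a separate,
second job after the action completes. With cache() in place, that second
job would read cached blocks instead of recomputing the lineage, which would
explain why it is the shorter of the two (0.4s vs 0.9s).

```scala
// Toy model (plain Scala, not Spark internals): count how many "jobs" run
// when an action is followed by a checkpoint write.
object CheckpointToy {
  def run(): (Int, Int) = {
    var jobs = 0
    // Stand-in for SparkContext.runJob: every call is one job in the UI.
    def runJob[A](body: => A): A = { jobs += 1; body }

    val data = Seq("apple", "banana", "cherry").filter(_.contains("a"))

    // Job 1: the count() action computes the result.
    val n = runJob(data.length)
    // Job 2: writing the checkpoint, launched after the first job finishes.
    // With cache(), this pass would read cached blocks, not recompute.
    runJob { data.size }
    (n, jobs)
  }
}
```

If this reading is right, two jobs in the UI is expected behavior, and
cache() only changes how fast the second job runs, not whether it runs.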



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-calls-the-job-twice-tp25110.html




Query about checkpointing time

2015-09-30 Thread jatinganhotra
Hi,

I started doing the amp-camp 5 exercises. I tried the following 2
scenarios:

*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint

The total time shown in the Spark shell Application UI was different for
both scenarios: scenario #1 took 0.5 seconds, while scenario #2 took only
0.2 seconds.

*Questions:*
1. In scenario #1, the checkpoint command does nothing by itself; it's
neither a transformation nor an action. It just says that once the RDD
materializes after an action, it should be saved to disk. Am I missing
something here?
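To make my reading of question 1 concrete, here is a minimal sketch of
checkpoint-as-a-lazy-mark (ToyRDD is a made-up class of mine, not Spark
code): calling checkpoint() only records the request, and the actual write
happens when the first action materializes the data.

```scala
// Toy sketch: checkpoint() only sets a flag; the write happens at the
// first action. ToyRDD is a stand-in, not Spark's RDD.
class ToyRDD(compute: () => Seq[String]) {
  private var checkpointRequested = false
  var checkpointWritten = false

  def checkpoint(): Unit = { checkpointRequested = true } // lazy: just a mark

  def count(): Long = {
    val rows = compute() // the action materializes the data
    if (checkpointRequested && !checkpointWritten) {
      // ...here the rows would be written to the checkpoint dir...
      checkpointWritten = true
    }
    rows.length.toLong
  }
}
```

That would explain the timing: scenario #1 pays the checkpoint cost at
count(), while scenario #2 never triggers the write at all.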

2. I understand that scenario #1 takes more time because the RDD is
check-pointed (written to disk). Is there a way to know how much of the
total time was spent checkpointing?
The Spark shell Application UI shows the following: scheduler delay, task
deserialization time, GC time, result serialization time, and
getting-result time. But it doesn't show a breakdown for checkpointing.

3. Is there a way to access the above metrics (e.g. scheduler delay, GC
time) and save them programmatically? I want to log some of these metrics
for every action invoked on an RDD.
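For question 3, I believe the SparkListener API
(org.apache.spark.scheduler.SparkListener, registered via
sc.addSparkListener) exposes per-task metrics through onTaskEnd. The sketch
below uses stub case classes of my own in place of Spark's
SparkListenerTaskEnd and TaskMetrics so that it is self-contained; the field
names (executorRunTime, jvmGCTime, resultSerializationTime) are my reading
of the TaskMetrics shape, so please correct me if the real API differs.

```scala
// Sketch of the listener pattern for logging task metrics. The Stub* types
// stand in for org.apache.spark.scheduler.SparkListenerTaskEnd and
// org.apache.spark.executor.TaskMetrics so this compiles without Spark.
case class StubTaskMetrics(executorRunTime: Long, jvmGCTime: Long,
                           resultSerializationTime: Long)
case class StubTaskEnd(stageId: Int, taskMetrics: StubTaskMetrics)

// In real code this would extend SparkListener and be registered with
// sc.addSparkListener(new GcTimeLogger).
class GcTimeLogger {
  val gcByStage =
    scala.collection.mutable.Map.empty[Int, Long].withDefaultValue(0L)

  // Called once per finished task; accumulate GC time per stage.
  def onTaskEnd(event: StubTaskEnd): Unit =
    gcByStage(event.stageId) += event.taskMetrics.jvmGCTime
}

object ListenerDemo {
  def run(): Long = {
    val logger = new GcTimeLogger
    logger.onTaskEnd(StubTaskEnd(0, StubTaskMetrics(900, 12, 3)))
    logger.onTaskEnd(StubTaskEnd(0, StubTaskMetrics(400, 8, 2)))
    logger.gcByStage(0) // total JVM GC time for stage 0
  }
}
```

Is a listener along these lines the recommended way to log metrics per
action, or is there a better hook?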

Please let me know if you need more information.
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Query-about-checkpointing-time-tp24884.html
