RE: RDD.join vs spark SQL join

2015-08-15 Thread Xiao JIANG
Thank you Akhil!

Date: Fri, 14 Aug 2015 14:51:56 +0530
Subject: Re: RDD.join vs spark SQL join
From: ak...@sigmoidanalytics.com
To: jiangxia...@outlook.com
CC: user@spark.apache.org

Both work in much the same way, but with Spark SQL you get the optimizations done by the Catalyst optimizer. One important thing to consider (when you are doing RDD.join) is the number of partitions and the key distribution: if the keys are not evenly distributed across machines, you can see the process choking on a single task (i.e. one task takes far longer to execute than the others in that stage).

Thanks
Best Regards
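For illustration only, here is a minimal sketch of the two approaches (assuming Spark 1.3-era APIs; the orders/users data, column names, and object name are made up, not anything from this thread): a plain RDD.join, a quick countByKey check for skewed keys, and the equivalent DataFrame join that goes through Catalyst.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-comparison"))

    // Hypothetical key-value RDDs standing in for the "big RDDs" in the question.
    val orders = sc.parallelize(Seq((1, 100.0), (1, 250.0), (2, 75.0)))
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

    // 1) Plain RDD join: no optimizer involved; partitioning and key skew
    //    are entirely your responsibility.
    val rddJoined = orders.join(users)   // RDD[(Int, (Double, String))]

    // Quick skew check: if a handful of keys dominate, one task in the join
    // stage will run much longer than the rest.
    orders.countByKey().toSeq.sortBy(-_._2).take(5).foreach(println)

    // 2) DataFrame / Spark SQL join: the plan goes through the Catalyst optimizer.
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val ordersDf = orders.toDF("userId", "amount")
    val usersDf  = users.toDF("userId", "name")
    val dfJoined = ordersDf.join(usersDf, ordersDf("userId") === usersDf("userId"))

    println(s"RDD join rows: ${rddJoined.count()}, DataFrame join rows: ${dfJoined.count()}")
    sc.stop()
  }
}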

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG jiangxia...@outlook.com wrote:



Hi,

May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here?

Thanks!

RDD.join vs spark SQL join

2015-08-13 Thread Xiao JIANG
Hi,

May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here?

Thanks!

How to get total CPU consumption for Spark job

2015-08-07 Thread Xiao JIANG
Hi all,
I was running some Hive/Spark jobs on a Hadoop cluster. I want to see how Spark helps improve not only the elapsed time but also the total CPU consumption.

For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when the job finishes. But I didn't find any CPU stats for Spark jobs in either the Spark log or the web UI. Is there any place I can find the total CPU consumption for my Spark job? Thanks!

Here is the version info: Spark version 1.3.0, Scala version 2.10.4, Java 1.7.0_67.

Thanks!
Xiao
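[Editor's note: the thread contains no answer. As a rough workaround sketch only, not anything suggested in the thread: Spark 1.3 does not report a total CPU figure, but TaskMetrics does expose executorRunTime per task, which can be summed with a SparkListener. Note this is wall-clock time the tasks spent running on executors, not true CPU time (later Spark versions added a separate executorCpuTime metric). The object name and the sample job below are hypothetical.]

import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

object TotalTaskTime {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("total-task-time"))

    // Accumulate per-task executor run time (milliseconds spent running tasks
    // on executors). This is NOT CPU time, but it is the closest per-job total
    // that Spark 1.3 exposes through TaskMetrics.
    val totalRunTimeMs = new AtomicLong(0L)

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          totalRunTimeMs.addAndGet(metrics.executorRunTime)
        }
      }
    })

    // Hypothetical job, just to have something to measure.
    sc.parallelize(1 to 1000000, 8).map(_ * 2).count()

    println(s"Total executor task run time: ${totalRunTimeMs.get()} ms")
    sc.stop()
  }
}

If the cluster runs on a recent enough YARN, the ResourceManager application page also shows aggregate MB-seconds and vcore-seconds per application, though that measures resource allocation rather than actual CPU use.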