Thank you, Akhil!
Date: Fri, 14 Aug 2015 14:51:56 +0530
Subject: Re: RDD.join vs spark SQL join
From: ak...@sigmoidanalytics.com
To: jiangxia...@outlook.com
CC: user@spark.apache.org
Both work the same way, but with Spark SQL you also get the optimizations
done by Catalyst. One important thing to consider is the number of partitions
and the key distribution (when you are doing RDD.join). If the keys are not
evenly distributed across machines, you can see the process choking on a
single task (that is, one task takes far longer to execute than the others in
that stage).

Thanks
Best Regards
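
To illustrate the key-distribution point above, here is a minimal plain-Python sketch of how a skewed key set concentrates records in one partition. It is not Spark code and does not use Spark's partitioner: the helper names (`partition_sizes`, `skew_ratio`), the CRC32-based hash, the sample keys, and the choice of 8 partitions are all assumptions made for illustration only.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count records per partition under a simple hash-modulo scheme.

    zlib.crc32 stands in for a real partitioner's hash function
    (an assumption for this sketch, chosen because it is deterministic).
    """
    counts = Counter(
        zlib.crc32(str(k).encode("utf-8")) % num_partitions for k in keys
    )
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition size divided by the mean size; about 1.0 means even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical data: 10,000 records with distinct keys.
even_keys = list(range(10_000))
# Hypothetical data: 90% of the records share a single "hot" key.
skewed_keys = ["hot"] * 9_000 + list(range(1_000))

even = partition_sizes(even_keys, 8)
skewed = partition_sizes(skewed_keys, 8)

print("even  :", even, "skew ratio:", round(skew_ratio(even), 2))
print("skewed:", skewed, "skew ratio:", round(skew_ratio(skewed), 2))
```

In the skewed case every record with the hot key hashes to the same partition, so the task handling that partition has far more work than the rest. A similar per-key count on the real data before a join can show whether this is likely to happen.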
On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG jiangxia...@outlook.com wrote:
Hi,

May I know the performance difference between the rdd.join function and the
Spark SQL join operation? If I want to join several big RDDs, how should I
decide which one to use? What are the factors to consider here? Thanks!