RE: RDD.join vs spark SQL join
Thank you Akhil!

Date: Fri, 14 Aug 2015 14:51:56 +0530
Subject: Re: RDD.join vs spark SQL join
From: ak...@sigmoidanalytics.com
To: jiangxia...@outlook.com
CC: user@spark.apache.org

Both work the same way, but with Spark SQL you also get the optimizations done by the Catalyst optimizer. One important thing to consider is the number of partitions and the key distribution (when you are doing RDD.join). If the keys are not evenly distributed across machines, you can see the job choking on a single task (that is, one task takes far longer to execute than the others in that stage).

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG jiangxia...@outlook.com wrote:

Hi,
May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here?
Thanks!
Re: RDD.join vs spark SQL join
Both work the same way, but with Spark SQL you also get the optimizations done by the Catalyst optimizer. One important thing to consider is the number of partitions and the key distribution (when you are doing RDD.join). If the keys are not evenly distributed across machines, you can see the job choking on a single task (that is, one task takes far longer to execute than the others in that stage).

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:25 AM, Xiao JIANG jiangxia...@outlook.com wrote:

Hi,
May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here?
Thanks!
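To make the key-distribution point in the reply above concrete, here is a rough sketch in plain Python (no Spark required). It only simulates hash partitioning with `hash(key) % num_partitions`; Spark's real partitioner hashes keys differently, and the helper names `partition_sizes` and `skew_ratio` are made up for this illustration. The idea is simply that when most records share one key, one partition (and so one join task) ends up with most of the work.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count records per partition under a simple hash(key) % n scheme.

    This is an illustrative stand-in, not Spark's actual partitioner.
    """
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Evenly distributed integer keys: every partition gets the same share.
even_keys = list(range(1000))
even_sizes = partition_sizes(even_keys, 4)

# Skewed keys: 900 of the 1000 records share one "hot" key (0).
skewed_keys = [0] * 900 + list(range(1, 101))
skewed_sizes = partition_sizes(skewed_keys, 4)

print(even_sizes, skew_ratio(even_sizes))      # [250, 250, 250, 250] 1.0
print(skewed_sizes, skew_ratio(skewed_sizes))  # [925, 25, 25, 25] 3.7
```

In the skewed case the task handling partition 0 processes 37 times as many records as any other task, which is the "one task takes far longer than the others" symptom described above. On a real RDD, counting records per key before the join would give a similar picture.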
RDD.join vs spark SQL join
Hi,
May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here?
Thanks!