If you specify the same partitioner (custom or otherwise) for both partitionBy and groupBy, then maybe it will help. The fundamental problem is groupByKey, which takes a lot of working memory.

1. Try to avoid groupByKey. What is it that you want to do after sorting the list of grouped events? Can you do that operation with a reduceByKey?

2. If not, use more partitions. That would put less data in each partition, so less spilling.

3. You can control the amount of memory allocated for shuffles by changing the configuration spark.shuffle.memoryFraction. A higher fraction would cause less spilling.
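To illustrate point 1, here is a rough sketch (not your exact job; it assumes the RDD[(String, (Long, String))] shape from your code) of replacing groupByKey-then-sort with aggregateByKey, which at least enables map-side combining. Caveat: when the combined value is the full list itself, roughly the same data is still shuffled, so the real win comes only if the downstream operation lets the combiner shrink the data (e.g. keeping just the max timestamp per session).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object GroupSortSketch {
  // Build the per-session sorted list with aggregateByKey instead of
  // groupByKey: values are accumulated into a List per partition (seqOp)
  // and partial lists are concatenated across partitions (combOp).
  def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] =
    rdd.aggregateByKey(List.empty[(Long, String)])(
      (acc, v) => v :: acc,  // seqOp: add one record within a partition
      (a, b)   => a ::: b    // combOp: merge partial lists across partitions
    ).mapValues(_.sortBy(_._1))  // sort each session's records by timestamp
}
```

If only an aggregate (not the full sorted list) is needed downstream, a plain reduceByKey on that aggregate is the better rewrite.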
On Tue, Oct 27, 2015 at 6:35 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:

> So, wouldn't using a custom partitioner on the RDD upon which the
> groupByKey or reduceByKey is performed avoid shuffles and improve
> performance? My code does a group-and-sort and a reduceByKey on different
> datasets, as shown below. Would using a custom partitioner on those
> datasets before the groupByKey or reduceByKey improve performance? My idea
> is to avoid shuffles and improve performance. Also, right now I see a lot
> of spills when there is a very large dataset for groupByKey and
> reduceByKey; I think the memory is not sufficient. We need to group by
> sessionId and then sort the JSONs based on the timestamp, as shown in the
> code below.
>
> What is the alternative to using groupByKey for better performance? And in
> the case of reduceByKey, would using a custom partitioner on the RDD upon
> which the reduceByKey is performed reduce the shuffles and improve
> performance?
>
> rdd.partitionBy(customPartitioner)
>
> def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = {
>   val grpdRecs = rdd.groupByKey()
>   val srtdRecs = grpdRecs.mapValues[List[(Long, String)]](iter => iter.toList.sortBy(_._1))
>   srtdRecs
> }
>
> rdd.reduceByKey((a, b) => {
>   (Math.max(a._1, b._1), (a._2 ++ b._2))
> })
>
> On Tue, Oct 27, 2015 at 5:07 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> If you just want to control the number of reducers, then setting
>> numPartitions is sufficient. If you want to control the exact partitioning
>> scheme (that is, some scheme other than hash-based), then you need to
>> implement a custom partitioner. It can be used to mitigate data skew,
>> etc., which ultimately improves performance.
>>
>> On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We currently use reduceByKey to reduce by a particular metric name in
>>> our Streaming/Batch job.
>>> It seems to be doing a lot of shuffles and that has an impact on
>>> performance. Does using a custom partitioner before calling reduceByKey
>>> improve performance?
>>>
>>> Thanks,
>>> Swetha
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-using-Custom-Partitioner-before-calling-reduceByKey-improve-performance-tp25214.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
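P.S. On the custom-partitioner question quoted above, a minimal sketch of one (a hypothetical SessionPartitioner; functionally this is just what Spark's built-in HashPartitioner already does, so a custom class only pays off when you need a non-hash scheme, or when you reuse the same instance so later stages can skip the shuffle):

```scala
import org.apache.spark.Partitioner

// Hash session-ID keys into a fixed number of partitions. Overriding
// equals/hashCode matters: Spark compares partitioners to decide whether
// a reduceByKey/groupByKey after partitionBy needs another shuffle.
class SessionPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val h = key.asInstanceOf[String].hashCode % numPartitions
    if (h < 0) h + numPartitions else h  // hashCode can be negative
  }
  override def equals(other: Any): Boolean = other match {
    case p: SessionPartitioner => p.numPartitions == numPartitions
    case _                     => false
  }
  override def hashCode: Int = numPartitions
}
```

Usage would be along the lines of rdd.partitionBy(new SessionPartitioner(200)).reduceByKey(...) — or simply passing the partitioner to reduceByKey directly, which avoids a separate partitionBy step.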