After sorting the list of grouped events, I need an RDD whose key is the
sessionId and whose value is the list of (timeStamp, Json) pairs sorted by
timeStamp. So the return type would be RDD[(String, List[(Long, String)])],
where the key is the sessionId and the value is a list of tuples holding the
timeStamp and the Json. Right now I use groupByKey to group by sessionId and
then do a secondary sort by timeStamp to get the Json values in sorted order.
Is there an alternative to that? The code I use for this is below.


Also, does using a custom partitioner with reduceByKey improve performance?


def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = {
  // Group all (timeStamp, Json) pairs by sessionId, then sort each group by timeStamp.
  val grpdRecs = rdd.groupByKey()
  val srtdRecs = grpdRecs.mapValues(iter => iter.toList.sortBy(_._1))
  srtdRecs
}
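
On the custom partitioner question above, here is a minimal sketch of what I
have in mind, assuming the RDD key is the sessionId; the SessionIdPartitioner
name and the partition count are just illustrative:

import org.apache.spark.Partitioner

// Hypothetical partitioner that places records by hashing the sessionId key.
class SessionIdPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int =
    (key.hashCode & Int.MaxValue) % numPartitions
}

// Passing the same partitioner to partitionBy and to groupByKey should keep the
// data co-partitioned, so groupByKey does not have to shuffle it again;
// reduceByKey has an equivalent overload that takes a Partitioner as its first argument.
val partitioner = new SessionIdPartitioner(200)
val grouped = rdd.partitionBy(partitioner).groupByKey(partitioner)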


On Tue, Oct 27, 2015 at 6:47 PM, Tathagata Das <t...@databricks.com> wrote:

> If you specify the same partitioner (custom or otherwise) for both
> partitionBy and groupByKey, then it may help. The fundamental problem
> is groupByKey, which takes a lot of working memory.
> 1. Try to avoid groupByKey. What is it that you want to do after sorting the
> list of grouped events? Can you do that operation with a reduceByKey?
> 2. If not, use more partitions. That means less data in each
> partition, so less spilling.
> 3. You can control the amount of memory allocated for shuffles by changing
> the configuration spark.shuffle.memoryFraction. A larger fraction means
> less spilling.
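> For example, the fraction could be raised when building the SparkConf; 0.4
> here is only an illustrative value:
>
>   import org.apache.spark.SparkConf
>   // Sketch: give shuffles a larger share of executor memory than the default.
>   val conf = new SparkConf().set("spark.shuffle.memoryFraction", "0.4")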
>
>
> On Tue, Oct 27, 2015 at 6:35 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> So, wouldn't using a custom partitioner on the RDD on which the
>> groupByKey or reduceByKey is performed avoid shuffles and improve
>> performance? My code does a group-and-sort and a reduceByKey on different
>> datasets, as shown below. Would using a custom partitioner on those datasets
>> before the groupByKey or reduceByKey improve performance? My goal is to
>> avoid shuffles. Also, right now I see a lot of spills when the dataset for
>> groupByKey and reduceByKey is very large; I think the memory is not
>> sufficient. We need to group by sessionId and then sort the Jsons by
>> timeStamp, as shown in the code below.
>>
>>
>> What is the alternative to groupByKey for better performance? And in the
>> case of reduceByKey, would using a custom partitioner on the RDD on which
>> the reduceByKey is performed reduce the shuffles and improve performance?
>>
>>
>> rdd.partitionBy(customPartitioner)
>>
>> def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = {
>>   // Group all (timeStamp, Json) pairs by sessionId, then sort each group by timeStamp.
>>   val grpdRecs = rdd.groupByKey()
>>   val srtdRecs = grpdRecs.mapValues(iter => iter.toList.sortBy(_._1))
>>   srtdRecs
>> }
>>
>> rdd.reduceByKey((a, b) => {
>>   (Math.max(a._1, b._1), (a._2 ++ b._2))
>> })
>>
>>
>>
>> On Tue, Oct 27, 2015 at 5:07 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> If you just want to control the number of reducers, then setting
>>> numPartitions is sufficient. If you want to control the exact partitioning
>>> scheme (that is, some scheme other than hash-based), then you need to
>>> implement a custom partitioner. It can be used to mitigate data skew, etc.,
>>> which ultimately improves performance.
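>>> For example, the number of reducers can be set directly on reduceByKey;
>>> 400 below is only an illustrative value, and the merge function is the one
>>> from your snippet:
>>>
>>>   // Sketch: reduce with an explicit partition count instead of a custom partitioner.
>>>   rdd.reduceByKey((a, b) => (Math.max(a._1, b._1), a._2 ++ b._2), 400)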
>>>
>>> On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We currently use reduceByKey to reduce by a particular metric name in
>>>> our Streaming/Batch job. It seems to be doing a lot of shuffles, and that
>>>> has an impact on performance. Does using a custom partitioner before
>>>> calling reduceByKey improve performance?
>>>>
>>>>
>>>> Thanks,
>>>> Swetha
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-using-Custom-Partitioner-before-calling-reduceByKey-improve-performance-tp25214.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>
