Hi, I tried to convert a groupByKey operation to aggregateByKey in the hope of avoiding high memory and GC issues when dealing with 200GB of data. For each key, I need to produce a collection of resulting key-value pairs representing all combinations of that key's elements.
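For context, here is roughly how I wired it up (a simplified sketch: pairRdd stands in for my input RDD of (String, record) pairs, and processDataSeq is a placeholder name for my seq function; the real zero value and types are more involved):

    val zero = collection.mutable.Map.empty[String, UserDataSet]
    val aggregated = pairRdd.aggregateByKey(zero)(
      (acc, value) => processDataSeq(acc, value), // seqOp: fold one record into the per-partition map
      (m1, m2) => processDataMerge(m1, m2)        // combOp: merge two partial maps (defined below)
    )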
My merge function is defined roughly as follows:

    private def processDataMerge(
        map1: collection.mutable.Map[String, UserDataSet],
        map2: collection.mutable.Map[String, UserDataSet])
      : collection.mutable.Map[String, UserDataSet] = {
      // pseudocode: merge map1 and map2, then
      //   (Set[combEle1], Set[combEle2], ...) = map1.map(...extract all elements here)
      //   comb1 = combinatorics(Set[combEle1]), comb2 = ..., etc.
      //   totalCombinations = comb1 + comb2 + ...
      // finally, add an entry per new combination (++ rather than + to merge a whole collection):
      map1 ++ totalCombinations.map(comb => comb -> UserDataSet)
    }

The output of one merge (or seq) step is basically the combinations of the input collections' elements, and so on, so that in the end you get all combinations for a given key.

It performs worse with aggregateByKey than with groupByKey under the same configuration. groupByKey used to halt at the last 9 partitions out of 4000; this one halts even earlier (halting due to high GC). I kill the job after it stalls on the same task for hours. I give executors 25GB of memory and 4GB of overhead; my cluster can't allocate more than 32GB per executor.

I thought of custom-partitioning (salting) my keys so there's less data per key and hence fewer combinations. That would help with the data skew, but in the end wouldn't it come to the same thing? At some point the key-values spread across the different salts will need to be merged, and the memory issue will reappear right there (see the P.S. sketch below).

Any pointers to resolve this? Perhaps an external merge?

Thanks,
Nirav
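P.S. For clarity, here is a rough sketch of the salting idea I mean (the salt count of 10 and all names are made up; processDataSeq and processDataMerge as above). The second stage is where I suspect the same blow-up returns:

    import scala.util.Random

    // Stage 1: spread each hot key across 10 salted sub-keys and aggregate those.
    val salted = pairRdd.map { case (k, v) => ((k, Random.nextInt(10)), v) }
    val partial = salted.aggregateByKey(collection.mutable.Map.empty[String, UserDataSet])(
      processDataSeq, processDataMerge)

    // Stage 2: drop the salt and merge the partial maps per original key --
    // all combinations for a key still meet in a single task here.
    val merged = partial
      .map { case ((k, _), combos) => (k, combos) }
      .reduceByKey(processDataMerge)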