Hi, I tried to convert a groupByKey operation to aggregateByKey in the hope of avoiding high memory and GC issues when dealing with 200GB of data. For each key, I need to produce a collection of resulting key-value pairs representing all combinations of that key's elements.
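For context, here is roughly how I wired it up (a simplified sketch: pairRdd stands in for my input RDD of (String, record) pairs, and processDataSeq is a placeholder name for my seq function; the real zero value and types are more involved):

    val zero = collection.mutable.Map.empty[String, UserDataSet]
    val aggregated = pairRdd.aggregateByKey(zero)(
      (acc, value) => processDataSeq(acc, value), // seqOp: fold one record into the per-partition map
      (m1, m2) => processDataMerge(m1, m2)        // combOp: merge two partial maps (defined below)
    )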
My merge function is defined roughly as follows:

    private def processDataMerge(
        map1: collection.mutable.Map[String, UserDataSet],
        map2: collection.mutable.Map[String, UserDataSet])
      : collection.mutable.Map[String, UserDataSet] = {
      // pseudocode: merge map1 and map2, then
      //   (Set[combEle1], Set[combEle2], ...) = map1.map(...extract all elements here)
      //   comb1 = combinatorics(Set[combEle1]), comb2 = ..., etc.
      //   totalCombinations = comb1 + comb2 + ...
      // finally, add an entry per new combination (++ rather than + to merge a whole collection):
      map1 ++ totalCombinations.map(comb => comb -> UserDataSet)
    }

The output of one merge (or seq) step is basically the combinations of the input collections' elements, and so on, so that in the end you get all combinations for a given key.

It performs worse with aggregateByKey than with groupByKey under the same configuration. groupByKey used to halt at the last 9 partitions out of 4000; this one halts even earlier (halting due to high GC). I kill the job after it stalls on the same task for hours. I give executors 25GB of memory and 4GB of overhead; my cluster can't allocate more than 32GB per executor.

I thought of custom-partitioning (salting) my keys so there's less data per key and hence fewer combinations. That would help with the data skew, but in the end wouldn't it come to the same thing? At some point the key-values spread across the different salts will need to be merged, and the memory issue will reappear right there (see the P.S. sketch below).

Any pointers to resolve this? Perhaps an external merge?

Thanks,
Nirav
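P.S. For clarity, here is a rough sketch of the salting idea I mean (the salt count of 10 and all names are made up; processDataSeq and processDataMerge as above). The second stage is where I suspect the same blow-up returns:

    import scala.util.Random

    // Stage 1: spread each hot key across 10 salted sub-keys and aggregate those.
    val salted = pairRdd.map { case (k, v) => ((k, Random.nextInt(10)), v) }
    val partial = salted.aggregateByKey(collection.mutable.Map.empty[String, UserDataSet])(
      processDataSeq, processDataMerge)

    // Stage 2: drop the salt and merge the partial maps per original key --
    // all combinations for a key still meet in a single task here.
    val merged = partial
      .map { case ((k, _), combos) => (k, combos) }
      .reduceByKey(processDataMerge)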