[ https://issues.apache.org/jira/browse/HIVE-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540599#comment-16540599 ]
Sahil Takiar commented on HIVE-20108: ------------------------------------- Another limitation of {{groupByKey}} is that it can't push any aggregation into any of the shuffle logic. What we probably want here is something like {{combineByKey}} or {{reduceByKey}}. These functions allow specifying an {{Aggregator}} which is called in {{ExternalAppendOnlyMap}} to push aggregation into the shuffle-reader (e.g. {{BlockStoreShuffleReader}}). This allows aggregating data in memory, which avoids having to spill much data to disk. {{groupByKey}} doesn't have this functionality and all data has to be stored in the {{ExternalAppendOnlyMap}} before any aggregation by Hive can start, this can result in a lot more spilled data. Getting {{combineByKey}} / {{reduceByKey}} to work with HoS looks tricky, we basically have to wrap the {{GroupByOperator}} into a function that Spark can call. This should significantly decrease the chance of OOM that has been seen in {{groupByKey}} since we are pushing the aggregation as far down as possible which results in less data being stored in memory and spilled to disk. > Investigate alternatives to groupByKey > -------------------------------------- > > Key: HIVE-20108 > URL: https://issues.apache.org/jira/browse/HIVE-20108 > Project: Hive > Issue Type: Improvement > Components: Spark > Reporter: Sahil Takiar > Assignee: Sahil Takiar > Priority: Major > > We use {{groupByKey}} for aggregations (or if > {{hive.spark.use.groupby.shuffle}} is false we use > {{repartitionAndSortWithinPartitions}}). > {{groupByKey}} has its drawbacks because it can't spill records within a > single key group. It also seems to be doing some unnecessary work in Spark's > {{Aggregator}} (not positive about this part). > {{repartitionAndSortWithinPartitions}} is better, but the sorting within > partitions isn't necessary for aggregations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)