Sahil Takiar created HIVE-20141:
-----------------------------------
Summary: Turn hive.spark.use.groupby.shuffle off by default
Key: HIVE-20141
URL: https://issues.apache.org/jira/browse/HIVE-20141
Project: Hive
Issue Type: Task
Components: Spark
Reporter: Sahil Takiar
Assignee: Sahil Takiar
[~xuefuz] any thoughts on this? I think it would provide better out of the box
behavior for Hive-on-Spark users, especially for users who are migrating from
Hive-on-MR to HoS. Wondering what your experience with this config has been?
I've done a bunch of performance profiling with this config turned on vs. off,
and for TPC-DS queries it doesn't make a significant difference. The main
difference I can see is that when a Spark stage has to spill to disk,
{{repartitionAndSortWithinPartitions}} spills more data to disk than
{{groupByKey}} - my guess is that this happens because {{groupByKey}} stores
everything in Spark's {{ExternalAppendOnlyMap}} (which only stores a single
copy of the key for potentially multiple values) whereas
{{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}} which
sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results in
more data being spilled to disk).
My understanding is that using {{repartitionAndSortWithinPartitions}} for Hive
GROUP BYs is similar to what Hive-on-MR does. So disabling this config would
provide a similar experience to HoMR. Furthermore, last I checked,
{{groupByKey}} still can't spill within a row group.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)