[ https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Haindrich resolved HIVE-26184.
-------------------------------------
    Fix Version/s: 4.0.0-alpha-2
       Resolution: Fixed

Merged into master. Thank you, [~okumin]!

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> -----------------------------------------------------------------------
>
>                 Key: HIVE-26184
>                 URL: https://issues.apache.org/jira/browse/HIVE-26184
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.8, 3.1.3
>            Reporter: okumin
>            Assignee: okumin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0-alpha-2
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I observed that some reducers spend 98% of their CPU time invoking
> `java.util.HashMap#clear`.
> Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet,
> and its `clear` can be quite expensive when a relation has a small number
> of highly skewed keys.
>
> To reproduce the issue, we first create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '00000000-0000-0000-0000-000000000000' AS key, CAST(UUID() AS VARCHAR) AS value
> FROM table_with_many_rows
> LIMIT 100000;{code}
> Then we create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 5000000;{code}
> We can observe the issue when we aggregate the values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM test_collect_set GROUP BY key{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
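For readers who want to see the effect outside Hive, here is a minimal standalone Java sketch of the behaviour described in the report. It is not Hive's code and not necessarily the change that was merged; the class name `ClearCostSketch` and its methods are hypothetical. It relies on the fact that `HashMap#clear` (which backs `LinkedHashSet`) nulls every slot of its backing table and never shrinks that table, so once one large group has grown the set, each later `clear` pays for the full table even if the group held a single value. Allocating a fresh set per group is shown only as one possible mitigation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: not Hive's implementation.
public class ClearCostSketch {
    // Adds `count` distinct values for the given group to the set.
    static void fill(Set<String> set, int group, int count) {
        for (int i = 0; i < count; i++) {
            set.add(group + ":" + i);
        }
    }

    // One large (skewed) group followed by many one-value groups,
    // reusing a single set and calling clear() between groups.
    // Returns the total number of values collected across all groups.
    static long reuseWithClear(int skewedSize, int smallGroups) {
        Set<String> buffer = new LinkedHashSet<>();
        fill(buffer, 0, skewedSize);
        long total = buffer.size();
        for (int g = 1; g <= smallGroups; g++) {
            buffer.clear(); // walks the whole table grown by the skewed group
            fill(buffer, g, 1);
            total += buffer.size();
        }
        return total;
    }

    // Same workload, but a fresh set is allocated for each group,
    // so small groups never touch the large table.
    static long freshSetPerGroup(int skewedSize, int smallGroups) {
        Set<String> buffer = new LinkedHashSet<>();
        fill(buffer, 0, skewedSize);
        long total = buffer.size();
        for (int g = 1; g <= smallGroups; g++) {
            buffer = new LinkedHashSet<>();
            fill(buffer, g, 1);
            total += buffer.size();
        }
        return total;
    }

    public static void main(String[] args) {
        int skewedSize = 100_000; // matches the skewed key in the report
        int smallGroups = 2_000;  // scaled down from 5,000,000 to keep the run short

        long t0 = System.nanoTime();
        long reused = reuseWithClear(skewedSize, smallGroups);
        long t1 = System.nanoTime();
        long fresh = freshSetPerGroup(skewedSize, smallGroups);
        long t2 = System.nanoTime();

        System.out.printf("reuse + clear(): %d values, %d ms%n", reused, (t1 - t0) / 1_000_000);
        System.out.printf("fresh set:       %d values, %d ms%n", fresh, (t2 - t1) / 1_000_000);
    }
}
```

Both methods collect the same values; only the time spent differs, and the gap should widen as `smallGroups` approaches the 5,000,000 rows used in the reproduction. Exact timings depend on the JVM and hardware.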