okumin created HIVE-26184:
-----------------------------

             Summary: COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
                 Key: HIVE-26184
                 URL: https://issues.apache.org/jira/browse/HIVE-26184
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 3.1.3, 2.3.8
            Reporter: okumin
            Assignee: okumin

I observed that some reducers spend 98% of their CPU time invoking `java.util.HashMap#clear`. Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, and its `clear` can be quite heavy when a relation has a small number of highly skewed keys.

To reproduce the issue, first, we will create rows with a skewed key.

{code:java}
INSERT INTO test_collect_set
SELECT '00000000-0000-0000-0000-000000000000' AS key, CAST(UUID() AS VARCHAR) AS value
FROM table_with_many_rows
LIMIT 100000;
{code}

Then, we will create many non-skewed rows.

{code:java}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM sample_datasets.nasdaq
LIMIT 5000000;
{code}

We can observe the issue when we aggregate the values by `key`.

{code:java}
SELECT key, COLLECT_SET(value)
FROM test_collect_set
GROUP BY key;
{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
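
To illustrate the mechanism described in the report, here is a minimal standalone sketch in plain Java. It is not Hive's code. It relies only on documented JDK behavior: `LinkedHashSet` is backed by a `HashMap` variant, and `HashMap#clear` walks the whole bucket array, which never shrinks once a large group has grown it. The class name, method names, and group sizes are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not taken from the Hive code base.
class CollectSetClearSketch {

    // Strategy A: one buffer is reused for every group and emptied with clear().
    // After the skewed group has enlarged the backing table, every later clear()
    // still walks that enlarged table, even if the group held a single value.
    static List<Integer> sizesWithReusedBuffer(int skewedSize, int smallGroups) {
        Set<String> buffer = new LinkedHashSet<>();
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < skewedSize; i++) {
            buffer.add("skewed-" + i);
        }
        sizes.add(buffer.size());
        buffer.clear();
        for (int g = 0; g < smallGroups; g++) {
            buffer.add("value-" + g);
            sizes.add(buffer.size());
            buffer.clear(); // cost follows table capacity, not the current size
        }
        return sizes;
    }

    // Strategy B: a fresh set per group, so a small group only ever touches
    // a small table. Shown for comparison only, not as the proposed fix.
    static List<Integer> sizesWithFreshBuffer(int skewedSize, int smallGroups) {
        List<Integer> sizes = new ArrayList<>();
        Set<String> skewed = new LinkedHashSet<>();
        for (int i = 0; i < skewedSize; i++) {
            skewed.add("skewed-" + i);
        }
        sizes.add(skewed.size());
        for (int g = 0; g < smallGroups; g++) {
            Set<String> small = new LinkedHashSet<>();
            small.add("value-" + g);
            sizes.add(small.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        int skewedSize = 100_000; // mirrors the skewed key in the reproduction
        int smallGroups = 5_000;  // far fewer than 5,000,000, to keep the run short

        long t0 = System.nanoTime();
        sizesWithReusedBuffer(skewedSize, smallGroups);
        long t1 = System.nanoTime();
        sizesWithFreshBuffer(skewedSize, smallGroups);
        long t2 = System.nanoTime();

        System.out.printf("reused buffer: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("fresh buffer:  %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

Both strategies return the same group sizes; only the time spent differs, and the gap should widen as the number of small groups processed after the skewed one grows. The exact timings depend on the JVM and the hardware.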