[
https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763448
]
ASF GitHub Bot logged work on HIVE-26184:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 28/Apr/22 11:44
Start Date: 28/Apr/22 11:44
Worklog Time Spent: 10m
Work Description: okumin commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r860788021
##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
##########
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
throw new RuntimeException("Buffer type unknown");
}
}
+
+ private void reset() {
+ if (bufferType == BufferType.LIST) {
+ container.clear();
+ } else if (bufferType == BufferType.SET) {
+ // Don't reuse a container because HashSet#clear can be very slow. The
operation takes O(N)
Review Comment:
@kgyrtkirk Thanks for your quick review!
I wanted to mean skew of GROUP BY keys here, not elements of HashSet. Let me
illustrate that with the following query, mapping articles into their comments.
If a certain article accidentally gets very popular, it has much more comments
than the others. My `skew` means such situation.
```
SELECT article_id, COLLECT_SET(comment) FROM comments GROUP BY article_id
```
The capacity of the internal hash table of
`MkArrayAggregationBuffer#container` will eventually grow as much as it can
retain all comments tied to the most skewed article so far. Also, the internal
hash table will never get smaller because resizing happens only when new
entries are added(precisely, this point depends on the implementation of JDK).
From the nature of hash tables, the duration of `HashSet#clear` relies on
the capacity of an internal hash table. It's an operation to fill all cells
with NULLs.
Because of the two points, GroupByOperator suddenly slows down once it
processes a skewed key. For example, assuming the first `article_id=1` has
1,000,000 comments, `GenericUDAFMkCollectionEvaluator#reset` has to fill a very
big hash table with many NULLs every time even if all following
articles(`article_id=2`, `article_id=3`, ...) have 0 comments.
Issue Time Tracking
-------------------
Worklog Id: (was: 763448)
Time Spent: 40m (was: 0.5h)
> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> -----------------------------------------------------------------------
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.3.8, 3.1.3
> Reporter: okumin
> Assignee: okumin
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> I observed some reducers spend 98% of CPU time in invoking
> `java.util.HashMap#clear`.
> Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its
> `clear` can be quite heavy when a relation has a small number of highly
> skewed keys.
>
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '00000000-0000-0000-0000-000000000000' AS key, CAST(UUID() AS VARCHAR)
> AS value
> FROM table_with_many_rows
> LIMIT 100000;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 5000000;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)