[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=780865&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-780865 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 13/Jun/22 15:26
Start Date: 13/Jun/22 15:26
Worklog Time Spent: 10m

Work Description: kgyrtkirk merged PR #3253:
URL: https://github.com/apache/hive/pull/3253

Issue Time Tracking
---
Worklog Id: (was: 780865)
Time Spent: 1.5h (was: 1h 20m)

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.3.8, 3.1.3
> Reporter: okumin
> Assignee: okumin
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> I observed some reducers spending 98% of their CPU time invoking
> `java.util.HashMap#clear`.
> Looking at the details, I found that COLLECT_SET reuses a LinkedHashSet, and its
> `clear` can be quite heavy when a relation has a small number of highly
> skewed keys.
>
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=767789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-767789 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 09/May/22 05:38
Start Date: 09/May/22 05:38
Worklog Time Spent: 10m

Work Description: okumin commented on PR #3253:
URL: https://github.com/apache/hive/pull/3253#issuecomment-1120663054

CI failed, but it does not appear to be caused by this PR.
```
[2022-05-08T14:33:40.267Z] [ERROR] Failures:
[2022-05-08T14:33:40.267Z] [ERROR]   TestRpc.testServerPort:234 Port should match configured one:22 expected:<32951> but was:<22>
```

Issue Time Tracking
---
Worklog Id: (was: 767789)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=765823&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-765823 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 04/May/22 05:01
Start Date: 04/May/22 05:01
Worklog Time Spent: 10m

Work Description: dengzhhu653 commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r864458003

## ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
```
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
       throw new RuntimeException("Buffer type unknown");
     }
   }
+
+  private void reset() {
+    if (bufferType == BufferType.LIST) {
+      container.clear();
```

Review Comment:
Got it, thanks for the explanation!

Issue Time Tracking
---
Worklog Id: (was: 765823)
Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=765818&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-765818 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 04/May/22 04:18
Start Date: 04/May/22 04:18
Worklog Time Spent: 10m

Work Description: okumin commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r864445686

## ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
```
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
       throw new RuntimeException("Buffer type unknown");
     }
   }
+
+  private void reset() {
+    if (bufferType == BufferType.LIST) {
+      container.clear();
```

Review Comment:
@dengzhhu653 The short answer is no. `ArrayList#clear` takes O({the number of elements, not the capacity}). In fact, I could not reproduce the same issue with `COLLECT_LIST`, at least in our environment.

This is dummy code to illustrate the behavior of `ArrayList#clear`. Even if the length of `elementsContainer` is 1 million, it only nulls out the first `ArrayList#size` slots.
```
// elementsContainer is an Object[] whose length grows based on the maximum ArrayList#size so far
// Note that this.size is ArrayList#size, not the length of elementsContainer
for (int i = 0; i < this.size; i++) {
  elementsContainer[i] = null;
}
```
In the case of HashMap, the complexity depends on the capacity. That's because `clear` cannot know which indexes hold values, due to the nature of hash tables.
```
// table is a Node[] whose length grows based on the maximum HashMap#size so far
// Note that table.length is not HashMap#size
for (int i = 0; i < table.length; i++) {
  table[i] = null;
}
```

Issue Time Tracking
---
Worklog Id: (was: 765818)
Time Spent: 1h (was: 50m)
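The asymmetry described above can be sketched with a small stand-alone benchmark. This is illustrative only, not Hive code; the class name `ClearCostDemo` and all variables are invented. After growing and clearing each container once, repeated `clear()` calls stay cheap for `ArrayList` (proportional to the current size) but remain expensive for `HashSet` (proportional to the capacity inflated by the largest fill so far).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClearCostDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        // Grow both containers once, simulating a single huge group.
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
            set.add(i);
        }
        list.clear(); // nulls out 1M slots once; later clears touch only 'size' slots
        set.clear();  // the internal table keeps its ~2M-bucket capacity

        long t = System.nanoTime();
        for (int round = 0; round < 200; round++) {
            list.add(round);
            list.clear(); // cheap: only one element to null out
        }
        System.out.printf("ArrayList: %d ms%n", (System.nanoTime() - t) / 1_000_000);

        t = System.nanoTime();
        for (int round = 0; round < 200; round++) {
            set.add(round);
            set.clear(); // expensive: sweeps the whole inflated table
        }
        System.out.printf("HashSet:   %d ms%n", (System.nanoTime() - t) / 1_000_000);
    }
}
```

On a typical JVM the `HashSet` loop is orders of magnitude slower, even though both containers hold at most one element per round.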
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=765792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-765792 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 04/May/22 01:29
Start Date: 04/May/22 01:29
Worklog Time Spent: 10m

Work Description: dengzhhu653 commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r864405523

## ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
```
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
       throw new RuntimeException("Buffer type unknown");
     }
   }
+
+  private void reset() {
+    if (bufferType == BufferType.LIST) {
+      container.clear();
```

Review Comment:
If I understand it correctly, the `clear()` of ArrayList should have the same problem, right?

Issue Time Tracking
---
Worklog Id: (was: 765792)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763448 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 28/Apr/22 11:44
Start Date: 28/Apr/22 11:44
Worklog Time Spent: 10m

Work Description: okumin commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r860788021

## ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
```
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
       throw new RuntimeException("Buffer type unknown");
     }
   }
+
+  private void reset() {
+    if (bufferType == BufferType.LIST) {
+      container.clear();
+    } else if (bufferType == BufferType.SET) {
+      // Don't reuse a container because HashSet#clear can be very slow. The operation takes O(N)
```

Review Comment:
@kgyrtkirk Thanks for your quick review! I meant skew of the GROUP BY keys here, not skew of the elements of the HashSet. Let me illustrate that with the following query, which maps articles to their comments. If a certain article accidentally becomes very popular, it gets many more comments than the others. That is the situation I mean by `skew`.
```
SELECT article_id, COLLECT_SET(comment) FROM comments GROUP BY article_id
```
The capacity of the internal hash table of `MkArrayAggregationBuffer#container` eventually grows large enough to retain all comments tied to the most skewed article seen so far. Also, the internal hash table never gets smaller, because resizing happens only when new entries are added (precisely speaking, this depends on the JDK implementation). By the nature of hash tables, the duration of `HashSet#clear` depends on the capacity of the internal hash table: it is an operation that fills all cells with NULLs. Because of these two points, GroupByOperator suddenly slows down once it processes a skewed key.

For example, assuming the first `article_id=1` has 1,000,000 comments, `GenericUDAFMkCollectionEvaluator#reset` has to fill a very big hash table with NULLs every time, even if all following articles (`article_id=2`, `article_id=3`, ...) have 0 comments.

Issue Time Tracking
---
Worklog Id: (was: 763448)
Time Spent: 40m (was: 0.5h)
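The scenario above, where one huge group inflates a reused buffer that every later group then pays for, can be simulated outside Hive with a short sketch. The class name `SkewedGroupsDemo` and the article/comment strings are invented for illustration; the "fresh buffer per group" branch models the alternative reset strategy under discussion.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SkewedGroupsDemo {
    public static void main(String[] args) {
        Set<String> reused = new LinkedHashSet<>();

        // First group: the skewed article (article_id=1) inflates the
        // internal table of the reused buffer.
        for (int i = 0; i < 1_000_000; i++) {
            reused.add("comment-" + i);
        }
        reused.clear(); // the internal table keeps its inflated capacity

        // Following groups are tiny, but each clear() still sweeps the
        // full-capacity table.
        long t = System.nanoTime();
        for (int article = 2; article <= 200; article++) {
            reused.add("only-comment-of-" + article);
            reused.clear();
        }
        long reusedMs = (System.nanoTime() - t) / 1_000_000;

        // Alternative: a fresh buffer per group never pays for the skewed
        // group's capacity.
        t = System.nanoTime();
        for (int article = 2; article <= 200; article++) {
            Set<String> fresh = new LinkedHashSet<>();
            fresh.add("only-comment-of-" + article);
        }
        long freshMs = (System.nanoTime() - t) / 1_000_000;

        System.out.println("reused buffer: " + reusedMs + " ms");
        System.out.println("fresh buffers: " + freshMs + " ms");
    }
}
```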
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763422&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763422 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 28/Apr/22 10:52
Start Date: 28/Apr/22 10:52
Worklog Time Spent: 10m

Work Description: kgyrtkirk commented on code in PR #3253:
URL: https://github.com/apache/hive/pull/3253#discussion_r860749534

## ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java:
```
@@ -95,11 +95,27 @@ public MkArrayAggregationBuffer() {
       throw new RuntimeException("Buffer type unknown");
     }
   }
+
+  private void reset() {
+    if (bufferType == BufferType.LIST) {
+      container.clear();
+    } else if (bufferType == BufferType.SET) {
+      // Don't reuse a container because HashSet#clear can be very slow. The operation takes O(N)
```

Review Comment:
Why did the entries get skewed in the first place? Are we missing, or do we have an incorrect implementation of, some `hashCode()` method?
Could you please add a testcase which reproduces the issue? Maybe you could write a test against the UDF itself.

Issue Time Tracking
---
Worklog Id: (was: 763422)
Time Spent: 20m (was: 10m)
[jira] [Work logged] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
[ https://issues.apache.org/jira/browse/HIVE-26184?focusedWorklogId=763368&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763368 ]

ASF GitHub Bot logged work on HIVE-26184:
Author: ASF GitHub Bot
Created on: 28/Apr/22 08:39
Start Date: 28/Apr/22 08:39
Worklog Time Spent: 10m

Work Description: okumin opened a new pull request, #3253:
URL: https://github.com/apache/hive/pull/3253

### What changes were proposed in this pull request?
This reduces the time complexity of `COLLECT_SET` from `O({maximum length} * {num rows})` to `O({maximum length} + {num rows})`.
https://issues.apache.org/jira/browse/HIVE-26184

### Why are the changes needed?
I'm observing some reducers take a long time due to this issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I ran the reproduction case in HIVE-26184 with this patch and confirmed the reduce vertex finished more than 30x faster.

Issue Time Tracking
---
Worklog Id: (was: 763368)
Remaining Estimate: 0h
Time Spent: 10m
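Based on the `reset()` diff hunks quoted in the review threads above, the strategy of the fix can be sketched roughly as follows. This is a simplified stand-in, not the actual Hive class: the enum, constructor, and helper methods around `reset()` are invented scaffolding, and only the clear-vs-replace decision mirrors the quoted diff.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;

public class MkCollectionBufferSketch {
    enum BufferType { LIST, SET }

    private final BufferType bufferType;
    private Collection<Object> container;

    MkCollectionBufferSketch(BufferType bufferType) {
        this.bufferType = bufferType;
        this.container = bufferType == BufferType.LIST
                ? new ArrayList<>() : new LinkedHashSet<>();
    }

    void add(Object value) {
        container.add(value);
    }

    void reset() {
        if (bufferType == BufferType.LIST) {
            // ArrayList#clear only nulls the first 'size' slots, so reusing
            // the list is cheap.
        container.clear();
        } else {
            // Don't reuse the set: HashSet#clear sweeps the whole internal
            // table, whose capacity was inflated by the largest group so far.
            // Replacing the container lets the old table be garbage collected.
            container = new LinkedHashSet<>();
        }
    }

    int size() {
        return container.size();
    }
}
```

The design trade-off is allocation per group versus a potentially capacity-sized sweep per group; for sets, a fresh small container makes each reset's cost proportional to the current group rather than the largest group seen.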