[ https://issues.apache.org/jira/browse/HIVE-28548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shohei Okumiya updated HIVE-28548: ---------------------------------- Summary: Subdivide memory size allocated to parallel operators (was: Merged hash aggregations can consume memory more than assigned) > Subdivide memory size allocated to parallel operators > ----------------------------------------------------- > > Key: HIVE-28548 > URL: https://issues.apache.org/jira/browse/HIVE-28548 > Project: Hive > Issue Type: Sub-task > Components: Physical Optimizer > Affects Versions: 4.0.0 > Reporter: Shohei Okumiya > Assignee: Shohei Okumiya > Priority: Major > > SharedWorkOptimizer can merge operators which set up hash tables. Currently, > each operator assumes it can use \{container memory} * > \{hive.map.aggr.hash.percentmemory}. This assumption can cause OOM since > merged operators run in parallel. It is unrealistic to avoid the issue by > tuning `hive.map.aggr.hash.percentmemory` in this case because users can't > know how many operators are likely merged ahead of time. > > Tez has some mechanism to allocate container's memory to multiple edges. Hive > can also have something similar. > [https://github.com/apache/tez/blob/rel/release-0.10.4/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/resources/WeightedScalingMemoryDistributor.java] > > *How to reproduce* > Dummy data. The 10% of constants prevent optimization through > `hive.map.aggr.hash.min.reduction`. > {code:java} > INSERT INTO test_data SELECT IF(RAND() < 0.1, > '00000000-00000000-00000000-00000000', UUID()) AS key, UUID() AS value FROM > {large table} {code} > Query. You should add more joins if your container is large. > {code:java} > SELECT * > FROM (SELECT key, MAX(value) FROM test_data WHERE key != '1' GROUP BY key) t1 > LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '2' GROUP > BY key) t2 ON t1.key = t2.key > LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '3' GROUP > BY key) t3 ON t1.key = t3.key > LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '4' GROUP > BY key) t4 ON t1.key = t4.key > LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '5' GROUP > BY key) t5 ON t1.key = t5.key {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)