[ 
https://issues.apache.org/jira/browse/HIVE-28548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shohei Okumiya updated HIVE-28548:
----------------------------------
    Summary: Subdivide memory size allocated to parallel operators  (was: 
Merged hash aggregations can consume memory more than assigned)

> Subdivide memory size allocated to parallel operators
> -----------------------------------------------------
>
>                 Key: HIVE-28548
>                 URL: https://issues.apache.org/jira/browse/HIVE-28548
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Physical Optimizer
>    Affects Versions: 4.0.0
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>
> SharedWorkOptimizer can merge operators which set up hash tables. Currently, 
> each operator assumes it can use \{container memory} * 
> \{hive.map.aggr.hash.percentmemory}. This assumption can cause OOM since 
> merged operators run in parallel. It is unrealistic to avoid the issue by 
> tuning `hive.map.aggr.hash.percentmemory` in this case because users can't 
> know how many operators are likely merged ahead of time.
>  
> Tez has some mechanism to allocate container's memory to multiple edges. Hive 
> can also have something similar.
> [https://github.com/apache/tez/blob/rel/release-0.10.4/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/resources/WeightedScalingMemoryDistributor.java]
>  
> *How to reproduce*
> Dummy data. The 10% of constants prevent optimization through 
> `hive.map.aggr.hash.min.reduction`.
> {code:java}
> INSERT INTO test_data SELECT IF(RAND() < 0.1, 
> '00000000-00000000-00000000-00000000', UUID()) AS key, UUID() AS value FROM 
> {large table} {code}
> Query. You should add more joins if your container is large.
> {code:java}
> SELECT *
> FROM (SELECT key, MAX(value) FROM test_data WHERE key != '1' GROUP BY key) t1
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '2' GROUP 
> BY key) t2 ON t1.key = t2.key
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '3' GROUP 
> BY key) t3 ON t1.key = t3.key
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '4' GROUP 
> BY key) t4 ON t1.key = t4.key
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '5' GROUP 
> BY key) t5 ON t1.key = t5.key {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to