[ https://issues.apache.org/jira/browse/IMPALA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong reassigned IMPALA-2945:
-------------------------------------

    Assignee: Tim Armstrong

> Pre-aggregation cardinality estimates do not take into account data distribution
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-2945
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2945
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.0, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 2.12.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: planner, resource-management
>
> When computing the per-host memory estimate for local aggregations, the planner does not take into account that data is randomly distributed across nodes, leading to significant underestimation in some cases. The suggested fix is to use min(agg input cardinality, NDV * #hosts) as the per-node cardinality used for the per-node memory estimate.
> Impact: In the query below, the planner significantly underestimates the per-node memory of agg node 03 as 3.28 GB, but the actual peak is 24.77 GB.
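The suggested formula can be sketched as follows. This is only an illustration of the estimate described above, not the planner's actual (Java) code; the plugged-in numbers are the lineitem scan cardinality and NDV(l_partkey) taken from the plan below.

```python
def preagg_output_cardinality(input_cardinality: int, ndv: int, num_hosts: int) -> int:
    """Suggested estimate from this issue: with rows randomly distributed
    across hosts (not partitioned on the group-by key), every host can see
    up to `ndv` distinct groups, so the combined pre-aggregation output is
    capped by ndv * num_hosts rather than by ndv alone."""
    return min(input_cardinality, ndv * num_hosts)

# Values from the plan below: 02:SCAN cardinality and NDV(l_partkey), 20 hosts.
estimate = preagg_output_cardinality(5_999_989_709, 200_052_064, 20)
print(estimate)  # 4001041280 -- vs. the planner's NDV-only estimate of 200052064
```

With these inputs the cap of roughly 4.0B rows is much closer to the 3.11B rows actually produced by node 03 in the summary than the planner's 200.05M estimate.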
> Query
> {code}
> select sum(l_extendedprice) / 7.0 as avg_yearly
> from
>     lineitem,
>     part
> where
>     p_partkey = l_partkey
>     and p_brand = 'Brand#23'
>     and p_container = 'MED BOX'
>     and l_quantity < (
>         select
>             0.2 * avg(l_quantity)
>         from
>             lineitem
>         where
>             l_partkey = p_partkey
>     )
> {code}
> Plan
> {code}
> 12:AGGREGATE [FINALIZE]
> |  output: sum:merge(l_extendedprice)
> |  hosts=20 per-host-mem=unavailable
> |  tuple-ids=6 row-size=16B cardinality=1
> |
> 11:EXCHANGE [UNPARTITIONED]
> |  hosts=20 per-host-mem=unavailable
> |  tuple-ids=6 row-size=16B cardinality=1
> |
> 06:AGGREGATE
> |  output: sum(l_extendedprice)
> |  hosts=20 per-host-mem=10.00MB
> |  tuple-ids=6 row-size=16B cardinality=1
> |
> 05:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> |  hash predicates: l_partkey = p_partkey
> |  other join predicates: l_quantity < 0.2 * avg(l_quantity)
> |  hosts=20 per-host-mem=125.18MB
> |  tuple-ids=0,1 row-size=80B cardinality=29992141
> |
> |--10:EXCHANGE [HASH(p_partkey)]
> |  |  hosts=20 per-host-mem=0B
> |  |  tuple-ids=0,1 row-size=80B cardinality=29992141
> |  |
> |  04:HASH JOIN [INNER JOIN, BROADCAST]
> |  |  hash predicates: l_partkey = p_partkey
> |  |  hosts=20 per-host-mem=58.30MB
> |  |  tuple-ids=0,1 row-size=80B cardinality=29992141
> |  |
> |  |--09:EXCHANGE [BROADCAST]
> |  |  |  hosts=20 per-host-mem=0B
> |  |  |  tuple-ids=1 row-size=56B cardinality=1000000
> |  |  |
> |  |  01:SCAN HDFS [tpch_1000_decimal_parquet.part, RANDOM]
> |  |     partitions=1/1 files=40 size=6.38GB
> |  |     predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
> |  |     table stats: 200000000 rows total
> |  |     column stats: all
> |  |     hosts=20 per-host-mem=264.00MB
> |  |     tuple-ids=1 row-size=56B cardinality=1000000
> |  |
> |  00:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
> |     partitions=1/1 files=880 size=216.61GB
> |     table stats: 5999989709 rows total
> |     column stats: all
> |     hosts=20 per-host-mem=264.00MB
> |     tuple-ids=0 row-size=24B cardinality=5999989709
> |
> 08:AGGREGATE [FINALIZE]
> |  output: avg:merge(l_quantity)
> |  group by: l_partkey
> |  hosts=20 per-host-mem=167.89MB
> |  tuple-ids=4 row-size=16B cardinality=200052064
> |
> 07:EXCHANGE [HASH(l_partkey)]
> |  hosts=20 per-host-mem=0B
> |  tuple-ids=3 row-size=16B cardinality=200052064
> |
> 03:AGGREGATE
> |  output: avg(l_quantity)
> |  group by: l_partkey
> |  hosts=20 per-host-mem=3.28GB
> |  tuple-ids=3 row-size=16B cardinality=200052064
> |
> 02:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
>    partitions=1/1 files=880 size=216.61GB
>    table stats: 5999989709 rows total
>    column stats: all
>    hosts=20 per-host-mem=176.00MB
>    tuple-ids=2 row-size=16B cardinality=5999989709
> {code}
> Summary
> ||Operator ||#Hosts ||Avg Time ||Max Time ||#Rows ||Est. #Rows ||Peak Mem ||Est. Peak Mem ||Detail ||
> |12:AGGREGATE |1 |256.620ms |256.620ms |1 |1 |92.00 KB |-1.00 B |FINALIZE |
> |11:EXCHANGE |1 |184.430us |184.430us |20 |1 |0 |-1.00 B |UNPARTITIONED |
> |06:AGGREGATE |20 |364.045ms |1s508ms |20 |1 |9.37 MB |10.00 MB | |
> |05:HASH JOIN |20 |279.175ms |304.600ms |523.09K |29.99M |155.04 MB |125.18 MB |RIGHT SEMI JOIN, PARTITIONED |
> |I--10:EXCHANGE |20 |22.448ms |32.954ms |5.98M |29.99M |0 |0 |HASH(p_partkey) |
> |I 04:HASH JOIN |20 |25s417ms |35s579ms |5.98M |29.99M |146.02 MB |58.30 MB |INNER JOIN, BROADCAST |
> |I I--09:EXCHANGE |20 |16.270ms |35.329ms |199.30K |1.00M |0 |0 |BROADCAST |
> |I I 01:SCAN HDFS |20 |218.505ms |331.299ms |199.30K |1.00M |173.43 MB |264.00 MB |tpch_1000_decimal_parquet.part |
> |I 00:SCAN HDFS |20 |1s365ms |1s822ms |6.00B |6.00B |1.92 GB |264.00 MB |tpch_1000_decimal_parquet.l... |
> |08:AGGREGATE |20 |29s706ms |35s917ms |200.00M |200.05M |1.64 GB |167.89 MB |FINALIZE |
> |07:EXCHANGE |20 |5s081ms |8s410ms |3.11B |200.05M |0 |0 |HASH(l_partkey) |
> |03:AGGREGATE |20 |4m10s |5m12s |3.11B |200.05M |24.77 GB |3.28 GB | |
> |02:SCAN HDFS |20 |1s544ms |2s517ms |6.00B |6.00B |838.85 MB |176.00 MB |tpch_1000_decimal_parquet.l... |
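The summary illustrates the core claim: node 03 was estimated to emit 200.05M rows (the NDV of l_partkey) but actually emitted 3.11B, because each of the 20 hosts independently sees almost every distinct l_partkey. A toy simulation, with made-up small numbers rather than the TPC-H scale above, shows the effect:

```python
import random

# Rows carrying `ndv` distinct group-by keys are scattered randomly over
# `num_hosts` hosts, as in a RANDOM scan. A uniform-split estimate would
# predict ndv / num_hosts distinct keys per host; in fact each host sees
# nearly all `ndv` keys, so the pre-aggregation barely reduces the data.
random.seed(42)
num_hosts, ndv, rows = 4, 1_000, 100_000

per_host_keys = [set() for _ in range(num_hosts)]
for _ in range(rows):
    key = random.randrange(ndv)         # group-by key, e.g. l_partkey
    host = random.randrange(num_hosts)  # random placement across hosts
    per_host_keys[host].add(key)

naive = ndv // num_hosts                # uniform-split per-host estimate: 250
print(naive, [len(s) for s in per_host_keys])  # each host sees ~1000 keys
```

With ~25,000 rows per host and only 1,000 keys, every host ends up holding essentially all 1,000 groups, which is why the per-node memory of a pre-aggregation tracks min(per-node input, NDV) rather than NDV / #hosts.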
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org