[
https://issues.apache.org/jira/browse/IMPALA-14757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-14757:
-------------------------------------
Labels: ramp-up (was: )
> Analytic functions' mem usage can be underestimated
> ---------------------------------------------------
>
> Key: IMPALA-14757
> URL: https://issues.apache.org/jira/browse/IMPALA-14757
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: ramp-up
>
> set num_nodes=1;
> with s as (select l_shipdate, l_orderkey, max(l_orderkey) over() maxkey
> from tpch_parquet.lineitem) select * from s where maxkey = l_orderkey;
> summary:
> {code}
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> | Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> | F00:ROOT     | 1      | 1     | 22.67us  | 22.67us  |       |            | 4.02 MB   | 21.75 MB      |                       |
> | 02:SELECT    | 1      | 1     | 24.14ms  | 24.14ms  | 2     | 600.12K    | 24.00 KB  | 0 B           |                       |
> | 01:ANALYTIC  | 1      | 1     | 723.74ms | 723.74ms | 6.00M | 6.00M      | 178.09 MB | 4.00 MB       |                       |
> | 00:SCAN HDFS | 1      | 1     | 10.94ms  | 10.94ms  | 6.00M | 6.00M      | 29.22 MB  | 160.00 MB     | tpch_parquet.lineitem |
> +--------------+--------+-------+----------+----------+-------+------------+-----------+---------------+-----------------------+
> {code}
> The analytic node consumed 178.09 MB at peak vs. the estimated 4.00 MB.
> Note that the analytic node can spill, so if it actually hit the mem_limit
> it would start spilling instead of increasing its memory usage further.
> Another issue is that the SELECT node's output cardinality is heavily
> overestimated (2 actual rows vs. 600.12K estimated): the planner should
> realize that maxkey has a single value for all rows and estimate the
> selectivity based on the NDV of the column.
> Note that this query would be more memory efficient if it were rewritten to
> use a scalar subquery to get the max (at the cost of reading the table
> twice) or to use ORDER BY.
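> A minimal sketch of the scalar-subquery rewrite, for illustration only. It
> has not been run against this dataset, and the exact form is an assumption
> rather than a tested fix; the output column maxkey is kept so the result
> shape matches the original query:
> {code}
> -- Assumed rewrite: compute the max once in an uncorrelated scalar subquery
> -- instead of materializing it per row with max(...) over().
> select l_shipdate, l_orderkey, l_orderkey as maxkey
> from tpch_parquet.lineitem
> where l_orderkey = (select max(l_orderkey) from tpch_parquet.lineitem);
> {code}
> This reads tpch_parquet.lineitem twice, but it should avoid buffering all
> 6.00M rows in the analytic node.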