[jira] [Updated] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

Quanlong Huang (Jira) Wed, 15 May 2024 17:48:04 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-13086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Quanlong Huang updated IMPALA-13086:
------------------------------------
    Attachment: plan.txt

> Cardinality estimate of AggregationNode should consider predicates on 
> group-by columns
> --------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13086
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13086
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Quanlong Huang
>            Priority: Critical
>         Attachments: plan.txt
>
>
> Consider the following tables:
> {code:sql}
> CREATE EXTERNAL TABLE t1(
>   t1_id bigint,
>   t5_id bigint,
>   t5_name string,
>   register_date string
> ) stored as textfile;
> CREATE EXTERNAL TABLE t2(
>   t1_id bigint,
>   t3_id bigint,
>   pay_time timestamp,
>   refund_time timestamp,
>   state_code int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t3(
>   t3_id bigint,
>   t3_name string,
>   class_id int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t5( 
>   id bigint,
>   t5_id bigint,
>   t5_name string,
>   branch_id bigint,
>   branch_name string
> ) stored as textfile;
> alter table t1 set tblproperties('numRows'='6031170829');
> alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
> alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
> alter table t1 set column stats t5_name 
> ('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
> alter table t1 set column stats register_date 
> ('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');
> alter table t2 set tblproperties('numRows'='864341085');
> alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
> alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
> alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
> alter table t2 set column stats refund_time 
> ('numDVs'='251658','numNulls'='791645118');
> alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');
> alter table t3 set tblproperties('numRows'='4452');
> alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
> alter table t3 set column stats t3_name 
> ('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
> alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');
> alter table t5 set tblproperties('numRows'='2177245');
> alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
> alter table t5 set column stats t5_name 
> ('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
> alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
> alter table t5 set column stats branch_name 
> ('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
> {code}
> Put a data file to each table to make the stats valid
> {code:bash}
> echo '2024' > data.txt
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
> {code}
> REFRESH these tables after adding the data files.
> The cardinality of AggregationNodes are overestimated in the following query:
> {code:sql}
> explain select 
>   register_date,
>   t4.t5_id, 
>   t5.t5_name,
>   t5.branch_name,
>   count(distinct t1_id),
>   count(distinct case when diff_day=0 then t1_id else null end ),
>   count(distinct case when diff_day<=3 then t1_id else null end ),
>   count(distinct case when diff_day<=7 then t1_id else null end ),
>   count(distinct case when diff_day<=14 then t1_id else null end ),
>   count(distinct case when diff_day<=30 then t1_id else null end ),
>   count(distinct case when diff_day<=60 then t1_id else null end ),
>   count(distinct case when pay_time is not null then t1_id else null end )
> from (
>   select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
>     datediff(pay_time,register_date) diff_day
>   from (
>     select t1_id,pay_time,t3_id from t2
>     where state_code = 0 and pay_time>=trunc(NOW(),'Y')
>       and cast(pay_time as date) <> cast(refund_time as date)
>   )t2
>   join t3 on t2.t3_id=t3.t3_id
>   right join t1 on t1.t1_id=t2.t1_id
> )t4
> left join t5 on t4.t5_id=t5.t5_id
> where register_date='20230515'
> group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
> One of the AggregationNode:
> {noformat}
> 17:AGGREGATE [FINALIZE]
> |  Class 0
> |    output: count:merge(t1_id)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 1
> |    output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 2
> |    output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 3
> |    output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 4
> |    output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 5
> |    output: count:merge(CASE WHEN diff_day <= 30 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 6
> |    output: count:merge(CASE WHEN diff_day <= 60 THEN t1_id ELSE NULL END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 7
> |    output: count:merge(CASE WHEN pay_time IS NOT NULL THEN t1_id ELSE NULL 
> END)
> |    group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  mem-estimate=73.07TB mem-reservation=272.00MB thread-reservation=0
> |  tuple-ids=7N,9N,11N,13N,15N,17N,19N,21N row-size=761B cardinality=830.98G
> |  in pipelines: 17(GETNEXT), 15(OPEN){noformat}
> The cardinality is 830.98B which is the multiplied NDVs of the group by 
> columns and times the 8 aggregation classes: 9283 * 389 * 523 * 55 * 8. 
> (Thank [~boroknagyz] for digging into this!)
> This doesn't respect the predicates on the group by columns. Predicate 
> register_date='20230515' should reduce the NDV of 'register_date' to 1.
> CC [~rizaon]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Updated] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

Reply via email to