[jira] [Created] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

Vineet Garg (JIRA) Sun, 30 Sep 2018 16:59:07 -0700

Vineet Garg created HIVE-20660:
----------------------------------

             Summary: Group by statistics estimation could be improved by 
bounding the total number of rows to source table
                 Key: HIVE-20660
                 URL: https://issues.apache.org/jira/browse/HIVE-20660
             Project: Hive
          Issue Type: Improvement
          Components: Statistics
    Affects Versions: 4.0.0
            Reporter: Vineet Garg
            Assignee: Vineet Garg



Currently the stats for group by is estimated by taking product of NDVs of all 
the keys and bounding it by the number of rows of its input. This bound could 
be improved by using the source table instead of immediate input, the insight 
in this case is that cardinality/ndvs of a table can not go beyond the original 
(outer joins will only add NULLs thereby increasing the cardinality by 1).

Note that the assumption here is that group by keys all belong to the same 
source table/input.
This will improve the estimation in situations where group by is executed after 
joins wherein Hive could end up estimating the number of rows.

*Reproducer*
{code:sql}
set hive.stats.fetch.column.stats=true;

create table t1(i int, j int);
alter table t1 update statistics set('numRows'='10000', 'rawDataSize'='18000');
alter table t1 update statistics for column i 
set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
alter table t1 update statistics for column j 
set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');

create table t2(i2 int, j2 int);
alter table t2 update statistics set('numRows'='100000000', 
'rawDataSize'='10000');
alter table t2 update statistics for column i2 
set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0');
alter table t2 update statistics for column j2 
set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');

explain select count (1) from t1,t2
        where t1.j=t2.i2 
group by t1.i, t1.j;
{code}


{code:sql}
Reducer 2
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col1 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
COMPLETE Column stats: COMPLETE
                Group By Operator
                  aggregations: count()
                  keys: _col0 (type: int), _col1 (type: int)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
COMPLETE Column stats: COMPLETE <==========
                  Reduce Output Operator
                    key expressions: _col0 (type: int), _col1 (type: int)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: int), _col1 
(type: int)
                    Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
COMPLETE Column stats: COMPLETE
                    value expressions: _col2 (type: bigint)
  .....................
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

Reply via email to