Igor Dvorzhak created HIVE-22894:
------------------------------------
Summary: Filter on subquery with GROUP BY returns wrong column
Key: HIVE-22894
URL: https://issues.apache.org/jira/browse/HIVE-22894
Project: Hive
Issue Type: Bug
Components: CBO
Affects Versions: 2.3.6
Reporter: Igor Dvorzhak
Reproduction steps:
{code:java}
$ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put -
/user/hive/warehouse/test/data.csv
$ hive
> CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW
> FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.877 seconds
> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM
> test HERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t
> WHERE t.category='C_A';
OK
C_A_B
C_A_C
Time taken: 9.108 seconds, Fetched: 2 row(s)
> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory
> FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory)
> AS t WHERE t.category='C_A';
OK
Plan optimized by CBO.Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_12]
Group By Operator [GBY_10] (rows=1 width=38)
Output:["_col0"],keys:_col0
Group By Operator [GBY_5] (rows=1 width=38)
Output:["_col0"],keys:KEY._col0
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:_col0
Group By Operator [GBY_3] (rows=1 width=38)
Output:["_col0"],keys:subcategory
Select Operator [SEL_2] (rows=1 width=38)
Output:["subcategory"]
Filter Operator [FIL_13] (rows=1 width=38)
predicate:((date_str = '02/11/20') and (category = 'C_A'))
TableScan [TS_0] (rows=1 width=38)
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.21 seconds, Fetched: 27 row(s)
{code}
It works as expected with disabled CBO:
{code:java}
> SET hive.cbo.enable=false;
> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM
> test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t
> WHERE t.category='C_A';
OK
C_A
Time taken: 13.948 seconds, Fetched: 1 row(s)
> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory
> FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory)
> AS t WHERE t.category='C_A';
OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3
File Output Operator [FS_13]
Select Operator [SEL_12] (rows=1 width=38)
Output:["_col0"]
Group By Operator [GBY_11] (rows=1 width=38)
Output:["_col0"],keys:'C_A'
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_10]
PartitionCols:'C_A'
Group By Operator [GBY_9] (rows=1 width=38)
Output:["_col0"],keys:'C_A'
Select Operator [SEL_6] (rows=1 width=38)
Group By Operator [GBY_5] (rows=1 width=38)
Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A',
KEY._col2
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_4]
PartitionCols:'02/11/20', 'C_A', _col2
Group By Operator [GBY_3] (rows=1 width=38)
Output:["_col0","_col1","_col2"],keys:'02/11/20',
'C_A', subcategory
Select Operator [SEL_2] (rows=1 width=38)
Output:["subcategory"]
Filter Operator [FIL_14] (rows=1 width=38)
predicate:((date_str = '02/11/20') and (category =
'C_A'))
TableScan [TS_0] (rows=1 width=38)
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.065 seconds, Fetched: 34 row(s){code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)