Gopal V created HIVE-14708:
------------------------------
Summary: Optimizer: NOT IN query scans one input two times
Key: HIVE-14708
URL: https://issues.apache.org/jira/browse/HIVE-14708
Project: Hive
Issue Type: Bug
Components: Logical Optimizer
Affects Versions: 2.2.0
Reporter: Gopal V
Priority: Critical
{code}
hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales
where ss_sold_date_sk NOT in (select d_date_sk from date_dim);
Stage-1
Reducer 2 vectorized, llap
File Output Operator [FS_52]
Group By Operator [GBY_51] (rows=1 width=8)
Output:["_col0"],aggregations:["count(VALUE._col0)"]
<-Map 1 [SIMPLE_EDGE] vectorized, llap
SHUFFLE [RS_50]
Group By Operator [GBY_49] (rows=1 width=8)
Output:["_col0"],aggregations:["count(1)"]
Select Operator [SEL_48] (rows=1 width=4)
Filter Operator [FIL_47] (rows=1 width=4)
predicate:_col2 is null
Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
Conds:MAPJOIN_45._col0=RS_43._col0(Left
Outer),Output:["_col2"]
<-Map 5 [BROADCAST_EDGE] vectorized, llap
BROADCAST [RS_43]
PartitionCols:_col0
Select Operator [SEL_42] (rows=73049 width=4)
Output:["_col0"]
TableScan [TS_11] (rows=73049 width=4)
tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
<-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
Conds:(Inner),Output:["_col0"]
<-Reducer 4 [BROADCAST_EDGE] vectorized, llap
BROADCAST [RS_41]
Select Operator [SEL_40] (rows=1 width=8)
Filter Operator [FIL_39] (rows=1 width=8)
predicate:(_col0 = 0)
Group By Operator [GBY_38] (rows=1 width=8)
Output:["_col0"],aggregations:["count(VALUE._col0)"]
<-Map 3 [SIMPLE_EDGE] vectorized, llap
SHUFFLE [RS_37]
Group By Operator [GBY_36] (rows=1 width=8)
Output:["_col0"],aggregations:["count()"]
Select Operator [SEL_35] (rows=1 width=4)
Filter Operator [FIL_34] (rows=1 width=4)
predicate:d_date_sk is null
TableScan [TS_2] (rows=73049 width=4)
tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
<-Select Operator [SEL_44] (rows=2879987999 width=4)
Output:["_col0"]
TableScan [TS_0] (rows=2879987999 width=92)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE
{code}
The 2nd scan is merely to count the number of NULLs and has
{{predicate:d_date_sk is null}} in the operator.
The NULL checks can be done inline with the NOT-NULL codepath instead of
producing 2 independent full-scans of the date_dim table.
This is not significant in a scenario like the above where the small table side
is an actual HDFS table, but entirely throttles performance when the small side
is actually an expensive aggregate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)