Hi team.

I found that when I enable hive.optimize.reducededuplication, a query that
combines count(distinct) with GROUP BY becomes much slower (400s vs. 180s).
Is there a known problem with reducededuplication?


test query info:
| CONFIG | SQL | TIME |
| hive.optimize.reducededuplication=true | select count(1) from (select uni_shop_id, partner, count(distinct uni_id) from default.b_std_trade_sampling group by uni_shop_id, partner) s1; | 400s |
| hive.optimize.reducededuplication=false | select count(1) from (select uni_shop_id, partner, count(distinct uni_id) from default.b_std_trade_sampling group by uni_shop_id, partner) s1; | 180s |
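
For reference, a sketch of how the two runs can be set up in a single session. The SET statements toggle the option listed in the CONFIG column. The EXPLAIN step is an addition for comparing the two plans side by side and was not part of the timings reported here.

```sql
-- Run 1: reduce deduplication enabled (the 400s case)
SET hive.optimize.reducededuplication=true;
EXPLAIN
select count(1) from (
  select uni_shop_id, partner, count(distinct uni_id)
  from default.b_std_trade_sampling
  group by uni_shop_id, partner
) s1;

-- Run 2: reduce deduplication disabled (the 180s case)
SET hive.optimize.reducededuplication=false;
EXPLAIN
select count(1) from (
  select uni_shop_id, partner, count(distinct uni_id)
  from default.b_std_trade_sampling
  group by uni_shop_id, partner
) s1;
```

Dropping the EXPLAIN keyword runs the query itself, so the same script can be used to reproduce the timings.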


table basic info:
| QUERY | RESULT |
| select count(1) from default.b_std_trade_sampling | 9774285968 |
| select count(distinct uni_id) from default.b_std_trade_sampling | 5367720404 |
| select count(distinct partner), count(distinct uni_shop_id) from default.b_std_trade_sampling | 50, 13000 |

I'd be grateful if someone could guide me.

