Hi team. I found that when I enable hive.optimize.reducededuplication, a query that combines count(distinct) with GROUP BY becomes much slower. Is there a problem with this optimization?
test query info:

| CONFIG | SQL | TIME |
| hive.optimize.reducededuplication=true | select count(1) from (select uni_shop_id, partner, count(distinct uni_id) from default.b_std_trade_sampling group by uni_shop_id, partner) s1; | 400s |
| hive.optimize.reducededuplication=false | select count(1) from (select uni_shop_id, partner, count(distinct uni_id) from default.b_std_trade_sampling group by uni_shop_id, partner) s1; | 180s |

table basic info:

| QUERY | RESULT |
| select count(1) from default.b_std_trade_sampling | 9774285968 |
| select count(distinct uni_id) from default.b_std_trade_sampling | 5367720404 |
| select count(distinct partner), count(distinct uni_shop_id) from default.b_std_trade_sampling | 50, 13000 |

I'd be grateful if someone could guide me.
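In case it helps with diagnosis, the two plans can be compared with EXPLAIN. This is only a sketch that reuses the query and table from the timings above. It shows how to produce the plans and makes no claim about what they contain.

```sql
-- Plan with the optimization enabled (the 400s case above)
SET hive.optimize.reducededuplication=true;
EXPLAIN
select count(1)
from (
  select uni_shop_id, partner, count(distinct uni_id)
  from default.b_std_trade_sampling
  group by uni_shop_id, partner
) s1;

-- Plan with the optimization disabled (the 180s case above)
SET hive.optimize.reducededuplication=false;
EXPLAIN
select count(1)
from (
  select uni_shop_id, partner, count(distinct uni_id)
  from default.b_std_trade_sampling
  group by uni_shop_id, partner
) s1;
```

Comparing the number of stages and the reduce-side key and partition columns in the two outputs should show whether the optimization merged reduce stages for this query.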