Question on correlation optimizer

2013-12-10 Thread Avrilia Floratou
Hi, I'm running TPC-H query 21 on Hive 0.12 and have enabled hive.optimize.correlation. I could see the effect of the correlation optimizer on query 17, but when running query 21 I don't see the optimizer being used. I used the publicly available TPC-H queries for Hive and merged all the

Re: Question on correlation optimizer

2013-12-10 Thread Yin Huai
Hi Avrilia, This is caused by the distinct aggregations in TPC-H Q21. Because Hive adds the distinct columns to the key columns of the ReduceSinkOperators, and the correlation optimizer currently only checks for exactly matching key columns, this query will not be optimized. The JIRA for this issue is
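
To illustrate the point about exact key matching, here is a toy sketch in Python. It is not Hive's code: the function names and the column names (`l_orderkey`, `l_suppkey`) are assumptions chosen for illustration only. It models just the one idea described above, that appending a distinct column to a ReduceSinkOperator's key columns makes those keys differ from the keys of a neighbouring operator, so an exact-match check fails.

```python
def reduce_sink_keys(group_by_cols, distinct_cols=()):
    """Toy model: key columns of a ReduceSinkOperator.

    Following the explanation above, columns used in distinct
    aggregations are appended to the key columns.
    """
    return list(group_by_cols) + list(distinct_cols)


def keys_match_exactly(keys_a, keys_b):
    """Toy model of an exact-match check on key columns."""
    return list(keys_a) == list(keys_b)


# Hypothetical key columns, for illustration only.
join_keys = ["l_orderkey"]
plain_agg = reduce_sink_keys(["l_orderkey"])
distinct_agg = reduce_sink_keys(["l_orderkey"], ["l_suppkey"])

print(keys_match_exactly(plain_agg, join_keys))     # True
print(keys_match_exactly(distinct_agg, join_keys))  # False
```

In this simplified picture, the aggregation without a distinct column shares its keys with the join and could be merged with it, while the one with a distinct column could not.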

Re: Question on correlation optimizer

2013-12-10 Thread Avrilia Floratou
Hi Yin, Thanks for the detailed explanation. I have one more question about the correlation optimizer. When I ran EXPLAIN on query 17, I got the plan for stage 1, where the bulk of the time goes. I can understand what is happening in the map phase, but the reduce phase confuses me when the optimizer