I am trying to understand which options and settings are available for tuning 
the performance of Hive queries. I have seen the benefits of map-side joins 
and partitioning/clustering. However, I have yet to see the impact that 
map-side aggregation has on query performance. I ran the query below with and 
without map-side aggregation turned on and did not see much difference in the 
execution times. The raw data in this partition is about 5.5 million. I am 
looking for pointers on what types of queries benefit from map-side aggregation.


set hive.auto.convert.join=false;


set hive.map.aggr=false;

Non-partitioned, non-clustered single table with a where clause on date and no 
map-side aggregation

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

400 secs


set hive.map.aggr=true;

Non-partitioned, non-clustered single table with a where clause on date and 
map-side aggregation

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

390 secs
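
One way to check whether the setting changes anything for this query would be 
to compare the two query plans with EXPLAIN. The following is only a sketch 
against the same table and query as above. With hive.map.aggr=true I would 
expect the map stage to show a Group By Operator doing partial (hash) 
aggregation, but that expectation and the exact plan text are assumptions that 
depend on the Hive version.

```sql
-- Sketch: compare plans with and without map-side aggregation.
-- Same table and query as in the timings above; nothing is executed,
-- EXPLAIN only prints the plan.

set hive.map.aggr=false;
explain
select a11.emp_id, count(1), count(distinct a11.customer_id),
       sum(a11.qty_sold)
from orderdetailrcfile a11
where order_date = '01-01-2008'
group by a11.emp_id;

set hive.map.aggr=true;
explain
select a11.emp_id, count(1), count(distinct a11.customer_id),
       sum(a11.qty_sold)
from orderdetailrcfile a11
where order_date = '01-01-2008'
group by a11.emp_id;
```

If the two plans differ only slightly, that would at least be consistent with 
the near-identical timings.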


Also, is there any reason not to turn on map-side joins all the time? In my 
tests I have always seen the performance either stay the same or improve with 
map-side joins turned on. Are there any other parameters or Hive features that 
can help improve the performance of Hive queries?
Thanks
Anand
