I often tune mapred.map.tasks and mapred.reduce.tasks per Hive query. For example:

    set mapred.map.tasks=31;
    set mapred.reduce.tasks=11;
    FROM Pageviews2 p join client_ip c on c.id = p.clientip_id
    insert overwrite directory '/user/ecapriolo/hivetest1'
    SELECT ip, count(1)
    WHERE (date_id = 6)
      AND (p.sitename_id=5 OR p.sitename_id=9 OR p.sitename_id=13 OR p.sitename_id=17)
    GROUP BY ip;

This query operates on 4 to 5 GB of data. The run time is very impressive, and the query completes in 3 M/R jobs.

    Time taken: 91.64 seconds.

I am mostly clueless about how the optimizer/query planner works, but I would think that being able to set mapred.map.tasks and mapred.reduce.tasks as a hint for each MapReduce phase would really improve performance. For example, this section of the query

    (p.sitename_id=5 OR p.sitename_id=9 OR p.sitename_id=13 OR p.sitename_id=17)

will prune the result set greatly, so subsequent phases may not need as many mappers or reducers as the first phase.

I have not looked at reformulating the query, which may also help with optimization, but is there a place for setting variables per phase?

Edward
