I am reposting a message I sent to the Hive developer group on Friday in order to broaden the audience. I hope you don't mind the duplicate:
Hello Hive users,

I study informatics at TU Berlin. As part of my master's thesis, I conducted a set of experiments on Hive (version 0.7.0) a couple of months ago. Unfortunately, I don't have access to some critical compile-time data, so I was hoping you could help me make sense of some numbers, either by filling in the gaps or by pointing me to sources that describe how the Hive optimizer works.

For all generated datasets I used the plain TEXTFILE storage format without table partitioning (that is, my data was stored in HDFS as large CSV-like files). I defined a set of minimalistic queries (up to 4 logical operators working on 1-2 input relations), organized into distinct subsets depending on the nature of the performed operations (e.g. embarrassingly parallel queries, parallel aggregation queries, parallel joins, etc.). For each query, I ran experiments with an increasing selectivity factor to see how the system reacts to a growing amount of processed data.

I will try to illustrate my questions with short abstract examples. Consider the following general query with a range-based filter and an additive aggregation function:

  SELECT x, AVG(y) FROM A WHERE z < {alpha} GROUP BY x;

The group key column x has a relatively small number of distinct values, i.e. there are few reduce groups with a high number of records in each of them. I used increasing {alpha} values to vary the selectivity of A from 0.4 to 1.0, but I don't see a corresponding increase in the number of output records emitted by the mappers.
*Since the value of the "COMBINE_OUTPUT_RECORDS" counter is zero for all jobs, I assume that the combine logic is somehow pushed inside the Hive sub-plan executed by the mapper - is that correct?*

I make similar observations if I substitute the additive aggregation function (AVG) with a built-in "holistic" one (PERCENTILE):

  set hive.map.aggr.hash.percentmemory=0.05;
  set mapred.child.java.opts=-Xmx24576m;

  SELECT x, PERCENTILE(y, array(0.25, 0.50, 0.75)) FROM A WHERE z < {alpha} GROUP BY x;

Again, the log stats suggest that there is no Hadoop combiner (which I can understand in this case), yet the number of map output records is the same as in the previous case. This suggests that the aggregated y-values can somehow (maybe because there are only 10 distinct groups) be compacted into a single record per group inside the mapper - I assume this is done by the hash-table optimization. *Can someone briefly explain when and how exactly the hash optimization in the PERCENTILE function works?*

I'll be very grateful for any sort of help!

Regards,
Alexander Alexandrov