Hi, I'm researching the different flavors of MPPs and had trouble understanding the shuffle differences between Apache Drill, Impala, and Hive (essentially M/R), and thought you guys might be able to help me a bit.
Suppose you are doing a classic SELECT ip, count(*) FROM ip_events GROUP BY ip query.

In Hive, each mapper task does a local group by (combiner) on a specific HDFS block (its input), and the resulting partial aggregates are copied from the mapper task to the reducer task's machine. Once the reducer task has the grouped data from all mappers, it performs its own group by to yield the final result.

In Apache Drill (here you need to correct me if I'm wrong), the data is read from the HDFS block on its data node and grouped by locally (spilled to disk if it can't fit in RAM); once that is complete, the records are transmitted (exchange operator) to the node doing the final group by. That node does not wait for the "mapper" nodes to wrap up, but starts aggregating as soon as it receives records (again spilling to disk if the data can't fit in RAM). With this approach, failure recovery when a "mapper" node goes down is impossible, since parts of its data have already been folded into the aggregate and can't be separated out, so re-running that "mapper" task would skew the results.

In Shark, in contrast, the "mapper" task's group by results are transferred from the mapper node to the reducer node but kept in memory (spilled to disk if needed), and only once ALL mapper tasks' data has arrived does the final reducer group by begin. This approach allows for failure recovery: if one mapper node drops dead mid-query, you can re-run its task on another mapper node and continue. (I tried to capture this difference in the little sketch at the bottom of this mail.)

So my questions are:
1. Is the above correct?
2. Does Impala work like Apache Drill as I described it above?

Thanks a lot!
Asaf
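
P.S. To make my mental model concrete, here is a tiny Python sketch of the two shuffle styles as I imagine them. Everything here (mapper_partial_groupby, blocking_reducer, streaming_reducer, the sample data) is made up purely for illustration and is obviously not the real code or API of Hive, Drill, Impala, or Shark:

from collections import Counter

# Toy sketch of: SELECT ip, count(*) FROM ip_events GROUP BY ip
# (illustration only -- not how any of these engines is actually implemented)

def mapper_partial_groupby(block):
    # local "combiner" aggregation over one HDFS block: ip -> partial count
    return Counter(block)

def blocking_reducer(partials):
    # Hive/Shark style as I understand it: all mapper partials are kept
    # separately until every mapper has finished; if one mapper dies, just
    # recompute its block and merge its partial before the final group by.
    final = Counter()
    for p in partials:
        final.update(p)
    return final

def streaming_reducer(partial_stream):
    # Drill style as I understand it: partials are folded into the running
    # aggregate as soon as they arrive, so a lost mapper's contribution can
    # no longer be separated out and safely re-run.
    final = Counter()
    for p in partial_stream:
        final.update(p)
    return final

blocks = [["1.1.1.1", "2.2.2.2", "1.1.1.1"], ["2.2.2.2", "3.3.3.3"]]
partials = [mapper_partial_groupby(b) for b in blocks]
print(blocking_reducer(partials))
# Counter({'1.1.1.1': 2, '2.2.2.2': 2, '3.3.3.3': 1})

Both reducers compute the same answer here; the difference I'm asking about is only when they consume the partials (all at once vs. as they arrive) and what that means for re-running a dead mapper mid-query.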
