> These looks pretty impressive. What execution mode were you running
> these? Yarn client may be?

There is no other mode - everything runs on YARN.

> 53 times

The factor is actually bigger if you look at execution alone.

The MRv2 version takes 2.47s to prep a query, while the LLAP version takes 1.64s.

The MRv2 version takes 200.319s to execute the query, while the LLAP version takes 1.02s.

The execution factor is close to 200x, but the compile time becomes significant as you scale down the latencies.

> My calculations on Hive 2 on Spark 1.3.1

Not sure where Hive2-on-Spark is going - the last commit to SparkCompiler was late last year, before there was a Hive2.

On the speed front, I'm pretty sure you have most of the Hive2 optimizations disabled; even the most basic of the Stinger optimizations might be missing for you.

Check if you have

    set hive.vectorized.execution.enabled=true;

Some of these new optimizations don't work on H-o-S, because Hive-on-Spark does not implement a true broadcast join - instead it uses a SparkHashTableSinkOperator, which actually writes the hash table to HDFS instead of sending it directly to the downstream task.

I don't understand why that is the case instead of an RDD broadcast, but it prevents the JOIN optimizations which convert the 34 sec query into a 3.8 sec query from applying to Spark execution.

A couple of examples would be

    set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
    set hive.vectorized.execution.mapjoin.minmax.enabled=true;

Those two make easy work of joins in LLAP, particularly semi-joins, which are common in BI queries.

Once LLAP is out of tech preview, we can enable most of them by default for Tez+LLAP, but that would not mean all of them apply to Hive-on-(Spark/MR).

Getting these new features onto another engine takes active effort from the engine's devs.

Cheers,
Gopal
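
For reference, the arithmetic behind the speedup figures quoted above can be sketched in a few lines of Python. The four timings are the ones stated in the message; the dictionary, function, and variable names are illustrative assumptions and do not come from Hive.

```python
# Timings quoted in the message, in seconds.
MRV2 = {"prep": 2.47, "exec": 200.319}
LLAP = {"prep": 1.64, "exec": 1.02}


def speedups(slow, fast):
    """Return (execution-only speedup, end-to-end speedup) of fast over slow."""
    exec_speedup = slow["exec"] / fast["exec"]
    total_speedup = (slow["prep"] + slow["exec"]) / (fast["prep"] + fast["exec"])
    return exec_speedup, total_speedup


def prep_share(timing):
    """Fraction of total wall-clock time spent preparing (compiling) the query."""
    return timing["prep"] / (timing["prep"] + timing["exec"])


if __name__ == "__main__":
    exec_x, total_x = speedups(MRV2, LLAP)
    print(f"execution-only speedup: {exec_x:.1f}x")
    print(f"end-to-end speedup:     {total_x:.1f}x")
    print(f"prep share, MRv2: {prep_share(MRV2):.1%}")
    print(f"prep share, LLAP: {prep_share(LLAP):.1%}")
```

Execution alone comes out a little above 196x, while the end-to-end ratio including prep is roughly 76x, because prep accounts for more than 60% of the LLAP wall-clock time versus under 2% for MRv2.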