> These look pretty impressive. What execution mode were you running
> these in? YARN client, maybe?

There is no other mode - everything runs on YARN.

> 53 times


The factor is actually bigger if you look only at execution time.

The MRv2 version takes 2.47s to prep a query, while the LLAP version takes
1.64s.

The MRv2 version takes 200.319s to execute the query, while the LLAP
version takes 1.02s.

The execution speedup is nearly 200x, but compile time becomes significant
as you scale down the latencies.
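
To make the arithmetic concrete, here is a quick Python sketch using only
the four timings above. It assumes end-to-end latency is just prep plus
execute, which ignores anything else in the pipeline; the variable names
are only for illustration.

```python
# Timings quoted above, in seconds.
mr_prep, mr_exec = 2.47, 200.319
llap_prep, llap_exec = 1.64, 1.02

# Speedup of the execution phase alone.
exec_speedup = mr_exec / llap_exec

# Assumption: total latency = prep + execute, nothing else.
mr_total = mr_prep + mr_exec
llap_total = llap_prep + llap_exec
total_speedup = mr_total / llap_total

# Fraction of each total that is spent preparing (compiling) the query.
mr_compile_share = mr_prep / mr_total
llap_compile_share = llap_prep / llap_total

print(f"execution speedup:  {exec_speedup:.1f}x")        # 196.4x
print(f"end-to-end speedup: {total_speedup:.1f}x")       # 76.2x
print(f"compile share, MRv2: {mr_compile_share:.1%}")    # 1.2%
print(f"compile share, LLAP: {llap_compile_share:.1%}")  # 61.7%
```

Under that assumption, compile is roughly 1% of the MRv2 total but well
over half of the LLAP total.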

> My calculations on Hive 2 on Spark 1.3.1

Not sure where Hive2-on-Spark is going - the last commit to SparkCompiler
was late last year, before there was a Hive2.

On the speed front, I'm pretty sure you have most of the Hive2
optimizations disabled; even the most basic of the Stinger optimizations
might be missing for you.

Check whether you have this set:

set hive.vectorized.execution.enabled=true;


Some of these new optimizations don't work on H-o-S, because Hive-on-Spark
does not implement a true broadcast join. Instead it uses a
SparkHashTableSinkOperator, which actually writes the hash table to HDFS
instead of sending it directly to the downstream task.


I don't understand why that is the case instead of an RDD broadcast, but
that prevents the JOIN optimizations which convert the 34 sec query into a
3.8 sec query from applying to Spark execution.

A couple of examples of those JOIN optimizations:

set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
set hive.vectorized.execution.mapjoin.minmax.enabled=true;

Those two make easy work of joins in LLAP, particularly semi-joins which
are common in BI queries.
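
As a rough illustration of the min/max idea, here is a Python sketch of a
semi-join that skips the hash lookup for probe keys outside the range of
the small table's keys. This is not Hive's code: it assumes the minmax
setting does what its name suggests for integer keys, and the function
name and sample data are made up.

```python
def semi_join_minmax(probe_keys, build_keys):
    """Left semi-join on integer keys with a min/max pre-check (illustrative)."""
    build = set(build_keys)  # stands in for the map-join hash table
    if not build:
        return [], 0
    lo, hi = min(build), max(build)
    matched, lookups = [], 0
    for key in probe_keys:
        if key < lo or key > hi:
            continue  # out of range: cannot match, so skip the hash lookup
        lookups += 1
        if key in build:
            matched.append(key)
    return matched, lookups


# Hypothetical data: a large-side key column and a small dimension filter.
probe = [3, 105, 101, 9000, 103, 42, 101]
build = [100, 101, 105]
rows, lookups = semi_join_minmax(probe, build)
print(rows, lookups)  # [105, 101, 101] 4
```

Only 4 of the 7 probe rows reach the hash table here; the other 3 are
rejected by two integer comparisons.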


Once LLAP is out of tech preview, we can enable most of them by default
for Tez+LLAP, but that would not mean all of them apply to
Hive-on-(Spark/MR).

Getting these new features onto another engine takes active effort from
the engine's devs.

Cheers,
Gopal









