> For example, a Hive job may start Tez containers, which then retrieve data 
> from LLAP running concurrently. In the current implementation, this is 
> unrealistic

That is how LLAP was built - to push work from Tez to LLAP vertex by vertex, 
instead of an all-or-nothing implementation.

Here are the slides from Hadoop Summit 2015 describing how that is plugged 
into LLAP.

https://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive/21

The flag in question is hive.llap.execution.mode - the most common use-case 
imagined for it was mode=map, where only the table scans plus all secure 
operators (i.e. no temporary UDFs) run inside LLAP, to take advantage of the 
cache.
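
A very rough sketch of flipping that mode per-session from Python, assuming a 
plain HiveServer2 connection via pyhive - the host, port, user, table and 
column names below are made up:

    # Toggle LLAP execution per-session, then run a scan that benefits
    # from the cache. Assumes the pyhive client; everything named here
    # (host, port, user, table, columns) is a placeholder.
    from pyhive import hive

    conn = hive.connect(host="hs2.example.com", port=10000, username="hive")
    cur = conn.cursor()

    # mode=map: only the table-scan side of the plan runs inside LLAP,
    # the remaining operators stay in plain Tez containers.
    cur.execute("SET hive.llap.execution.mode=map")
    cur.execute("SELECT trip_id, lat, lon FROM trips WHERE ds = '2015-06-09'")
    print(cur.fetchmany(10))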

LLAP can shuffle data to a Tez container, but it cannot shuffle data from a Tez 
container back into the daemon (& that would not be very useful anyway, since 
the shuffled data would not be cached).

Here's the class that decides the hybrid execution tree & plans the split 
between LLAP and Tez within the same query DAG.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/LlapDecider.java#L81

If you want to consume the LLAP-cached rows from something like GPUs running 
Caffe, you can access the LLAP cache via the SparkSQL data-source APIs.

https://github.com/hortonworks/spark-llap-release/blob/HDP-2.6.3.0-235-tag/examples/src/main/python/spark_llap_dsl.py
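
A minimal PySpark sketch of that DSL path - the data-source name and the 
option key below are assumptions on my part, so check the linked 
spark_llap_dsl.py for the exact strings shipped with your build:

    # Read a Hive table through the spark-llap data source instead of going
    # straight to the filesystem, so reads are served out of the LLAP cache.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("llap-dsl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    df = (spark.read
          .format("org.apache.spark.sql.hive.llap")   # assumed data-source name
          .option("table", "trips")                   # placeholder table
          .load())

    df.printSchema()
    df.show(10)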

This is faster than directly reading off cloud filesystems (because of LLAP's 
SSD cache). Even with a perf penalty on-prem, it is very useful for 
restricting the access of Spark ML[1] to certain columns (i.e. you can extract 
lat/long from a table which holds other PII data) without having to copy out 
the projected data to share from the EDW end of the shop to the ML side of it, 
even if the entire data-set is HDFS-encrypted.

Cheers,
Gopal
[1] - https://hortonworks.com/blog/row-column-level-control-apache-spark/

