A couple of points, if I may, and kindly bear with my remarks.

Whilst it will be very interesting to try Tez with LLAP, here is what I read about LLAP:

"Sub-second queries require fast query execution and low setup cost. The
challenge for Hive is to achieve this without giving up on the scale and
flexibility that users depend on. This requires a new approach using a
hybrid engine that leverages Tez and something new called LLAP (Live Long
and Process, #llap online).

LLAP is an optional daemon process running on multiple nodes, that provides
the following:

   - Caching and data reuse across queries with compressed columnar data
   in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and
   hash joins
   - High throughput IO using Async IO Elevator with dedicated thread and
   core per disk
   - Granular column level security across applications"

OK, so we have added an in-memory capability to Tez by way of LLAP. In
other words, what Spark does already - and BTW Spark does not require a
daemon running on any host. Don't get me wrong: it is interesting, but
(without testing it myself) this sounds to me like adding a caching
capability to Tez to bring it on par with Spark.

Remember:

Spark      -> DAG + in-memory caching
Tez        =  MR on DAG
Tez + LLAP => DAG + in-memory caching
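
To illustrate Spark's side of this, a minimal sketch in Spark SQL (the
table name is just an example, not a real dataset):

 CACHE TABLE sales;                 -- explicit, in-memory, no daemon needed
 SELECT count(*) FROM sales;        -- served from the cached columnar data
 UNCACHE TABLE sales;               -- cache is gone with this, or session end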

OK, it is another way of getting the same result. However, here are my concerns:


   - Spark has a wide user base. I judge this from the Spark user group
   traffic.
   - The Tez user group has no traffic, I am afraid.
   - LLAP I don't know about.

It sounds like Hortonworks promotes Tez, while Cloudera does not want to
know anything about Hive; they promote Impala instead, but that sounds
like a sinking ship these days.

Having said that, I will try Tez + LLAP :) No pun intended.

Regards

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:19, Jörn Franke <jornfra...@gmail.com> wrote:

> Thanks, very interesting explanation. Looking forward to testing it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
> >
> >
> > >> That being said, all systems are evolving. Hive supports Tez+LLAP,
> > >> which is basically the in-memory support.
> >
> > There is a big difference between LLAP & SparkSQL, which has to do
> > with access pattern needs.
> >
> > The first one is related to the lifetime of the cache - the Spark RDD
> > cache is per-user-session, which allows further operations in that
> > session to be optimized.
> >
> > LLAP is designed to be hammered by multiple user sessions running
> > different queries, and to automate the cache eviction & selection
> > process. There's no user-visible explicit .cache() to remember - it's
> > automatic and concurrent.
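> >
> > As a rough illustration (the table name is made up), in SparkSQL the
> > cache is tied to your session:
> >
> >  -- session 1: caching is explicit - the SQL analogue of .cache()
> >  CACHE TABLE web_logs;
> >  SELECT count(*) FROM web_logs;   -- served from this session's cache
> >  -- session 2 (new connection): the cache above does not exist here
> >
> > whereas with LLAP the second session's identical query hits the shared
> > daemon cache automatically.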
> >
> > My team works with both engines, trying to improve it for ORC, but the
> > goals of both are different.
> >
> > I will probably have to write a proper academic paper & get it
> > edited/reviewed instead of sending my ramblings to the user lists like
> > this. Still, this needs an example to talk about.
> >
> > To give a qualified example, let's leave the world of single-use clusters
> > and take the use-case detailed here
> >
> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> >
> >
> > There are two distinct problems there - one is that a single day sees
> > up to 100k independent user sessions running queries, and that most
> > queries cover the last hour (& possibly join/compare against a similar
> > hour aggregate from the past).
> >
> > The problem with having 100k independent user-sessions from different
> > connections was that the SparkSQL layer drops the RDD lineage & cache
> > whenever a user ends a session.
> >
> > The scale problem in general for Impala was that even though the data
> > size was in multiple terabytes, the actual hot data was approx <20 GB,
> > which resides on <10 machines with locality.
> >
> > The same problem applies when you use RDD caching with something
> > un-replicated like Tachyon/Alluxio, since the same RDD will be so
> > exceedingly popular that the machines which hold those blocks run
> > extra hot.
> >
> > A per-user-session cache model is entirely wasteful, and a common
> > cache + MPP model effectively overloads 2-3% of the cluster while
> > leaving the other machines idle.
> >
> > LLAP was designed specifically to prevent that hotspotting, while
> > maintaining the common cache model - within a few minutes after an hour
> > ticks over, the whole cluster develops temporal popularity for the hot
> > data and nearly every rack has at least one cached copy of the same data
> > for availability/performance.
> >
> > Since data streams tend to be extremely wide tables (Omniture comes to
> > mind), the cache actually does not hold all columns in a table, and
> > since Zipf distributions are extremely common in these real data sets,
> > the cache does not hold all rows either.
> >
> > select count(clicks) from table where zipcode = 695506;
> >
> > with ORC data bucketed + *sorted* by zipcode, the row-groups which end
> > up in the cache will hold only 2 columns (clicks & zipcode), all bloom
> > filter indexes for all files will be loaded into memory, and files that
> > miss on the bloom filter will not even feature in the cache.
> >
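> > For instance, such a table might be declared along these lines (a
> > sketch only - the table name, bucket count, and exact columns are my
> > assumptions, not the actual schema):
> >
> >  CREATE TABLE clickstream (clicks BIGINT, impressions BIGINT, zipcode INT)
> >  CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 64 BUCKETS
> >  STORED AS ORC
> >  TBLPROPERTIES ('orc.bloom.filter.columns'='zipcode');
> >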
> > A subsequent query for
> >
> > select count(clicks) from table where zipcode = 695586;
> >
> > will run against the collected indexes, before deciding which files need
> > to be loaded into cache.
> >
> >
> > Then again,
> >
> > select count(clicks)/count(impressions) from table where zipcode =
> > 695586;
> >
> > will load only impressions out of the table into cache, to add it to the
> > columnar cache without producing another complete copy (RDDs are not
> > mutable, but LLAP cache is additive).
> >
> > The column split cache & index-cache separation allows for this to be
> > cheaper than a full rematerialization - both are evicted as they fill up,
> > with different priorities.
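> >
> > To recap the cache behaviour of the three queries (the annotations are
> > my illustration, not engine output):
> >
> >  select count(clicks) from table where zipcode = 695506;
> >  -- caches: clicks & zipcode row-groups + zipcode bloom filters
> >
> >  select count(clicks) from table where zipcode = 695586;
> >  -- caches: nothing new unless the indexes point at uncached files
> >
> >  select count(clicks)/count(impressions) from table where zipcode = 695586;
> >  -- caches: adds the impressions column next to the existing entries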
> >
> > In the same vein, LLAP can do a bit of clairvoyant pre-processing, with
> > a bit of input from UX patterns observed from Tableau/Microstrategy
> > users, to give the impression of being much faster than the engine
> > really can be.
> >
> > The illusion of performance is likely to be indistinguishable from the
> > actual thing - I'm actually looking for subjects for that experiment :)
> >
> > Cheers,
> > Gopal
> >
> >
>
