I think LLAP should become a general component in the future, so LLAP + Spark can make sense. I see Tez and Spark not as competitors; they have different purposes. Hive+Tez+LLAP is not the same as Hive+Spark - for interactive queries it goes beyond that.
As for Tez: you should use a distribution (e.g. Hortonworks). Generally I would use a distribution for anything related to performance, testing etc., because doing your own installation is more complex and harder to maintain, and both performance and features will be worse without one. Which distribution you pick is up to you.
> On 11 Jul 2016, at 17:09, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> The presentation will go deeper into the topic. Otherwise, some thoughts of mine. Feel free to comment and criticise :)
>
> I am a member of the Spark, Hive and Tez user groups, plus one or two others.
> Spark is by far the biggest in terms of community interaction.
> Tez: typically one thread in a month.
> I personally started building Tez for Hive from the Tez source and gave up as it was not working. This was my own build as opposed to a distro.
> If Hive says you should use Spark or Tez, then using Spark is a perfectly valid choice.
> If Tez & LLAP offer you what Spark already has (DAG + in-memory caching) under the bonnet, why bother?
> Yes, I have seen some test results (Hive on Spark vs Hive on Tez) etc., but they are a bit dated (not being unkind) and cannot be taken as-is today. One of their concerns, if I recall, was the excessive CPU and memory usage of Spark, but by the same token LLAP will add its own need for resources.
> Essentially I am more comfortable using less of the technology stack than more. With Hive and Spark (in this context) we have two. With Hive, Tez and LLAP we have three stacks to look after, which adds to the skills cost as well. Yep, it is still good to keep it simple.
>
> My thoughts on this are that if you have a viable open source product like Spark, which is becoming something of a vogue in the Big Data space and moving very fast, why look for another one? Hive does what it says on the tin and is a good, reliable data warehouse.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>
>> On 11 July 2016 at 15:22, Ashok Kumar <ashok34...@yahoo.com> wrote:
>> Hi Mich,
>>
>> Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark":
>>
>> Have you made any more interesting findings that you would like to bring up?
>>
>> If Hive offers both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't see why Tez + LLAP is going to be a better choice, from what you mentioned.
>>
>> Thanking you
>>
>>
>>
>> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>
>> A couple of points, if I may, and kindly bear with my remarks.
>>
>> Whilst it will be very interesting to try Tez with LLAP, here is what I read about LLAP:
>>
>> "Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online).
>>
>> LLAP is an optional daemon process running on multiple nodes that provides the following:
>> Caching and data reuse across queries, with compressed columnar data in memory (off-heap)
>> Multi-threaded execution, including reads with predicate pushdown and hash joins
>> High-throughput IO using an Async IO Elevator with a dedicated thread and core per disk
>> Granular column-level security across applications"
>>
>> OK, so we have added an in-memory capability to Tez by way of LLAP - in other words, what Spark does already, and BTW Spark does not require a daemon running on any host. Don't get me wrong: it is interesting, but (without testing it myself) this sounds to me like adding a caching capability to Tez to bring it on par with Spark.
>>
>> Remember:
>>
>> Spark -> DAG + in-memory caching
>> Tez = MR on DAG
>> Tez + LLAP => DAG + in-memory caching
>>
>> OK, it is another way of getting the same result. However, my concerns:
>>
>> Spark has a wide user base; I judge this from the Spark user group traffic.
>> The Tez user group has no traffic, I am afraid.
>> LLAP I don't know.
>> It sounds like Hortonworks promotes Tez, while Cloudera does not want to know anything about Hive; they promote Impala, but that sounds like a sinking ship these days.
>>
>> Having said that, I will try Tez + LLAP :) No pun intended.
>>
>> Regards
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 31 May 2016 at 08:19, Jörn Franke <jornfra...@gmail.com> wrote:
>> Thanks, very interesting explanation. Looking forward to testing it.
>>
>> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>> >
>> >> That being said, all systems are evolving. Hive supports Tez+LLAP, which is basically the in-memory support.
>> >
>> > There is a big difference between LLAP & SparkSQL, which has to do with access-pattern needs.
>> >
>> > The first one is related to the lifetime of the cache - the Spark RDD cache is per user session, which allows further operations in that session to be optimized.
>> >
>> > LLAP is designed to be hammered by multiple user sessions running different queries, and to automate the cache eviction & selection process. There's no user-visible explicit .cache() to remember - it's automatic and concurrent.
>> >
>> > My team works with both engines, trying to improve them for ORC, but the goals of the two are different.
>> >
>> > I will probably have to write a proper academic paper & get it edited/reviewed instead of sending my ramblings to the user lists like this. Still, this needs an example to talk about.
>> >
>> > To give a qualified example, let's leave the world of single-use clusters and take the use-case detailed here:
>> >
>> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>> >
>> > There are two distinct problems there - one is that a single day sees up to 100k independent user sessions running queries, and most queries cover the last hour (& possibly join/compare against a similar hourly aggregate from the past).
>> >
>> > The problem with having 100k independent user sessions from different connections was that the SparkSQL layer drops the RDD lineage & cache whenever a user ends a session.
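To make that cache-lifetime difference concrete, here is a minimal Spark SQL sketch; the table name clicks_fact is hypothetical and not from the thread. The point is that in Spark the caching step is an explicit, per-session statement issued by the user, whereas LLAP has no equivalent statement to issue.

    -- minimal sketch, Spark SQL syntax; assumes a table named "clicks_fact" already exists in the catalog
    CACHE TABLE clicks_fact;                  -- explicit, user-issued; pins data for this session only
    SELECT count(clicks) FROM clicks_fact WHERE zipcode = 695506;
    -- a new connection/session starts cold; nothing carries over between users
    UNCACHE TABLE clicks_fact;                -- eviction is also the user's responsibility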
>> >
>> > The scale problem in general for Impala was that even though the data size was in multiple terabytes, the actual hot data was approx <20 GB, which resides on <10 machines with locality.
>> >
>> > The same problem applies when you use RDD caching with something un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly popular that the machines which hold those blocks run extra hot.
>> >
>> > A per-user-session cache model is entirely wasteful, and a common cache + MPP model effectively overloads 2-3% of the cluster while leaving the other machines idle.
>> >
>> > LLAP was designed specifically to prevent that hotspotting while maintaining the common cache model - within a few minutes after an hour ticks over, the whole cluster develops temporal popularity for the hot data, and nearly every rack has at least one cached copy of the same data for availability/performance.
>> >
>> > Since the data streams tend to be extremely wide tables (Omniture comes to mind), the cache does not actually hold all the columns in a table, and since Zipf distributions are extremely common in these real data sets, the cache does not hold all the rows either.
>> >
>> > select count(clicks) from table where zipcode = 695506;
>> >
>> > With ORC data bucketed + *sorted* by zipcode, the row-groups which end up in the cache will hold only the 2 columns (clicks & zipcode), all the bloom-filter indexes for all files will be loaded into memory, and misses on the bloom will not even feature in the cache.
>> >
>> > A subsequent query for
>> >
>> > select count(clicks) from table where zipcode = 695586;
>> >
>> > will run against the collected indexes before deciding which files need to be loaded into the cache.
>> >
>> > Then again,
>> >
>> > select count(clicks)/count(impressions) from table where zipcode = 695586;
>> >
>> > will load only impressions out of the table into the cache, adding it to the columnar cache without producing another complete copy (RDDs are not mutable, but the LLAP cache is additive).
>> >
>> > The column-split cache & index-cache separation allows this to be cheaper than a full rematerialization - both are evicted as they fill up, with different priorities.
>> >
>> > In the same vein, LLAP can do a bit of clairvoyant pre-processing, with a bit of input from the UX patterns observed from Tableau/Microstrategy users, to give the impression of being much faster than the engine really can be.
>> >
>> > The illusion of performance is likely to be indistinguishable from the actual thing - I'm actually looking for subjects for that experiment :)
>> >
>> > Cheers,
>> > Gopal
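For anyone who wants to reproduce the shape of the zipcode example above, here is a rough HiveQL sketch under assumed names: the table web_events, its column list, the bucket count and the SET values are illustrative, not taken from the thread (hive.llap.execution.mode is the Hive 2.x setting for routing work to the LLAP daemons).

    -- hedged sketch of the layout the example assumes; all names illustrative
    SET hive.execution.engine=tez;
    SET hive.llap.execution.mode=all;   -- run operators inside the LLAP daemons where possible

    CREATE TABLE web_events (
      clicks      BIGINT,
      impressions BIGINT,
      zipcode     INT
      -- ...in practice a very wide, Omniture-style table with many more columns
    )
    CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 64 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('orc.bloom.filter.columns'='zipcode');

    -- only the clicks & zipcode streams of the matching row-groups should land in the cache;
    -- the bloom-filter indexes decide which files are read at all
    SELECT count(clicks) FROM web_events WHERE zipcode = 695506;

    -- a later query adds the impressions column to the already-cached row-groups
    -- rather than re-materialising them (the LLAP cache is additive)
    SELECT count(clicks)/count(impressions) FROM web_events WHERE zipcode = 695586;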