I think LLAP should become a general component in the future, so LLAP + Spark can make sense. I see Tez and Spark not as competitors; they have different purposes. Hive+Tez+LLAP is not the same as Hive+Spark - for interactive queries it goes beyond that.
As for Tez: you should use a distribution (e.g. Hortonworks). Generally I would use a distribution for anything related to performance, testing etc., because doing your own installation is more complex and harder to maintain, and both performance and features will be worse without one. Which distribution you pick is up to you.
> On 11 Jul 2016, at 17:09, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> The presentation will go deeper into the topic. Otherwise, some thoughts of mine. Feel free to comment and criticise :)
>
> I am a member of the Spark, Hive and Tez user groups, plus one or two others.
> Spark is by far the biggest in terms of community interaction.
> Tez: typically one thread in a month.
> I personally started building Tez for Hive from the Tez source and gave up as it was not working. This was my own build as opposed to a distro.
> If Hive says you should use Spark or Tez, then using Spark is a perfectly valid choice.
> If Tez & LLAP offer you what Spark already has (DAG + in-memory caching) under the bonnet, why bother?
> Yes, I have seen some test results (Hive on Spark vs Hive on Tez) etc., but they are a bit dated (not being unkind) and cannot be taken as-is today. One of their concerns, if I recall, was the excessive CPU and memory usage of Spark, but by the same token LLAP will add its own need for resources.
> Essentially I am more comfortable using less of the technology stack than more. With Hive and Spark (in this context) we have two. With Hive, Tez and LLAP we have three stacks to look after, which adds to the skills cost as well. Yep, it is still good to keep it simple.
>
> My thoughts on this are that if you have a viable open source product like Spark, which is becoming something of a vogue in the Big Data space and moving very fast, why look for another one? Hive does what it says on the tin and is a good, reliable data warehouse.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>
>> On 11 July 2016 at 15:22, Ashok Kumar <ashok34...@yahoo.com> wrote:
>> Hi Mich,
>>
>> Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark":
>>
>> Have you made any more interesting findings that you would like to bring up?
>>
>> If Hive offers both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't see why Tez + LLAP is going to be a better choice, from what you mentioned.
>>
>> Thanking you
>>
>>
>>
>> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>
>> A couple of points, if I may, and kindly bear with my remarks.
>>
>> Whilst it will be very interesting to try Tez with LLAP, here is what I read about LLAP:
>>
>> "Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online).
>>
>> LLAP is an optional daemon process running on multiple nodes that provides the following:
>> Caching and data reuse across queries, with compressed columnar data in memory (off-heap)
>> Multi-threaded execution, including reads with predicate pushdown and hash joins
>> High-throughput IO using an Async IO Elevator with a dedicated thread and core per disk
>> Granular column-level security across applications"
>>
>> OK, so we have added an in-memory capability to Tez by way of LLAP - in other words, what Spark does already, and BTW Spark does not require a daemon running on any host. Don't get me wrong: it is interesting, but (without testing it myself) this sounds to me like adding a caching capability to Tez to bring it on par with Spark.
>>
>> Remember:
>>
>> Spark -> DAG + in-memory caching
>> Tez = MR on DAG
>> Tez + LLAP => DAG + in-memory caching
>>
>> OK, it is another way of getting the same result. However, my concerns:
>>
>> Spark has a wide user base; I judge this from the Spark user group traffic.
>> The Tez user group has no traffic, I am afraid.
>> LLAP I don't know.
>> It sounds like Hortonworks promotes Tez, while Cloudera does not want to know anything about Hive; they promote Impala, but that sounds like a sinking ship these days.
>>
>> Having said that, I will try Tez + LLAP :) No pun intended.
>>
>> Regards
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 31 May 2016 at 08:19, Jörn Franke <jornfra...@gmail.com> wrote:
>> Thanks, very interesting explanation. Looking forward to testing it.
>>
>> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>> >
>> >> That being said, all systems are evolving. Hive supports Tez+LLAP, which is basically the in-memory support.
>> >
>> > There is a big difference between LLAP & SparkSQL, which has to do with access-pattern needs.
>> >
>> > The first one is related to the lifetime of the cache - the Spark RDD cache is per user session, which allows further operations in that session to be optimized.
>> >
>> > LLAP is designed to be hammered by multiple user sessions running different queries, and to automate the cache eviction & selection process. There's no user-visible explicit .cache() to remember - it's automatic and concurrent.
>> >
>> > My team works with both engines, trying to improve them for ORC, but the goals of the two are different.
>> >
>> > I will probably have to write a proper academic paper & get it edited/reviewed instead of sending my ramblings to the user lists like this. Still, this needs an example to talk about.
>> >
>> > To give a qualified example, let's leave the world of single-use clusters and take the use-case detailed here:
>> >
>> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>> >
>> > There are two distinct problems there - one is that a single day sees up to 100k independent user sessions running queries, and most queries cover the last hour (& possibly join/compare against a similar hourly aggregate from the past).
>> >
>> > The problem with having 100k independent user sessions from different connections was that the SparkSQL layer drops the RDD lineage & cache whenever a user ends a session.
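To make that cache-lifetime difference concrete, here is a minimal Spark SQL sketch; the table name clicks_fact is hypothetical and not from the thread. The point is that in Spark the caching step is an explicit, per-session statement issued by the user, whereas LLAP has no equivalent statement to issue.

    -- minimal sketch, Spark SQL syntax; assumes a table named "clicks_fact" already exists in the catalog
    CACHE TABLE clicks_fact;                  -- explicit, user-issued; pins data for this session only
    SELECT count(clicks) FROM clicks_fact WHERE zipcode = 695506;
    -- a new connection/session starts cold; nothing carries over between users
    UNCACHE TABLE clicks_fact;                -- eviction is also the user's responsibility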
>> >
>> > The scale problem in general for Impala was that even though the data size was in multiple terabytes, the actual hot data was approx <20 GB, which resides on <10 machines with locality.
>> >
>> > The same problem applies when you use RDD caching with something un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly popular that the machines which hold those blocks run extra hot.
>> >
>> > A per-user-session cache model is entirely wasteful, and a common cache + MPP model effectively overloads 2-3% of the cluster while leaving the other machines idle.
>> >
>> > LLAP was designed specifically to prevent that hotspotting while maintaining the common cache model - within a few minutes after an hour ticks over, the whole cluster develops temporal popularity for the hot data, and nearly every rack has at least one cached copy of the same data for availability/performance.
>> >
>> > Since the data streams tend to be extremely wide tables (Omniture comes to mind), the cache does not actually hold all the columns in a table, and since Zipf distributions are extremely common in these real data sets, the cache does not hold all the rows either.
>> >
>> > select count(clicks) from table where zipcode = 695506;
>> >
>> > With ORC data bucketed + *sorted* by zipcode, the row-groups which end up in the cache will hold only the 2 columns (clicks & zipcode), all the bloom-filter indexes for all files will be loaded into memory, and misses on the bloom will not even feature in the cache.
>> >
>> > A subsequent query for
>> >
>> > select count(clicks) from table where zipcode = 695586;
>> >
>> > will run against the collected indexes before deciding which files need to be loaded into the cache.
>> >
>> > Then again,
>> >
>> > select count(clicks)/count(impressions) from table where zipcode = 695586;
>> >
>> > will load only impressions out of the table into the cache, adding it to the columnar cache without producing another complete copy (RDDs are not mutable, but the LLAP cache is additive).
>> >
>> > The column-split cache & index-cache separation allows this to be cheaper than a full rematerialization - both are evicted as they fill up, with different priorities.
>> >
>> > In the same vein, LLAP can do a bit of clairvoyant pre-processing, with a bit of input from the UX patterns observed from Tableau/Microstrategy users, to give the impression of being much faster than the engine really can be.
>> >
>> > The illusion of performance is likely to be indistinguishable from the actual thing - I'm actually looking for subjects for that experiment :)
>> >
>> > Cheers,
>> > Gopal
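For anyone who wants to reproduce the shape of the zipcode example above, here is a rough HiveQL sketch under assumed names: the table web_events, its column list, the bucket count and the SET values are illustrative, not taken from the thread (hive.llap.execution.mode is the Hive 2.x setting for routing work to the LLAP daemons).

    -- hedged sketch of the layout the example assumes; all names illustrative
    SET hive.execution.engine=tez;
    SET hive.llap.execution.mode=all;   -- run operators inside the LLAP daemons where possible

    CREATE TABLE web_events (
      clicks      BIGINT,
      impressions BIGINT,
      zipcode     INT
      -- ...in practice a very wide, Omniture-style table with many more columns
    )
    CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 64 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('orc.bloom.filter.columns'='zipcode');

    -- only the clicks & zipcode streams of the matching row-groups should land in the cache;
    -- the bloom-filter indexes decide which files are read at all
    SELECT count(clicks) FROM web_events WHERE zipcode = 695506;

    -- a later query adds the impressions column to the already-cached row-groups
    -- rather than re-materialising them (the LLAP cache is additive)
    SELECT count(clicks)/count(impressions) FROM web_events WHERE zipcode = 695586;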