Hi Mich,

Your recent presentation in London was on this topic, "Running Spark on Hive or Hive on Spark". Have you made any further interesting findings that you would like to bring up? If Hive offers both Spark and Tez in addition to MR, what stops one from using Spark? From what you mentioned, I still don't get why Tez + LLAP is going to be a better choice.

Thanking you
On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

A couple of points, if I may, and kindly bear with my remarks. It will be very interesting to try Tez with LLAP. As I read about LLAP:

"Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online). LLAP is an optional daemon process running on multiple nodes that provides the following:
- Caching and data reuse across queries with compressed columnar data in-memory (off-heap)
- Multi-threaded execution including reads with predicate pushdown and hash joins
- High throughput IO using Async IO Elevator with dedicated thread and core per disk
- Granular column level security across applications"

OK, so we have added an in-memory capability to Tez by way of LLAP; in other words, what Spark does already, and BTW Spark does not require a daemon running on any host. Don't take me wrong, it is interesting, but this sounds to me (without testing it myself) like adding a caching capability to Tez to bring it on par with Spark. Remember:

- Spark -> DAG + in-memory caching
- Tez = MR on DAG
- Tez + LLAP => DAG + in-memory caching

OK, it is another way of getting the same result. However, my concerns:

- Spark has a wide user base. I judge this from the Spark user group traffic.
- The Tez user group has no traffic, I am afraid.
- LLAP I don't know.

It sounds like Hortonworks promote Tez, and Cloudera do not want to know anything about Hive; they promote Impala, but that sounds like a sinking ship these days.
Having said that, I will try Tez + LLAP :) No pun intended.

Regards,

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 31 May 2016 at 08:19, Jörn Franke <jornfra...@gmail.com> wrote:

Thanks, very interesting explanation. Looking forward to testing it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
>> That being said, all systems are evolving. Hive supports Tez+LLAP, which
>> is basically the in-memory support.
>
> There is a big difference between LLAP & SparkSQL, which has to do
> with access-pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session, which allows further operations in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, and designed to automate the cache eviction & selection
> process. There's no user-visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve them for ORC, but the
> goals of the two are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of sending my ramblings to the user lists like
> this. Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single-use clusters
> and take the use-case detailed here:
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
> There are two distinct problems there - one is that a single day sees up
> to 100k independent user sessions running queries, and the other is that
> most queries cover the last hour (& possibly join/compare against a
> similar hour aggregate from the past).
>
> The problem with having 100k independent user sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
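The cache-lifetime difference Gopal describes (a per-session RDD cache torn down at session end, versus a long-lived daemon cache shared by all sessions) can be sketched with a toy simulation. This is plain Python counting storage reads, not code from either engine; the names and numbers are illustrative only.

```python
# Toy model of the two cache lifetimes discussed above.
# (Illustrative only -- not Spark or LLAP code.)

storage_reads = 0  # counts how often the "hot hour" is fetched from storage

def read_hot_data():
    global storage_reads
    storage_reads += 1
    return "hot-hour-partition"

def run_sessions_per_session_cache(n_sessions):
    """Spark-style: each session builds (and later drops) its own cache."""
    global storage_reads
    storage_reads = 0
    for _ in range(n_sessions):
        session_cache = {}                      # dies with the session
        if "hot" not in session_cache:
            session_cache["hot"] = read_hot_data()
    return storage_reads

def run_sessions_shared_cache(n_sessions):
    """LLAP-style: one daemon-resident cache shared by all sessions."""
    global storage_reads
    storage_reads = 0
    shared_cache = {}                           # outlives every session
    for _ in range(n_sessions):
        if "hot" not in shared_cache:
            shared_cache["hot"] = read_hot_data()
    return storage_reads

print(run_sessions_per_session_cache(100_000))  # 100000 reads
print(run_sessions_shared_cache(100_000))       # 1 read
```

With 100k sessions a day, as in the benchmark use-case above, the per-session model re-fetches the same hot hour once per session, while the shared model fetches it once.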
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resided on <10 machines with locality.
>
> The same problem applies when you use RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so
> exceedingly popular that the machines which hold those blocks run extra
> hot.
>
> A per-user-session cache model is entirely wasteful, and a common cache +
> MPP model effectively overloads 2-3% of the cluster while leaving the
> other machines idle.
>
> LLAP was designed specifically to prevent that hotspotting while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data, and nearly every rack has at least one cached copy of the same data
> for availability/performance.
>
> Since these data streams tend to be extremely wide tables (Omniture comes
> to mind), the cache actually does not hold all columns in a table; and
> since Zipf distributions are extremely common in these real data sets,
> the cache does not hold all rows either.
>
> select count(clicks) from table where zipcode = 695506;
>
> With ORC data bucketed + *sorted* by zipcode, the cache will hold only the
> 2 columns referenced (clicks & zipcode), all bloom-filter indexes for all
> files will be loaded into memory, and misses on the bloom will not even
> feature in the cache.
>
> A subsequent query,
>
> select count(clicks) from table where zipcode = 695586;
>
> will run against the collected indexes before deciding which files need
> to be loaded into cache.
>
> Then again,
>
> select count(clicks)/count(impressions) from table where zipcode = 695586;
>
> will load only impressions out of the table into cache, adding it to the
> columnar cache without producing another complete copy (RDDs are not
> mutable, but the LLAP cache is additive).
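The bloom-filter-first planning described above can be sketched in miniature: keep a small in-memory filter per file over its zipcode values, and only consider loading files whose filter might contain the predicate value. This is a hypothetical toy, not ORC's actual index format; the filter sizes and file names are made up.

```python
import hashlib

# Minimal bloom-filter sketch (illustrative, not ORC's real implementation):
# one small bit array per "file" over its zipcode values. A planner can
# consult these in-memory indexes first and only load files whose filter
# *might* contain the predicate value; definite misses never touch the cache.

M_BITS = 1024   # bits per filter
K_HASHES = 3    # hash functions per key

def _positions(value, k=K_HASHES, m=M_BITS):
    # Derive k bit positions from k salted hashes of the value.
    for i in range(k):
        h = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
        yield int(h, 16) % m

def make_filter(values):
    bits = [False] * M_BITS
    for v in values:
        for p in _positions(v):
            bits[p] = True
    return bits

def might_contain(bits, value):
    # No false negatives; rare false positives are possible.
    return all(bits[p] for p in _positions(value))

# One filter per "file" of zipcodes (hypothetical file names).
files = {
    "part-0000": make_filter([695506, 695507]),
    "part-0001": make_filter([600001, 600002]),
}

def files_to_load(zipcode):
    """Only files whose bloom filter matches are candidates for the cache."""
    return [name for name, bits in files.items() if might_contain(bits, zipcode)]

print(files_to_load(695506))  # always includes "part-0000" (no false negatives)
```

A query for zipcode 695586 would, as in the mail above, run against these collected indexes first and only then decide which files to pull into the cache.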
> The column-split cache & index-cache separation allows this to be cheaper
> than a full rematerialization - both are evicted as they fill up, with
> different priorities.
>
> In the same vein, LLAP can do a bit of clairvoyant pre-processing, with a
> bit of input from the UX patterns observed from Tableau/Microstrategy
> users, to give the impression of being much faster than the engine really
> can be.
>
> The illusion of performance is likely to be indistinguishable from the
> real thing - I'm actually looking for subjects for that experiment :)
>
> Cheers,
> Gopal
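The "clairvoyant pre-processing" Gopal mentions can be sketched as a tiny query-pattern predictor: remember which column set a dashboard typically requests after another, and warm the cache with the most likely follow-up before the next query arrives. This is a toy model of the idea, assuming nothing about LLAP's real heuristics; all names here are hypothetical.

```python
from collections import Counter, defaultdict

# Toy "clairvoyant" prefetch sketch (illustrative only): track which
# column set each dashboard tends to request after another, and prewarm
# the cache with the most likely follow-up.

transitions = defaultdict(Counter)  # previous column-set -> Counter of next sets
prewarmed = set()                   # column sets warmed ahead of time

def observe(prev_cols, next_cols):
    """Record that a query on next_cols followed one on prev_cols."""
    transitions[prev_cols][next_cols] += 1

def prefetch_after(cols):
    """Warm the cache with the most common follow-up to `cols`, if any."""
    follow_ups = transitions.get(cols)
    if follow_ups:
        likely_next, _ = follow_ups.most_common(1)[0]
        prewarmed.add(likely_next)
        return likely_next
    return None

# Train on a hypothetical Tableau-like pattern: a clicks query is usually
# followed by a clicks/impressions (CTR) query.
observe(("clicks",), ("clicks", "impressions"))
observe(("clicks",), ("clicks", "impressions"))
observe(("clicks",), ("zipcode",))

print(prefetch_after(("clicks",)))  # ('clicks', 'impressions')
```

If the prediction holds, the impressions column is already additive in the cache when the CTR query lands, which is exactly the "illusion of performance" being described.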