Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Now Tez is basically MR with DAG. With Spark you get DAG + in-memory computing. Think of it as a comparison between a classic RDBMS like Oracle and an IMDB like Oracle TimesTen with in-memory processing.

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Hi Ayan, This is a very valid question and I have not seen any available instrumentation in Spark that allows one to measure this in a practical way in a cluster. Classic example: 1. If you have a memory issue, do you upgrade your RAM or scale out horizontally by adding a couple more

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
On 12 July 2016 at 09:33, Markovitz, Dudu <dmarkov...@paypal.com> wrote: > I don’t see how this explains the time differences.

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke
HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.word

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Tuesday, July 12, 2016 10:56 AM *To:* user <u...@hive.apache.org> *Cc:* user @spark <user@spark.apache.org> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its execution engine > This is the whole idea. Spark uses DAG + IM, MR is classic

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
u *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com] *Sent:* Monday, July 11, 2016 11:55 PM *To:* user <u...@hive.apache.org>; user @spark <user@spark.apache.org> *Subject:* Re: Using Spark on Hive with Hive also using Spark as i

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread ayan guha
Hi Mich, Thanks for showing examples, makes perfect sense. One question: "...I agree that on VLT (very large tables), the limitation in available memory may be the overriding factor in using Spark"... Have you observed any specific threshold for VLT which tilts the favor against Spark? For

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point with Hive on Spark and Hive on Tez + LLAP, I am thinking aloud :) 1. I am using Hive on Spark and I have a table of, say, 10GB with 100 users concurrently accessing the same partition of an ORC table (last one hour or so) 2. Spark takes data and puts it in memory. I gather

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
In my test I did like for like, keeping the systematics the same, namely: 1. Table was a Parquet table of 100 million rows 2. The same setup was used for both Hive on Spark and Hive on MR 3. Spark was very impressive compared to MR on this particular test. Just to see any issues I

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Appreciate all the comments. Hive on Spark: Spark runs as an execution engine and is only used when you query Hive. Otherwise it is not running. I run it in YARN client mode. Let me show you an example. In hive-site.xml set the execution engine to spark. It requires some configuration
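A minimal sketch of the setting described in the message above, not taken from the thread itself. It assumes the Hive property `hive.execution.engine` is the one being set to `spark`, and that "Yarn client mode" corresponds to `spark.master=yarn-client` (the Spark 1.x style master URL); the helper name `render_hive_site` is hypothetical and only renders the properties in hive-site.xml form.

```python
import xml.etree.ElementTree as ET

# Assumed values: the message only says the engine is set to spark and that
# Spark runs in YARN client mode. The exact property values are illustrative.
HIVE_ON_SPARK_PROPS = {
    "hive.execution.engine": "spark",
    "spark.master": "yarn-client",
}


def render_hive_site(props):
    """Return a hive-site.xml style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


hive_site_xml = render_hive_site(HIVE_ON_SPARK_PROPS)
print(hive_site_xml)
```

The same two settings could presumably also be applied per session from the Hive CLI with `set hive.execution.engine=spark;`; the additional configuration the message mentions is not covered here.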

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. Tez is ‘vendor’ independent. ;-) Yeah… I know… Anyone can support it. Only Hortonworks has stacked the deck in their favor. Drill could be in the same boat, although there are now more committers who are not working for MapR. I’m not sure who outside of HW is

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
I think LLAP should in the future be a general component, so LLAP + Spark can make sense. I see Tez and Spark not as competitors; they have different purposes. Hive+Tez+LLAP is not the same as Hive+Spark. I think it goes beyond that for interactive queries. Tez - you should use a

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. If memory serves, Tez with LLAP is going to be running a separate engine that is constantly running, no? Spark? That runs under Hive… Unless you’re suggesting that the Spark context is constantly running as part of the HiveServer2? > On May

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. Otherwise some thoughts of mine. Feel free to comment, criticise :) 1. I am a member of the Spark, Hive and Tez user groups plus one or two others 2. Spark is by far the biggest in terms of community interaction 3. Tez, typically one thread in

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich, Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark": have you made any more interesting findings that you would like to bring up? If Hive is offering both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why TEZ + LLAP

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
I think we are going to move to a model in which the computation stack will be separate from the storage stack and, moreover, something like Hive that provides the means for persistent storage (well, HDFS is the one that stores all the data) will have an in-memory type capability much like what Oracle

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
And you have MapR supporting Apache Drill. So these are all alternatives to Spark, and it's not necessarily an either/or scenario. You can have both. > On May 30, 2016, at 12:49 PM, Mich Talebzadeh wrote: > yep Hortonworks supports Tez for one reason or other

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
I do not think that in-memory itself will make things faster in all cases. Especially if you use Tez with ORC or Parquet. Especially for ad hoc queries on large datasets (independently of whether they fit in memory or not) this will have a significant impact. This is an experience I have also with the

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Ovidiu-Cristian MARCU
Spark in relation to Tez can be like a Flink runner for Apache Beam? The use case of Tez however may be interesting (but current implementation only YARN-based?) Spark is efficient (or faster) for a number of reasons, including its ‘in-memory’ execution (from my understanding and experiments).

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
Yep, Hortonworks supports Tez for one reason or other, which I am hopefully going to test as the query engine for Hive. Though I think Spark will be faster because of its in-memory support. Also if you are independent then you are better off dealing with Spark and Hive without the need to support

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, Most people use vendor releases because they need to have the support. Hortonworks is the vendor who has the most skin in the game when it comes to Tez. If memory serves, Tez isn’t going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez? HTH

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
Thanks. I think the problem is that the TEZ user group is exceptionally quiet. Just sent an email to the Hive user group to see if anyone has managed to build a vendor-independent version. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting. I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution. > On 29 May 2016, at

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
Hi Jorn, I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the TEZ user group kindly gave a hand but I could not get very far (or maybe I did not make enough effort) making it work. That TEZ user group is very quiet as well. My understanding is TEZ is MR with DAG but of

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Very interesting. Do you also plan a test with TEZ? > On 29 May 2016, at 13:40, Mich Talebzadeh wrote: > Hi, > I did another study of Hive using Spark engine compared to Hive with MR. > Basically took the original table imported using Sqoop and created and

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-24 Thread Mich Talebzadeh
Hi, We use Hive as the database and use Spark as an all-purpose query tool. Whether Hive is the right database for the purpose or one is better off with something like Phoenix on HBase, well, the answer is it depends and your mileage varies. So fit for purpose. Ideally what one wants is to use the

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread ayan guha
Hi, Thanks for the very useful stats. Did you have any benchmark for using Spark as the backend engine for Hive vs using the Spark Thrift server (and running Spark code for Hive queries)? We are using the latter but it would be very useful to remove the Thrift server, if we can. On Tue, May 24, 2016 at 9:51 AM, Jörn

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Jörn Franke
Hi Mich, I think these comparisons are useful. One interesting aspect could be hardware scalability in this context. Additionally, different types of computation. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work?  Thanking you On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote: Have a look at this thread Dr Mich

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Have a look at this thread Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 23 May 2016 at 09:10, Mich

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Hi Timur and everyone. I will answer your first question as it is very relevant 1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.) ? Most of the Spark users perform ETL, ML operations on Spark as well. So, we may have 3 Spark installations

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Timur Shenkao
Hi, Thanks a lot for such an interesting comparison. But important questions remain to be addressed: 1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.)? Most of the Spark users perform ETL, ML operations on Spark as well. So, we may have 3 Spark

Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Mich Talebzadeh
Hi, I have done a number of extensive tests using Spark-shell with Hive DB and ORC tables. Now one issue that we typically face is, and I quote: "Spark is fast as it uses memory and DAG. Great, but when we save data it is not fast enough." OK, but there is a solution now. If you use Spark with