Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it. It is just that Hortonworks has stacked the 
deck in their favor. 

Drill could be in the same boat, although there are now more committers who are not 
working for MapR. I'm not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask: how do you run Hive on Spark? Is the 
execution engine (that is, the Spark context) always running? (Client mode, I assume.) 
Are the executors always running? Can you run multiple queries from multiple 
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable Spark is going to be as the engine under Hive… 

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Thanks. I think the problem is that the Tez user group is exceptionally quiet. 
> I have just sent an email to the Hive user group to see whether anyone has managed 
> to build a vendor-independent version.
> 
> 
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
> 
> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for Hive 1.2 
> it is integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
>> the Tez user group kindly gave a hand, but I could not get very far (or maybe I 
>> did not make enough effort) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is that Tez is MR with DAG, but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with Tez, or use Impala instead of Hive, 
>> etc.), as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>> 
>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>> Very interesting. Do you also plan a test with Tez?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>> 
>>> Basically, I took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions, 
>>> as follows:
>>> 
>>> <sales_partition.PNG>
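>>> 
>>> (Since the attached screenshot may not come through in plain text, here is a rough 
>>> sketch of the kind of DDL and load this refers to; the column names are 
>>> illustrative only:)
>>> 
>>> CREATE TABLE sales_orc (
>>>   id     BIGINT,
>>>   amount DECIMAL(10,2)
>>> )
>>> PARTITIONED BY (year INT, month INT)
>>> STORED AS ORC;
>>> 
>>> -- allow dynamic partitioning for the populate step
>>> SET hive.exec.dynamic.partition=true;
>>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>> INSERT OVERWRITE TABLE sales_orc PARTITION (year, month)
>>> SELECT id, amount, year, month FROM sales_staging;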
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes, as seen below for each PARTITION. That is just an 
>>> individual partition, and there are 48 partitions.
>>> 
>>> In contrast, doing the same operation with the Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> below:
>>> 
>>> <image.png>
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and Spark as an all-purpose query tool.
>>> 
>>> Whether Hive is the right database for the purpose, or one is better off with 
>>> something like Phoenix on HBase, well, the answer is it depends and your 
>>> mileage may vary. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what one wants is to use the fastest method to get the results. How 
>>> fast is constrained by our SLA agreements in production, and that keeps us 
>>> from unnecessary further work, as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, mainly ORC tables.
>>> 
>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have, for now, a 
>>> combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but 
>>> at the moment that is one of my projects.
>>> 
>>> We do not use any vendor's products, as this enables us to avoid being tied 
>>> down to yet another vendor after years of SAP, Oracle and MS dependency. 
>>> Besides, there is some politics going on, with one vendor promoting Tez and 
>>> another Spark as a backend. That is fine, but obviously we prefer to make an 
>>> independent assessment ourselves.
>>> 
>>> My gut feeling is that one needs to look at the use case. Recently we had 
>>> to import a very large table from Oracle to Hive and decided to use Spark 
>>> 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC 
>>> connection with a temp table and it was good. We could have used Sqoop but 
>>> decided to settle for Spark, so it all depends on the use case.
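>>> 
>>> (A minimal sketch of that JDBC temp-table approach in Spark SQL; the connection 
>>> details and table names below are illustrative only, and the Oracle JDBC driver 
>>> must be on the Spark classpath:)
>>> 
>>> -- expose the Oracle table to Spark SQL as a temporary table
>>> CREATE TEMPORARY TABLE oracle_src
>>> USING org.apache.spark.sql.jdbc
>>> OPTIONS (
>>>   url "jdbc:oracle:thin:@//orahost:1521/ORASVC",
>>>   dbtable "SCOTT.BIG_TABLE",
>>>   user "scott",
>>>   password "***"
>>> );
>>> 
>>> -- then write it into the Hive ORC target table
>>> INSERT INTO TABLE oraclehadoop.target_orc SELECT * FROM oracle_src;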
>>> 
>>> HTH
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>> 
>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote:
>>> Hi
>>> 
>>> Thanks for the very useful stats. 
>>> 
>>> Do you have any benchmark comparing Spark as the backend engine for Hive vs. 
>>> using the Spark thrift server (and running Spark code for Hive queries)? We are 
>>> using the latter, but it would be very useful to remove the thrift server, if we can. 
>>> 
>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> 
>>> Hi Mich,
>>> 
>>> I think these comparisons are useful. One interesting aspect could be 
>>> hardware scalability in this context, and additionally different types of 
>>> computations. Furthermore, one could compare Spark and Tez+LLAP as 
>>> execution engines. I have the gut feeling that each one can be justified 
>>> by different use cases.
>>> Nevertheless, there should always be a disclaimer for such comparisons, 
>>> because Spark and Hive are not good for a lot of concurrent lookups of 
>>> single rows. They are also not good for frequently writing small amounts of data 
>>> (e.g. sensor data); here HBase could be more interesting. Other use cases can 
>>> justify graph databases, such as Titan, or text analytics / data matching 
>>> using Solr on Hadoop.
>>> Finally, even if you have a lot of data, you need to think about whether you 
>>> always have to process everything. For instance, I have found valid use cases in 
>>> practice where we decided to evaluate 10 machine learning models in 
>>> parallel on only a sample of the data and then evaluate only the "winning" model 
>>> on the total data.
>>> 
>>> As always it depends :) 
>>> 
>>> Best regards
>>> 
>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive 1.2 
>>> and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how to manage 
>>> bringing both together. You may also check Apache Bigtop (a vendor-neutral 
>>> distribution) for how they managed to bring both together.
>>> 
>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>>  
>>>> I have done a number of extensive tests using Spark-shell with Hive DB and 
>>>> ORC tables.
>>>>  
>>>> Now, one issue that we typically face is, and I quote:
>>>>  
>>>> Spark is fast as it uses Memory and DAG. Great but when we save data it is 
>>>> not fast enough
>>>> 
>>>> OK, but there is a solution now. If you use Spark with Hive and you are on 
>>>> a decent version of Hive (>= 0.14), then you can also deploy Spark as the 
>>>> execution engine for Hive. That will make your application run pretty fast, 
>>>> as you no longer rely on the old MapReduce engine for Hive. In a nutshell, 
>>>> you gain speed in both querying and storage.
>>>>  
>>>> I have made some comparisons on this set-up and I am sure some of you will 
>>>> find it useful.
>>>>  
>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>> The version of Hive I use is Hive 2.
>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and 
>>>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out 
>>>> the Hadoop libraries mismatch).
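>>>>  
>>>> (For reference, a minimal sketch of how the engine is switched at the session 
>>>> level in beeline; the memory and instance values below are illustrative only:)
>>>>  
>>>> -- use Spark instead of MR as the Hive execution engine for this session
>>>> set hive.execution.engine=spark;
>>>> -- Spark-side settings Hive passes to the Spark application it launches
>>>> set spark.master=yarn-client;
>>>> set spark.executor.memory=2g;
>>>> set spark.executor.instances=2;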
>>>>  
>>>> As an example, I am using the Hive on Spark engine to find the min and max of 
>>>> IDs for a table with 1 billion rows:
>>>>  
>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
>>>> stddev(id) from oraclehadoop.dummy;
>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>  
>>>>  
>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>  
>>>> INFO  : Completed compiling 
>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>>>  Time taken: 1.911 seconds
>>>> INFO  : Executing 
>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
>>>>  select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>> INFO  : Query ID = 
>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>> INFO  : Total jobs = 1
>>>> INFO  : Launching Job 1 out of 1
>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>  
>>>> Query Hive on Spark job[0] stages:
>>>> 0
>>>> 1
>>>> Status: Running (Hive on Spark job[0])
>>>> Job Progress Format
>>>> CurrentTime StageId_StageAttemptId: 
>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>>>> [StageCost]
>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>> INFO  :
>>>> Query Hive on Spark job[0] stages:
>>>> INFO  : 0
>>>> INFO  : 1
>>>> INFO  :
>>>> Status: Running (Hive on Spark job[0])
>>>> INFO  : Job Progress Format
>>>> CurrentTime StageId_StageAttemptId: 
>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
>>>> [StageCost]
>>>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 
>>>> Finished
>>>> Status: Finished successfully in 53.25 seconds
>>>> OK
>>>> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 
>>>> 0(+1)/1
>>>> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 
>>>> 1/1 Finished
>>>> INFO  : Status: Finished successfully in 53.25 seconds
>>>> INFO  : Completed executing 
>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>>>  Time taken: 56.337 seconds
>>>> INFO  : OK
>>>> +-----+------------+---------------+-----------------------+--+
>>>> | c0  |     c1     |      c2       |          c3           |
>>>> +-----+------------+---------------+-----------------------+--+
>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>> +-----+------------+---------------+-----------------------+--+
>>>> 1 row selected (58.529 seconds)
>>>>  
>>>> 58 seconds first run with cold cache is pretty good
>>>>  
>>>> And let us compare it with running the same query on map-reduce engine
>>>>  
>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
>>>> versions. Consider using a different execution engine (i.e. spark, tez) or 
>>>> using Hive 1.X releases.
>>>> No rows affected (0.007 seconds)
>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
>>>> stddev(id) from oraclehadoop.dummy;
>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in 
>>>> the future versions. Consider using a different execution engine (i.e. 
>>>> spark, tez) or using Hive 1.X releases.
>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>> Total jobs = 1
>>>> Launching Job 1 out of 1
>>>> Number of reduce tasks determined at compile time: 1
>>>> In order to change the average load for a reducer (in bytes):
>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>> In order to limit the maximum number of reducers:
>>>>   set hive.exec.reducers.max=<number>
>>>> In order to set a constant number of reducers:
>>>>   set mapreduce.job.reduces=<number>
>>>> Starting Job = job_1463956731753_0005, Tracking URL = 
>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill 
>>>> job_1463956731753_0005
>>>> Hadoop job information for Stage-1: number of mappers: 22; number of 
>>>> reducers: 1
>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>> INFO  : Compiling 
>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>>>  select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>> INFO  : Semantic Analysis Completed
>>>> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, 
>>>> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), 
>>>> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, 
>>>> type:double, comment:null)], properties:null)
>>>> INFO  : Completed compiling 
>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>>>  Time taken: 0.144 seconds
>>>> INFO  : Executing 
>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>>>  select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the 
>>>> future versions. Consider using a different execution engine (i.e. spark, 
>>>> tez) or using Hive 1.X releases.
>>>> INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be 
>>>> available in the future versions. Consider using a different execution 
>>>> engine (i.e. spark, tez) or using Hive 1.X releases.
>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in 
>>>> the future versions. Consider using a different execution engine (i.e. 
>>>> spark, tez) or using Hive 1.X releases.
>>>> INFO  : Query ID = 
>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>> INFO  : Total jobs = 1
>>>> INFO  : Launching Job 1 out of 1
>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>> INFO  : Number of reduce tasks determined at compile time: 1
>>>> INFO  : In order to change the average load for a reducer (in bytes):
>>>> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>> INFO  : In order to limit the maximum number of reducers:
>>>> INFO  :   set hive.exec.reducers.max=<number>
>>>> INFO  : In order to set a constant number of reducers:
>>>> INFO  :   set mapreduce.job.reduces=<number>
>>>> WARN  : Hadoop command-line option parsing not performed. Implement the 
>>>> Tool interface and execute your application with ToolRunner to remedy this.
>>>> INFO  : number of splits:22
>>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>> INFO  : The url to track the job: 
>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>> INFO  : Starting Job = job_1463956731753_0005, Tracking URL = 
>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill 
>>>> job_1463956731753_0005
>>>> INFO  : Hadoop job information for Stage-1: number of mappers: 22; number 
>>>> of reducers: 1
>>>> INFO  : 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 
>>>> 4.56 sec
>>>> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative 
>>>> CPU 4.56 sec
>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 
>>>> 9.17 sec
>>>> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative 
>>>> CPU 9.17 sec
>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 
>>>> 14.04 sec
>>>> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, 
>>>> Cumulative CPU 14.04 sec
>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 
>>>> 18.64 sec
>>>> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, 
>>>> Cumulative CPU 18.64 sec
>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 
>>>> 23.25 sec
>>>> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, 
>>>> Cumulative CPU 23.25 sec
>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 
>>>> 27.84 sec
>>>> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, 
>>>> Cumulative CPU 27.84 sec
>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 
>>>> 32.56 sec
>>>> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, 
>>>> Cumulative CPU 32.56 sec
>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 
>>>> 37.1 sec
>>>> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, 
>>>> Cumulative CPU 37.1 sec
>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 
>>>> 41.74 sec
>>>> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, 
>>>> Cumulative CPU 41.74 sec
>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 
>>>> 46.32 sec
>>>> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, 
>>>> Cumulative CPU 46.32 sec
>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 
>>>> 50.93 sec
>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 
>>>> 55.55 sec
>>>> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, 
>>>> Cumulative CPU 50.93 sec
>>>> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, 
>>>> Cumulative CPU 55.55 sec
>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 
>>>> 60.25 sec
>>>> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, 
>>>> Cumulative CPU 60.25 sec
>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 
>>>> 64.86 sec
>>>> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, 
>>>> Cumulative CPU 64.86 sec
>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 
>>>> 69.41 sec
>>>> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, 
>>>> Cumulative CPU 69.41 sec
>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
>>>> 74.06 sec
>>>> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, 
>>>> Cumulative CPU 74.06 sec
>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
>>>> 78.72 sec
>>>> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, 
>>>> Cumulative CPU 78.72 sec
>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 
>>>> 83.32 sec
>>>> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, 
>>>> Cumulative CPU 83.32 sec
>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 
>>>> 87.9 sec
>>>> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, 
>>>> Cumulative CPU 87.9 sec
>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 
>>>> 92.52 sec
>>>> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, 
>>>> Cumulative CPU 92.52 sec
>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 
>>>> 97.35 sec
>>>> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, 
>>>> Cumulative CPU 97.35 sec
>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
>>>> 99.6 sec
>>>> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, 
>>>> Cumulative CPU 99.6 sec
>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
>>>> 101.4 sec
>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>> Ended Job = job_1463956731753_0005
>>>> MapReduce Jobs Launched:
>>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 
>>>> 5318569 HDFS Write: 46 SUCCESS
>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>> OK
>>>> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, 
>>>> Cumulative CPU 101.4 sec
>>>> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>> INFO  : Ended Job = job_1463956731753_0005
>>>> INFO  : MapReduce Jobs Launched:
>>>> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   
>>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>> INFO  : Completed executing 
>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>>>  Time taken: 142.525 seconds
>>>> INFO  : OK
>>>> +-----+------------+---------------+-----------------------+--+
>>>> | c0  |     c1     |      c2       |          c3           |
>>>> +-----+------------+---------------+-----------------------+--+
>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>> +-----+------------+---------------+-----------------------+--+
>>>> 1 row selected (142.744 seconds)
>>>>  
>>>> OK, Hive on the MapReduce engine took 142 seconds compared to 58 seconds with 
>>>> Hive on Spark. So you can obviously gain quite a bit by using Hive on 
>>>> Spark.
>>>>  
>>>> Please also note that I did not use any vendor's build for this purpose. I 
>>>> compiled Spark 1.3.1 myself.
>>>>  
>>>> HTH
>>>>  
>>>>  
>>>> Dr Mich Talebzadeh
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>>> 
>>> 
>> 
> 
