Appreciate all the comments.

Hive on Spark: Spark runs as an execution engine and is only used when you
query Hive; otherwise it is not running. I run it in YARN client mode. Let
me show you an example.

In hive-site.xml set the execution engine to spark. It requires some
configuration but it does work :)
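
For example, the corresponding hive-site.xml entry is just this property
(plus whatever spark.* settings your cluster needs, along the lines of the
session settings below):

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>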

Alternatively, log in to Hive and apply the settings in the session:


set hive.execution.engine=spark;
set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
set spark.master=yarn-client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.cores=8;
set spark.ui.port=7777;
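
To check what the session is using, issue set with the key and no value and
Hive echoes the current setting back:

set hive.execution.engine;
hive.execution.engine=spark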

A small test ride:

First, using Hive 2 on Spark 1.3.1 to find max(id) for a 100-million-row
Parquet table:

hive> select max(id) from oraclehadoop.dummy_parquet;

Starting Spark Job = a7752b2b-d73a-45de-aced-ddf02810938d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 17:41:52,386 Stage-2_0: 0(+8)/24     Stage-3_0: 0/1
2016-07-11 17:41:55,409 Stage-2_0: 1(+8)/24     Stage-3_0: 0/1
2016-07-11 17:41:56,420 Stage-2_0: 8(+4)/24     Stage-3_0: 0/1
2016-07-11 17:41:58,434 Stage-2_0: 10(+2)/24    Stage-3_0: 0/1
2016-07-11 17:41:59,440 Stage-2_0: 12(+8)/24    Stage-3_0: 0/1
2016-07-11 17:42:01,455 Stage-2_0: 17(+7)/24    Stage-3_0: 0/1
2016-07-11 17:42:02,462 Stage-2_0: 20(+4)/24    Stage-3_0: 0/1
2016-07-11 17:42:04,476 Stage-2_0: 23(+1)/24    Stage-3_0: 0/1
2016-07-11 17:42:05,483 Stage-2_0: 24/24 Finished       Stage-3_0: 1/1
Finished

Status: Finished successfully in 14.12 seconds
OK
100000000
Time taken: 14.38 seconds, Fetched: 1 row(s)

-- simply switch the engine in Hive to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy_parquet;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Starting Job = job_1468226887011_0005, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0005
Hadoop job information for Stage-1: number of mappers: 24; number of
reducers: 1
2016-07-11 17:42:46,904 Stage-1 map = 0%,  reduce = 0%
2016-07-11 17:42:56,328 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
31.76 sec
2016-07-11 17:43:05,676 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU
61.78 sec
2016-07-11 17:43:16,091 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
95.44 sec
2016-07-11 17:43:24,419 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
121.6 sec
2016-07-11 17:43:32,734 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU
149.37 sec
2016-07-11 17:43:41,031 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU
177.62 sec
2016-07-11 17:43:48,305 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU
204.92 sec
2016-07-11 17:43:56,580 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU
235.34 sec
2016-07-11 17:44:05,917 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU
262.18 sec
2016-07-11 17:44:14,222 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU
286.21 sec
2016-07-11 17:44:22,502 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU
310.34 sec
2016-07-11 17:44:32,923 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
346.26 sec
2016-07-11 17:44:43,301 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU
379.11 sec
2016-07-11 17:44:53,674 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU
417.9 sec
2016-07-11 17:45:04,001 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU
450.73 sec
2016-07-11 17:45:13,327 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU
476.7 sec
2016-07-11 17:45:22,656 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU
508.97 sec
2016-07-11 17:45:33,002 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU
535.69 sec
2016-07-11 17:45:43,355 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU
573.33 sec
2016-07-11 17:45:52,613 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
605.01 sec
2016-07-11 17:46:02,962 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU
632.38 sec
2016-07-11 17:46:13,316 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU
666.45 sec
2016-07-11 17:46:23,656 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
693.72 sec
2016-07-11 17:46:31,919 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
714.71 sec
2016-07-11 17:46:36,060 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
721.83 sec
MapReduce Total cumulative CPU time: 12 minutes 1 seconds 830 msec
Ended Job = job_1468226887011_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 24  Reduce: 1   Cumulative CPU: 721.83 sec   HDFS Read:
400442823 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 12 minutes 1 seconds 830 msec
OK
100000000
Time taken: 239.532 seconds, Fetched: 1 row(s)


I leave it to you guys to guess which one is better: roughly 14 seconds
versus 239 seconds :)

Cheers


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 17:02, Michael Segel <msegel_had...@hotmail.com> wrote:

> Just a clarification.
>
> Tez is ‘vendor’ independent.  ;-)
>
> Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the
> deck in their favor.
>
> Drill could be in the same boat, although there are now more committers who
> are not working for MapR. I’m not sure who outside of HW is supporting Tez.
>
> But I digress.
>
> Here in the Spark user list, I have to ask: how do you run Hive on Spark?
> Is the execution engine … the Spark context always running? (Client mode I
> assume.)
> Are the executors always running? Can you run multiple queries from
> multiple users in parallel?
>
> These are some of the questions that should be asked and answered when
> considering how viable Spark is going to be as the engine under Hive…
>
> Thx
>
> -Mike
>
> On May 29, 2016, at 3:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Thanks. I think the problem is that the Tez user group is exceptionally
> quiet. I just sent an email to the Hive user group to see if anyone has
> managed to build a vendor-independent version.
>
>
>
>
>
> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Well, I think it is different from MR. It has some optimizations which you
>> do not find in MR. Especially the LLAP option in Hive 2 makes it
>> interesting.
>>
>> I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for
>> 1.2 it is integrated in the Hortonworks distribution.
>>
>>
>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi Jorn,
>>
>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys
>> from the Tez user group kindly gave a hand, but I could not get very far
>> (or maybe I did not make enough effort) to make it work.
>>
>> That Tez user group is very quiet as well.
>>
>> My understanding is that Tez is MR with DAG, but of course Spark has both,
>> plus in-memory capability.
>>
>> It would be interesting to see which version of Tez works as the execution
>> engine with Hive.
>>
>> Vendors are divided on this (use Hive with Tez, or use Impala instead of
>> Hive, etc.), as I am sure you already know.
>>
>> Cheers,
>>
>>
>>
>>
>>
>>
>>
>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Very interesting. Do you also plan a test with Tez?
>>>
>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>>
>>> Basically I took the original table imported using Sqoop, and created and
>>> populated a new ORC table partitioned by year and month into 48 partitions,
>>> as follows (screenshot attached; a rough sketch of the DDL is below):
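>>>
>>> The sketch uses illustrative column names (assumptions, not the real
>>> schema) and a dynamic-partition insert to populate the 48 (year, month)
>>> partitions:
>>>
>>> CREATE TABLE sales_orc (
>>>   prod_id      INT,
>>>   cust_id      INT,
>>>   time_id      TIMESTAMP,
>>>   amount_sold  DECIMAL(10,2)
>>> )
>>> PARTITIONED BY (year INT, month INT)
>>> STORED AS ORC;
>>>
>>> -- allow dynamic values for both partition keys
>>> set hive.exec.dynamic.partition.mode=nonstrict;
>>>
>>> INSERT INTO TABLE sales_orc PARTITION (year, month)
>>> SELECT prod_id, cust_id, time_id, amount_sold,
>>>        YEAR(time_id), MONTH(time_id)
>>> FROM sales;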
>>>
>>> <sales_partition.PNG>
>>> Connections use JDBC via Beeline. With MR, each partition takes an average
>>> of 17 minutes, as seen below. That is just one individual partition, and
>>> there are 48 of them.
>>>
>>> In contrast, doing the same operation with the Spark engine took 10 minutes
>>> all-inclusive. I just gave up on MR. You can see the StartTime and
>>> FinishTime below:
>>>
>>> <image.png>
>>>
>>> This by no means indicates that Spark is always much better than MR, but it
>>> shows that some very good results can be achieved using the Spark engine.
>>>
>>>
>>>
>>>
>>>
>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We use Hive as the database and Spark as an all-purpose query tool.
>>>>
>>>> Whether Hive is the right database for the purpose, or one is better off
>>>> with something like Phoenix on HBase, well, the answer is it depends and
>>>> your mileage varies.
>>>>
>>>> So fit for purpose.
>>>>
>>>> Ideally one wants to use the fastest method to get the results. How fast
>>>> is bounded by our SLA agreements in production, and that keeps us from
>>>> unnecessary further work, as we technologists like to play around.
>>>>
>>>> So in short, we use Spark most of the time and use Hive as the backend
>>>> engine for data storage, mainly ORC tables.
>>>>
>>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a combination
>>>> that works for now. Granted, it would help to use Hive 2 on Spark 1.6.1,
>>>> but at the moment that is one of my projects.
>>>>
>>>> We do not use any vendor's products, as that lets us avoid being tied down
>>>> to yet another vendor after years of SAP, Oracle and MS dependency.
>>>> Besides, there is some politics going on, with one vendor promoting Tez
>>>> and another Spark as the backend. That is fine, but obviously we prefer to
>>>> make an independent assessment ourselves.
>>>>
>>>> My gut feeling is that one needs to look at the use case. Recently we had
>>>> to import a very large table from Oracle to Hive, and decided to use Spark
>>>> 1.6.1 with Hive 2 on Spark 1.3.1; that worked fine. We just used a JDBC
>>>> connection with a temp table and it was good. We could have used Sqoop but
>>>> decided to settle for Spark, so it all depends on the use case.
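>>>>
>>>> A minimal sketch of that approach in Spark 1.6 Scala (the host, credentials
>>>> and table names here are made-up placeholders, and sqlContext is assumed to
>>>> be a HiveContext, as in a spark-shell built with Hive support):
>>>>
>>>> // pull the Oracle table over JDBC into a DataFrame
>>>> val df = sqlContext.read.format("jdbc").
>>>>   option("url", "jdbc:oracle:thin:@oracleHost:1521:mydb").
>>>>   option("dbtable", "scott.bigtable").
>>>>   option("user", "scott").
>>>>   option("password", "xxxxx").
>>>>   load()
>>>>
>>>> // expose it as a temp table, then persist it as an ORC table in Hive
>>>> df.registerTempTable("tmp_bigtable")
>>>> sqlContext.sql(
>>>>   "CREATE TABLE oraclehadoop.bigtable STORED AS ORC AS SELECT * FROM tmp_bigtable")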
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Thanks for very useful stats.
>>>>>
>>>>> Do you have any benchmark for using Spark as the backend engine for Hive
>>>>> vs using the Spark Thrift Server (and running Spark code for Hive
>>>>> queries)? We are using the latter, but it would be very useful to remove
>>>>> the Thrift Server, if we can.
>>>>>
>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> I think these comparisons are useful. One interesting aspect could be
>>>>>> hardware scalability in this context, and additionally different types
>>>>>> of computations. Furthermore, one could compare Spark and Tez+LLAP as
>>>>>> execution engines. I have the gut feeling that each one can be justified
>>>>>> by different use cases.
>>>>>> Nevertheless, there should always be a disclaimer for such comparisons,
>>>>>> because Spark and Hive are not good for a lot of concurrent lookups of
>>>>>> single rows. They are also not good for frequently writing small amounts
>>>>>> of data (e.g. sensor data); here HBase could be more interesting. Other
>>>>>> use cases can justify graph databases, such as Titan, or text
>>>>>> analytics/data matching using Solr on Hadoop.
>>>>>> Finally, even if you have a lot of data, you need to think about whether
>>>>>> you always have to process everything. For instance, I have found valid
>>>>>> use cases in practice where we decided to evaluate 10 machine learning
>>>>>> models in parallel on only a sample of the data, and only evaluate the
>>>>>> "winning" model on the total data.
>>>>>>
>>>>>> As always it depends :)
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with
>>>>>> Hive 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described
>>>>>> somewhere how to bring the two together. You may also check Apache
>>>>>> Bigtop (a vendor-neutral distribution) for how they managed to bring
>>>>>> both together.
>>>>>>
>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I have done a number of extensive tests using Spark-shell with Hive
>>>>>> DB and ORC tables.
>>>>>>
>>>>>>
>>>>>> Now, one issue that we typically face is, and I quote:
>>>>>>
>>>>>>
>>>>>> "Spark is fast as it uses memory and DAG. Great, but when we save data
>>>>>> it is not fast enough."
>>>>>>
>>>>>> OK, but there is a solution now. If you use Spark with Hive and you are
>>>>>> on a decent version of Hive >= 0.14, then you can also deploy Spark as
>>>>>> the execution engine for Hive. That will make your application run
>>>>>> pretty fast, as you no longer rely on the old MapReduce engine for Hive.
>>>>>> In a nutshell, you gain speed in both querying and storage.
>>>>>>
>>>>>>
>>>>>> I have made some comparisons on this set-up and I am sure some of you
>>>>>> will find it useful.
>>>>>>
>>>>>>
>>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is
>>>>>> 1.6.
>>>>>> The version of Hive I use is Hive 2.
>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It
>>>>>> works, and frankly Spark 1.3.1 as an execution engine is adequate (until
>>>>>> we sort out the Hadoop libraries mismatch).
>>>>>>
>>>>>>
>>>>>> As an example, using the Hive on Spark engine to find the min and max
>>>>>> of IDs for a table with 100 million rows:
>>>>>>
>>>>>>
>>>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id),
>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>>
>>>>>>
>>>>>> INFO  : Completed compiling
>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>>>>> Time taken: 1.911 seconds
>>>>>> INFO  : Executing
>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>> INFO  : Query ID =
>>>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>> INFO  : Total jobs = 1
>>>>>> INFO  : Launching Job 1 out of 1
>>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>
>>>>>>
>>>>>> Query Hive on Spark job[0] stages:
>>>>>> 0
>>>>>> 1
>>>>>> Status: Running (Hive on Spark job[0])
>>>>>> Job Progress Format
>>>>>> CurrentTime StageId_StageAttemptId:
>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>>>>> [StageCost]
>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>>> INFO  :
>>>>>> Query Hive on Spark job[0] stages:
>>>>>> INFO  : 0
>>>>>> INFO  : 1
>>>>>> INFO  :
>>>>>> Status: Running (Hive on Spark job[0])
>>>>>> INFO  : Job Progress Format
>>>>>> CurrentTime StageId_StageAttemptId:
>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>>>>> [StageCost]
>>>>>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>>>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0:
>>>>>> 0(+1)/1
>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0:
>>>>>> 1/1 Finished
>>>>>> Status: Finished successfully in 53.25 seconds
>>>>>> OK
>>>>>> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished
>>>>>> Stage-1_0: 0(+1)/1
>>>>>> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished
>>>>>> Stage-1_0: 1/1 Finished
>>>>>> INFO  : Status: Finished successfully in 53.25 seconds
>>>>>> INFO  : Completed executing
>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>>>>> Time taken: 56.337 seconds
>>>>>> INFO  : OK
>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>> 1 row selected (58.529 seconds)
>>>>>>
>>>>>>
>>>>>> 58 seconds for the first run with a cold cache is pretty good.
>>>>>>
>>>>>>
>>>>>> And let us compare it with running the same query on map-reduce engine
>>>>>>
>>>>>>
>>>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the
>>>>>> future versions. Consider using a different execution engine (i.e. spark,
>>>>>> tez) or using Hive 1.X releases.
>>>>>> No rows affected (0.007 seconds)
>>>>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id),
>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available
>>>>>> in the future versions. Consider using a different execution engine (i.e.
>>>>>> spark, tez) or using Hive 1.X releases.
>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>> Total jobs = 1
>>>>>> Launching Job 1 out of 1
>>>>>> Number of reduce tasks determined at compile time: 1
>>>>>> In order to change the average load for a reducer (in bytes):
>>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>> In order to limit the maximum number of reducers:
>>>>>>   set hive.exec.reducers.max=<number>
>>>>>> In order to set a constant number of reducers:
>>>>>>   set mapreduce.job.reduces=<number>
>>>>>> Starting Job = job_1463956731753_0005, Tracking URL =
>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
>>>>>> job_1463956731753_0005
>>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of
>>>>>> reducers: 1
>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>>>> INFO  : Compiling
>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>> INFO  : Semantic Analysis Completed
>>>>>> INFO  : Returning Hive schema:
>>>>>> Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null),
>>>>>> FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2,
>>>>>> type:double, comment:null), FieldSchema(name:c3, type:double,
>>>>>> comment:null)], properties:null)
>>>>>> INFO  : Completed compiling
>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>>>>> Time taken: 0.144 seconds
>>>>>> INFO  : Executing
>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available
>>>>>> in the future versions. Consider using a different execution engine (i.e.
>>>>>> spark, tez) or using Hive 1.X releases.
>>>>>> INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
>>>>>> available in the future versions. Consider using a different execution
>>>>>> engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available
>>>>>> in the future versions. Consider using a different execution engine (i.e.
>>>>>> spark, tez) or using Hive 1.X releases.
>>>>>> INFO  : Query ID =
>>>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>> INFO  : Total jobs = 1
>>>>>> INFO  : Launching Job 1 out of 1
>>>>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>>>> INFO  : Number of reduce tasks determined at compile time: 1
>>>>>> INFO  : In order to change the average load for a reducer (in bytes):
>>>>>> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>> INFO  : In order to limit the maximum number of reducers:
>>>>>> INFO  :   set hive.exec.reducers.max=<number>
>>>>>> INFO  : In order to set a constant number of reducers:
>>>>>> INFO  :   set mapreduce.job.reduces=<number>
>>>>>> WARN  : Hadoop command-line option parsing not performed. Implement
>>>>>> the Tool interface and execute your application with ToolRunner to remedy
>>>>>> this.
>>>>>> INFO  : number of splits:22
>>>>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>>>> INFO  : The url to track the job:
>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>> INFO  : Starting Job = job_1463956731753_0005, Tracking URL =
>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job
>>>>>> -kill job_1463956731753_0005
>>>>>> INFO  : Hadoop job information for Stage-1: number of mappers: 22;
>>>>>> number of reducers: 1
>>>>>> INFO  : 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative
>>>>>> CPU 4.56 sec
>>>>>> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%,
>>>>>> Cumulative CPU 4.56 sec
>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative
>>>>>> CPU 9.17 sec
>>>>>> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%,
>>>>>> Cumulative CPU 9.17 sec
>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative
>>>>>> CPU 14.04 sec
>>>>>> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%,
>>>>>> Cumulative CPU 14.04 sec
>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative
>>>>>> CPU 18.64 sec
>>>>>> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%,
>>>>>> Cumulative CPU 18.64 sec
>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative
>>>>>> CPU 23.25 sec
>>>>>> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%,
>>>>>> Cumulative CPU 23.25 sec
>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative
>>>>>> CPU 27.84 sec
>>>>>> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%,
>>>>>> Cumulative CPU 27.84 sec
>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative
>>>>>> CPU 32.56 sec
>>>>>> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%,
>>>>>> Cumulative CPU 32.56 sec
>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative
>>>>>> CPU 37.1 sec
>>>>>> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%,
>>>>>> Cumulative CPU 37.1 sec
>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative
>>>>>> CPU 41.74 sec
>>>>>> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%,
>>>>>> Cumulative CPU 41.74 sec
>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative
>>>>>> CPU 46.32 sec
>>>>>> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%,
>>>>>> Cumulative CPU 46.32 sec
>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative
>>>>>> CPU 50.93 sec
>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative
>>>>>> CPU 55.55 sec
>>>>>> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%,
>>>>>> Cumulative CPU 50.93 sec
>>>>>> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%,
>>>>>> Cumulative CPU 55.55 sec
>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative
>>>>>> CPU 60.25 sec
>>>>>> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%,
>>>>>> Cumulative CPU 60.25 sec
>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative
>>>>>> CPU 64.86 sec
>>>>>> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%,
>>>>>> Cumulative CPU 64.86 sec
>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative
>>>>>> CPU 69.41 sec
>>>>>> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%,
>>>>>> Cumulative CPU 69.41 sec
>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative
>>>>>> CPU 74.06 sec
>>>>>> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%,
>>>>>> Cumulative CPU 74.06 sec
>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative
>>>>>> CPU 78.72 sec
>>>>>> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%,
>>>>>> Cumulative CPU 78.72 sec
>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative
>>>>>> CPU 83.32 sec
>>>>>> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%,
>>>>>> Cumulative CPU 83.32 sec
>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative
>>>>>> CPU 87.9 sec
>>>>>> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%,
>>>>>> Cumulative CPU 87.9 sec
>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative
>>>>>> CPU 92.52 sec
>>>>>> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%,
>>>>>> Cumulative CPU 92.52 sec
>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative
>>>>>> CPU 97.35 sec
>>>>>> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%,
>>>>>> Cumulative CPU 97.35 sec
>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative
>>>>>> CPU 99.6 sec
>>>>>> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%,
>>>>>> Cumulative CPU 99.6 sec
>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%,
>>>>>> Cumulative CPU 101.4 sec
>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>> Ended Job = job_1463956731753_0005
>>>>>> MapReduce Jobs Launched:
>>>>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS
>>>>>> Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>> OK
>>>>>> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%,
>>>>>> Cumulative CPU 101.4 sec
>>>>>> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400
>>>>>> msec
>>>>>> INFO  : Ended Job = job_1463956731753_0005
>>>>>> INFO  : MapReduce Jobs Launched:
>>>>>> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4
>>>>>> sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>> INFO  : Completed executing
>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>>>>> Time taken: 142.525 seconds
>>>>>> INFO  : OK
>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>> 1 row selected (142.744 seconds)
>>>>>>
>>>>>>
>>>>>> OK: Hive on the MapReduce engine took 142 seconds compared to 58 seconds
>>>>>> with Hive on Spark. So you can obviously gain quite a lot by using Hive
>>>>>> on Spark.
>>>>>>
>>>>>>
>>>>>> Please also note that I did not use any vendor's build for this
>>>>>> purpose. I compiled Spark 1.3.1 myself.
>>>>>>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>
>
>
