Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution. > On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi Jorn, > > I started building apache-tez-0.8.2 but got a few errors. A couple of guys from > the TEZ user group kindly gave a hand but I could not get very far (or maybe I > did not make enough of an effort) making it work. > > That TEZ user group is very quiet as well. > > My understanding is TEZ is MR with DAG, but of course Spark has both plus > in-memory capability. > > It would be interesting to see what version of TEZ works as an execution engine > with Hive. > > Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, > etc.), as I am sure you already know. > > Cheers, > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > >> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote: >> Very interesting. Do you also plan a test with TEZ? >> >>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>> >>> Hi, >>> >>> I did another study of Hive using the Spark engine compared to Hive with MR. >>> >>> Basically I took the original table imported using Sqoop and created and >>> populated a new ORC table partitioned by year and month into 48 partitions >>> as follows: >>> >>> <sales_partition.PNG> >>> >>> Connections use JDBC via beeline. Now for each partition using MR it takes >>> an average of 17 minutes, as seen below for each PARTITION. Now that is >>> just an individual partition and there are 48 partitions. >>> >>> In contrast, doing the same operation with the Spark engine took 10 minutes all >>> inclusive. I just gave up on MR. 
You can see the StartTime and FinishTime >>> from below >>> >>> <image.png> >>> >>> This by no means indicates that Spark is much better than MR, but it shows >>> that some very good results can be achieved using the Spark engine. >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>>> Hi, >>>> >>>> We use Hive as the database and use Spark as an all-purpose query tool. >>>> >>>> Whether Hive is the right database for the purpose, or whether one is better off with >>>> something like Phoenix on HBase, well, the answer is it depends and your >>>> mileage varies. >>>> >>>> So fit for purpose. >>>> >>>> Ideally what one wants is to use the fastest method to get the results. How >>>> fast is confined by our SLA agreements in production, and that saves us >>>> from unnecessary further work, as we technologists like to play around. >>>> >>>> So in short, we use Spark most of the time and use Hive as the backend >>>> engine for data storage, mainly ORC tables. >>>> >>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 for now we have a >>>> combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but >>>> at the moment that is one of my projects. >>>> >>>> We do not use any vendor's products, as that enables us to avoid being tied down >>>> to yet another vendor after years of SAP, Oracle and MS dependency. >>>> Besides, there is some politics going on with one vendor promoting >>>> Tez and another Spark as a backend. That is fine, but obviously we prefer >>>> an independent assessment ourselves. >>>> >>>> My gut feeling is that one needs to look at the use case. Recently we had >>>> to import a very large table from Oracle to Hive and decided to use Spark >>>> 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. 
We just used a JDBC >>>> connection with a temp table and it was good. We could have used Sqoop but >>>> decided to settle for Spark, so it all depends on the use case. >>>> >>>> HTH >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote: >>>>> Hi >>>>> >>>>> Thanks for the very useful stats. >>>>> >>>>> Did you have any benchmark for using Spark as the backend engine for Hive vs >>>>> using the Spark thrift server (and running Spark code for Hive queries)? We are >>>>> using the latter, but it would be very useful to remove the thrift server, if we can. >>>>> >>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> Hi Mich, >>>>>> >>>>>> I think these comparisons are useful. One interesting aspect could be >>>>>> hardware scalability in this context. Additionally, different types of >>>>>> computations. Furthermore, one could compare Spark and Tez+LLAP as >>>>>> execution engines. I have the gut feeling that each one can be >>>>>> justified by different use cases. >>>>>> Nevertheless, there should always be a disclaimer for such comparisons, >>>>>> because Spark and Hive are not good for a lot of concurrent lookups of >>>>>> single rows. They are not good for frequent writes of small amounts of >>>>>> data (e.g. sensor data). Here HBase could be more interesting. Other use >>>>>> cases can justify graph databases, such as Titan, or text analytics/ >>>>>> data matching using Solr on Hadoop. >>>>>> Finally, even if you have a lot of data, you need to think about whether you always >>>>>> have to process everything. For instance, I have found valid use cases >>>>>> in practice where we decided to evaluate 10 machine learning models in >>>>>> parallel on only a sample of the data and only evaluate the "winning" model >>>>>> on the total of the data. 
>>>>>> >>>>>> As always, it depends :) >>>>>> >>>>>> Best regards >>>>>> >>>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive >>>>>> 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how >>>>>> to manage bringing both together. You may also check Apache Bigtop >>>>>> (a vendor-neutral distribution) for how they managed to bring both together. >>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I have done a number of extensive tests using Spark-shell with Hive DB >>>>>>> and ORC tables. >>>>>>> >>>>>>> Now one issue that we typically face is, and I quote: >>>>>>> >>>>>>> Spark is fast as it uses Memory and DAG. Great, but when we save data it >>>>>>> is not fast enough. >>>>>>> >>>>>>> OK, but there is a solution now. If you use Spark with Hive and you are >>>>>>> on a decent version of Hive (>= 0.14), then you can also deploy Spark as the >>>>>>> execution engine for Hive. That will make your application run pretty >>>>>>> fast, as you no longer rely on the old Map-Reduce engine for Hive. In a >>>>>>> nutshell, you gain speed in both querying and storage. >>>>>>> >>>>>>> I have made some comparisons on this set-up and I am sure some of you >>>>>>> will find it useful. >>>>>>> >>>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is >>>>>>> 1.6. >>>>>>> The version of Hive I use is Hive 2. >>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, >>>>>>> and frankly Spark 1.3.1 as an execution engine is adequate (until we >>>>>>> sort out the Hadoop libraries mismatch). 
>>>>>>> >>>>>>> An example: I am using the Hive on Spark engine to find the min and max of >>>>>>> IDs for a table with 100 million rows: >>>>>>> >>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), >>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy; >>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>>>> >>>>>>> >>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151 >>>>>>> >>>>>>> INFO : Completed compiling >>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>>>> Time taken: 1.911 seconds >>>>>>> INFO : Executing >>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): >>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>>>> INFO : Query ID = >>>>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>>>> INFO : Total jobs = 1 >>>>>>> INFO : Launching Job 1 out of 1 >>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>>>> >>>>>>> Query Hive on Spark job[0] stages: >>>>>>> 0 >>>>>>> 1 >>>>>>> Status: Running (Hive on Spark job[0]) >>>>>>> Job Progress Format >>>>>>> CurrentTime StageId_StageAttemptId: >>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>>>> [StageCost] >>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>>>> INFO : >>>>>>> Query Hive on Spark job[0] stages: >>>>>>> INFO : 0 >>>>>>> INFO : 1 >>>>>>> INFO : >>>>>>> Status: Running (Hive on Spark job[0]) >>>>>>> INFO : Job Progress Format >>>>>>> CurrentTime StageId_StageAttemptId: >>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>>>> [StageCost] >>>>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>>>> INFO : 2016-05-23 00:21:20,070 
Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: >>>>>>> 0(+1)/1 >>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 >>>>>>> Finished >>>>>>> Status: Finished successfully in 53.25 seconds >>>>>>> OK >>>>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished >>>>>>> Stage-1_0: 0(+1)/1 >>>>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished >>>>>>> Stage-1_0: 1/1 Finished >>>>>>> INFO : Status: Finished successfully in 53.25 seconds >>>>>>> INFO : Completed executing >>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>>>> Time taken: 56.337 seconds >>>>>>> INFO : OK >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | c0 | c1 | c2 | c3 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> 1 row selected (58.529 seconds) >>>>>>> >>>>>>> 58 seconds first run with cold cache is pretty good >>>>>>> >>>>>>> And let us compare it with running the same query on map-reduce engine >>>>>>> >>>>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr; >>>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the >>>>>>> future versions. Consider using a different execution engine (i.e. >>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> No rows affected (0.007 seconds) >>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), >>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy; >>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>>>> the future versions. Consider using a different execution engine (i.e. 
>>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>>>> Total jobs = 1 >>>>>>> Launching Job 1 out of 1 >>>>>>> Number of reduce tasks determined at compile time: 1 >>>>>>> In order to change the average load for a reducer (in bytes): >>>>>>> set hive.exec.reducers.bytes.per.reducer=<number> >>>>>>> In order to limit the maximum number of reducers: >>>>>>> set hive.exec.reducers.max=<number> >>>>>>> In order to set a constant number of reducers: >>>>>>> set mapreduce.job.reduces=<number> >>>>>>> Starting Job = job_1463956731753_0005, Tracking URL = >>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>>>> job_1463956731753_0005 >>>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of >>>>>>> reducers: 1 >>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>>>> INFO : Compiling >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>>>> INFO : Semantic Analysis Completed >>>>>>> INFO : Returning Hive schema: >>>>>>> Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), >>>>>>> FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, >>>>>>> type:double, comment:null), FieldSchema(name:c3, type:double, >>>>>>> comment:null)], properties:null) >>>>>>> INFO : Completed compiling >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>>>> Time taken: 0.144 seconds >>>>>>> INFO : Executing >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>>>> the future versions. Consider using a different execution engine (i.e. 
>>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be >>>>>>> available in the future versions. Consider using a different execution >>>>>>> engine (i.e. spark, tez) or using Hive 1.X releases. >>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>>>> the future versions. Consider using a different execution engine (i.e. >>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> INFO : Query ID = >>>>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>>>> INFO : Total jobs = 1 >>>>>>> INFO : Launching Job 1 out of 1 >>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>>>> INFO : Number of reduce tasks determined at compile time: 1 >>>>>>> INFO : In order to change the average load for a reducer (in bytes): >>>>>>> INFO : set hive.exec.reducers.bytes.per.reducer=<number> >>>>>>> INFO : In order to limit the maximum number of reducers: >>>>>>> INFO : set hive.exec.reducers.max=<number> >>>>>>> INFO : In order to set a constant number of reducers: >>>>>>> INFO : set mapreduce.job.reduces=<number> >>>>>>> WARN : Hadoop command-line option parsing not performed. Implement the >>>>>>> Tool interface and execute your application with ToolRunner to remedy >>>>>>> this. 
>>>>>>> INFO : number of splits:22 >>>>>>> INFO : Submitting tokens for job: job_1463956731753_0005 >>>>>>> INFO : The url to track the job: >>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = >>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>>>> job_1463956731753_0005 >>>>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; >>>>>>> number of reducers: 1 >>>>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU >>>>>>> 4.56 sec >>>>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, >>>>>>> Cumulative CPU 4.56 sec >>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU >>>>>>> 9.17 sec >>>>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, >>>>>>> Cumulative CPU 9.17 sec >>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU >>>>>>> 14.04 sec >>>>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, >>>>>>> Cumulative CPU 14.04 sec >>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU >>>>>>> 18.64 sec >>>>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, >>>>>>> Cumulative CPU 18.64 sec >>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU >>>>>>> 23.25 sec >>>>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, >>>>>>> Cumulative CPU 23.25 sec >>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU >>>>>>> 27.84 sec >>>>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, >>>>>>> Cumulative CPU 27.84 sec >>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU >>>>>>> 32.56 sec >>>>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 
32%, reduce = 0%, >>>>>>> Cumulative CPU 32.56 sec >>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU >>>>>>> 37.1 sec >>>>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, >>>>>>> Cumulative CPU 37.1 sec >>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU >>>>>>> 41.74 sec >>>>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, >>>>>>> Cumulative CPU 41.74 sec >>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU >>>>>>> 46.32 sec >>>>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, >>>>>>> Cumulative CPU 46.32 sec >>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU >>>>>>> 50.93 sec >>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU >>>>>>> 55.55 sec >>>>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, >>>>>>> Cumulative CPU 50.93 sec >>>>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, >>>>>>> Cumulative CPU 55.55 sec >>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU >>>>>>> 60.25 sec >>>>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, >>>>>>> Cumulative CPU 60.25 sec >>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU >>>>>>> 64.86 sec >>>>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, >>>>>>> Cumulative CPU 64.86 sec >>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU >>>>>>> 69.41 sec >>>>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, >>>>>>> Cumulative CPU 69.41 sec >>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU >>>>>>> 74.06 sec >>>>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, >>>>>>> Cumulative CPU 74.06 sec >>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU >>>>>>> 78.72 sec >>>>>>> INFO : 2016-05-23 00:28:21,134 
Stage-1 map = 77%, reduce = 0%, >>>>>>> Cumulative CPU 78.72 sec >>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU >>>>>>> 83.32 sec >>>>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, >>>>>>> Cumulative CPU 83.32 sec >>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU >>>>>>> 87.9 sec >>>>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, >>>>>>> Cumulative CPU 87.9 sec >>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU >>>>>>> 92.52 sec >>>>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, >>>>>>> Cumulative CPU 92.52 sec >>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU >>>>>>> 97.35 sec >>>>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, >>>>>>> Cumulative CPU 97.35 sec >>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative >>>>>>> CPU 99.6 sec >>>>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, >>>>>>> Cumulative CPU 99.6 sec >>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative >>>>>>> CPU 101.4 sec >>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec >>>>>>> Ended Job = job_1463956731753_0005 >>>>>>> MapReduce Jobs Launched: >>>>>>> Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS >>>>>>> Read: 5318569 HDFS Write: 46 SUCCESS >>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec >>>>>>> OK >>>>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, >>>>>>> Cumulative CPU 101.4 sec >>>>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 >>>>>>> msec >>>>>>> INFO : Ended Job = job_1463956731753_0005 >>>>>>> INFO : MapReduce Jobs Launched: >>>>>>> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec >>>>>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS >>>>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 
msec >>>>>>> INFO : Completed executing >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>>>> Time taken: 142.525 seconds >>>>>>> INFO : OK >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | c0 | c1 | c2 | c3 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> 1 row selected (142.744 seconds) >>>>>>> >>>>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds >>>>>>> with Hive on Spark. So you can obviously gain considerably by using Hive >>>>>>> on Spark. >>>>>>> >>>>>>> Please also note that I did not use any vendor's build for this >>>>>>> purpose. I compiled Spark 1.3.1 myself. >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> >>>>>>> Dr Mich Talebzadeh >>>>>>> >>>>>>> LinkedIn >>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>>>> >>>>>>> http://talebzadehmich.wordpress.com/ >>>>> >>>>> >>>>> >>>>> -- >>>>> Best Regards, >>>>> Ayan Guha >
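[Editor's note: the aggregates reported in both runs above can be sanity-checked analytically. The identical results (min = 1, max = 100000000, avg = 5.00000005E7, stddev = 2.8867513459481288E7) are exactly what one gets for an id column holding the integers 1..N with N = 100,000,000: the mean of 1..N is (N+1)/2 and the population standard deviation (which Hive's stddev computes) is sqrt((N^2-1)/12). A minimal sketch in plain Python, assuming the ids are the integers 1..N as the query output suggests:]

```python
import math

# Assumption: oraclehadoop.dummy holds ids 1..N, inferred from the query output
N = 100_000_000

min_id = 1
max_id = N
mean = (N + 1) / 2                    # average of 1..N
stddev = math.sqrt((N * N - 1) / 12)  # population std dev of 1..N

print(min_id, max_id, mean, stddev)
# mean = 5.00000005E7 and stddev ~ 2.8867513E7, matching both engines' output
```

[This confirms that Hive on Spark and Hive on MR produced identical, correct results, so the 58 s vs 142 s comparison is over the same work.]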