Spark in relation to Tez could be like a Flink runner for Apache Beam? The use case for Tez, however, may be interesting (though the current implementation is YARN-based only?)
Spark is efficient (or faster) for a number of reasons, including its ‘in-memory’ execution (from my understanding and experiments). If one really cares to dive in, it is enough to read their papers, which explain very well the optimization framework (graph-specific, MPP db, Catalyst, ML pipelines etc.) that Spark became after the initial RDD implementation. What Spark is missing is a way of reaching its users with a good ‘production’ level, good documentation and feedback from the masters of this unique piece. Just an opinion. Best, Ovidiu > On 30 May 2016, at 21:49, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Yep, Hortonworks supports Tez for one reason or another, which I am > hopefully going to test as the query engine for Hive, though I think Spark will > be faster because of its in-memory support. > > Also, if you are independent then you are better off dealing with Spark and Hive > without the need to support another stack like Tez. > > Cloudera supports Impala instead of Hive, but it is not something I have used. > > HTH > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > > On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote: > Mich, > > Most people use vendor releases because they need to have the support. > Hortonworks is the vendor with the most skin in the game when it comes to > Tez. > > If memory serves, Tez isn’t going to be M/R but a local execution engine? > Then LLAP is the in-memory piece to speed up Tez? > > HTH > > -Mike > >> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >> >> Thanks. I think the problem is that the Tez user group is exceptionally >> quiet. 
Just sent an email to the Hive user group to see if anyone has managed to >> build a vendor-independent version. >> >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> http://talebzadehmich.wordpress.com >> >> >> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote: >> Well, I think it is different from MR. It has some optimizations which you do >> not find in MR. Especially the LLAP option in Hive 2 makes it interesting. >> >> I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is >> integrated in the Hortonworks distribution. >> >> >> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >> >>> Hi Jörn, >>> >>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from >>> the Tez user group kindly gave a hand, but I could not get very far (or maybe I >>> did not make enough effort) making it work. >>> >>> That Tez user group is very quiet as well. >>> >>> My understanding is that Tez is MR with DAG, but of course Spark has both plus >>> in-memory capability. >>> >>> It would be interesting to see which version of Tez works as the execution >>> engine with Hive. >>> >>> Vendors are divided on this (use Hive with Tez, or use Impala instead of >>> Hive, etc.), as I am sure you already know. 
>>> >>> Cheers, >>> >>> >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote: >>> Very interesting. Do you also plan a test with Tez? >>> >>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I did another study of Hive using the Spark engine compared to Hive with MR. >>>> >>>> Basically, I took the original table imported using Sqoop and created and >>>> populated a new ORC table partitioned by year and month into 48 partitions >>>> as follows: >>>> >>>> <sales_partition.PNG> >>>> >>>> Connections use JDBC via beeline. Now, with MR it takes >>>> an average of 17 minutes per partition, as seen below. That is >>>> just an individual partition, and there are 48 partitions. >>>> >>>> In contrast, doing the same operation with the Spark engine took 10 minutes all >>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime >>>> below: >>>> >>>> <image.png> >>>> >>>> This by no means indicates that Spark is much better than MR, but it shows >>>> that some very good results can be achieved using the Spark engine. 
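A quick back-of-the-envelope check of the figures quoted above (roughly 17 minutes per partition under MR across 48 partitions, versus about 10 minutes all-inclusive with the Spark engine). The numbers come from the message; the arithmetic is only an illustration, since the MR runs were not actually carried out for all 48 partitions:

```python
# Rough arithmetic on the timings quoted in the message above
# (~17 min per partition under MR, 48 partitions, vs ~10 min
# all-inclusive with the Spark engine). Illustration only.
mr_minutes_per_partition = 17
partitions = 48
spark_total_minutes = 10

mr_total_minutes = mr_minutes_per_partition * partitions
print(f"MR estimate: {mr_total_minutes} min (~{mr_total_minutes / 60:.1f} hours)")
print(f"Spark engine: {spark_total_minutes} min all-inclusive")
print(f"Rough ratio: ~{mr_total_minutes / spark_total_minutes:.0f}x")
```

This extrapolation assumes each partition load takes about the same time under MR, which the message suggests but does not verify.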
>>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>>> Hi, >>>> >>>> We use Hive as the database and use Spark as an all-purpose query tool. >>>> >>>> Whether Hive is the right database for the purpose, or one is better off with >>>> something like Phoenix on HBase, well, the answer is it depends and your >>>> mileage varies. >>>> >>>> So, fit for purpose. >>>> >>>> Ideally what one wants is to use the fastest method to get the results. How >>>> fast is confined by our SLA agreements in production, and that keeps us >>>> from unnecessary further work, as we technologists like to play around. >>>> >>>> So in short, we use Spark most of the time and use Hive as the backend >>>> engine for data storage, mainly ORC tables. >>>> >>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1, for now, we have a >>>> combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but >>>> at the moment that is one of my projects. >>>> >>>> We do not use any vendor's products, as that enables us to move away from >>>> being tied down, after years of SAP, Oracle and MS dependency, to yet >>>> another vendor. Besides, there is some politics going on, with one promoting >>>> Tez and another Spark as a backend. That is fine, but obviously we prefer >>>> an independent assessment ourselves. >>>> >>>> My gut feeling is that one needs to look at the use case. Recently we had >>>> to import a very large table from Oracle to Hive and decided to use Spark >>>> 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC >>>> connection with a temp table and it was good. 
We could have used Sqoop but >>>> decided to settle on Spark, so it all depends on the use case. >>>> >>>> HTH >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote: >>>> Hi >>>> >>>> Thanks for the very useful stats. >>>> >>>> Did you have any benchmark for using Spark as the backend engine for Hive vs >>>> using the Spark thrift server (and running Spark code for Hive queries)? We are >>>> using the latter, but it would be very useful to remove the thrift server, if we can. >>>> >>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> wrote: >>>> >>>> Hi Mich, >>>> >>>> I think these comparisons are useful. One interesting aspect could be >>>> hardware scalability in this context. Additionally, different types of >>>> computations. Furthermore, one could compare Spark and Tez+LLAP as >>>> execution engines. I have the gut feeling that each one can be justified >>>> by different use cases. >>>> Nevertheless, there should always be a disclaimer for such comparisons, >>>> because Spark and Hive are not good for a lot of concurrent lookups of >>>> single rows. They are not good for frequently writing small amounts of data >>>> (e.g. sensor data). Here HBase could be more interesting. Other use cases >>>> can justify graph databases, such as Titan, or text analytics/data >>>> matching using Solr on Hadoop. >>>> Finally, even if you have a lot of data, you need to think about whether you always >>>> have to process everything. 
For instance, I have found valid use cases in >>>> practice where we decided to evaluate 10 machine learning models in >>>> parallel on only a sample of the data and then evaluate only the "winning" model on >>>> the total data. >>>> >>>> As always, it depends :) >>>> >>>> Best regards >>>> >>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive >>>> 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how >>>> to manage bringing both together. You may also check Apache Bigtop (a vendor-neutral >>>> distribution) for how they managed to bring both together. >>>> >>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I have done a number of extensive tests using spark-shell with the Hive DB >>>>> and ORC tables. >>>>> >>>>> Now, one issue that we typically face is, and I quote: >>>>> >>>>> "Spark is fast as it uses memory and DAG. Great, but when we save data it >>>>> is not fast enough." >>>>> >>>>> OK, but there is a solution now. If you use Spark with Hive and you are on >>>>> a decent version of Hive (>= 0.14), then you can also deploy Spark as the >>>>> execution engine for Hive. That will make your application run pretty >>>>> fast, as you no longer rely on the old MapReduce engine for Hive. In a >>>>> nutshell, you gain speed in both querying and storage. >>>>> >>>>> I have made some comparisons on this set-up and I am sure some of you >>>>> will find it useful. >>>>> >>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6. >>>>> The version of Hive I use is Hive 2. >>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and >>>>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out >>>>> the Hadoop libraries mismatch). 
>>>>> >>>>> An example: I am using Hive on the Spark engine to find the min and max of IDs >>>>> for a table with 1 billion rows: >>>>> >>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), >>>>> stddev(id) from oraclehadoop.dummy; >>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>> >>>>> >>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151 >>>>> >>>>> INFO : Completed compiling >>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>> Time taken: 1.911 seconds >>>>> INFO : Executing >>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): >>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>> INFO : Query ID = >>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>> INFO : Total jobs = 1 >>>>> INFO : Launching Job 1 out of 1 >>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>> >>>>> Query Hive on Spark job[0] stages: >>>>> 0 >>>>> 1 >>>>> Status: Running (Hive on Spark job[0]) >>>>> Job Progress Format >>>>> CurrentTime StageId_StageAttemptId: >>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>> [StageCost] >>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>> INFO : >>>>> Query Hive on Spark job[0] stages: >>>>> INFO : 0 >>>>> INFO : 1 >>>>> INFO : >>>>> Status: Running (Hive on Spark job[0]) >>>>> INFO : Job Progress Format >>>>> CurrentTime StageId_StageAttemptId: >>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>> [StageCost] >>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 
Stage-1_0: 0/1 >>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: 0(+1)/1 >>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 >>>>> Finished >>>>> Status: Finished successfully in 53.25 seconds >>>>> OK >>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished >>>>> Stage-1_0: 0(+1)/1 >>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished >>>>> Stage-1_0: 1/1 Finished >>>>> INFO : Status: Finished successfully in 53.25 seconds >>>>> INFO : Completed executing >>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>> Time taken: 56.337 seconds >>>>> INFO : OK >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | c0 | c1 | c2 | c3 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> 1 row selected (58.529 seconds) >>>>> >>>>> 58 seconds first run with cold cache is pretty good >>>>> >>>>> And let us compare it with running the same query on map-reduce engine >>>>> >>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr; >>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future >>>>> versions. Consider using a different execution engine (i.e. spark, tez) >>>>> or using Hive 1.X releases. >>>>> No rows affected (0.007 seconds) >>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), >>>>> stddev(id) from oraclehadoop.dummy; >>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>> the future versions. Consider using a different execution engine (i.e. >>>>> spark, tez) or using Hive 1.X releases. 
>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>> Total jobs = 1 >>>>> Launching Job 1 out of 1 >>>>> Number of reduce tasks determined at compile time: 1 >>>>> In order to change the average load for a reducer (in bytes): >>>>> set hive.exec.reducers.bytes.per.reducer=<number> >>>>> In order to limit the maximum number of reducers: >>>>> set hive.exec.reducers.max=<number> >>>>> In order to set a constant number of reducers: >>>>> set mapreduce.job.reduces=<number> >>>>> Starting Job = job_1463956731753_0005, Tracking URL = >>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>> <http://localhost.localdomain:8088/proxy/application_1463956731753_0005/> >>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>> job_1463956731753_0005 >>>>> Hadoop job information for Stage-1: number of mappers: 22; number of >>>>> reducers: 1 >>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>> INFO : Compiling >>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>> INFO : Semantic Analysis Completed >>>>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, >>>>> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), >>>>> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, >>>>> type:double, comment:null)], properties:null) >>>>> INFO : Completed compiling >>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>> Time taken: 0.144 seconds >>>>> INFO : Executing >>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>> the future versions. Consider using a different execution engine (i.e. 
>>>>> spark, tez) or using Hive 1.X releases. >>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be >>>>> available in the future versions. Consider using a different execution >>>>> engine (i.e. spark, tez) or using Hive 1.X releases. >>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>> the future versions. Consider using a different execution engine (i.e. >>>>> spark, tez) or using Hive 1.X releases. >>>>> INFO : Query ID = >>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>> INFO : Total jobs = 1 >>>>> INFO : Launching Job 1 out of 1 >>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>> INFO : Number of reduce tasks determined at compile time: 1 >>>>> INFO : In order to change the average load for a reducer (in bytes): >>>>> INFO : set hive.exec.reducers.bytes.per.reducer=<number> >>>>> INFO : In order to limit the maximum number of reducers: >>>>> INFO : set hive.exec.reducers.max=<number> >>>>> INFO : In order to set a constant number of reducers: >>>>> INFO : set mapreduce.job.reduces=<number> >>>>> WARN : Hadoop command-line option parsing not performed. Implement the >>>>> Tool interface and execute your application with ToolRunner to remedy >>>>> this. 
>>>>> INFO : number of splits:22 >>>>> INFO : Submitting tokens for job: job_1463956731753_0005 >>>>> INFO : The url to track the job: >>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>> <http://localhost.localdomain:8088/proxy/application_1463956731753_0005/> >>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = >>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>> <http://localhost.localdomain:8088/proxy/application_1463956731753_0005/> >>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>> job_1463956731753_0005 >>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number >>>>> of reducers: 1 >>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU >>>>> 4.56 sec >>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, >>>>> Cumulative CPU 4.56 sec >>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU >>>>> 9.17 sec >>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, >>>>> Cumulative CPU 9.17 sec >>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU >>>>> 14.04 sec >>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, >>>>> Cumulative CPU 14.04 sec >>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU >>>>> 18.64 sec >>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, >>>>> Cumulative CPU 18.64 sec >>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU >>>>> 23.25 sec >>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, >>>>> Cumulative CPU 23.25 sec >>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU >>>>> 27.84 sec >>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, >>>>> Cumulative CPU 27.84 sec >>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 
0%, Cumulative CPU >>>>> 32.56 sec >>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, >>>>> Cumulative CPU 32.56 sec >>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU >>>>> 37.1 sec >>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, >>>>> Cumulative CPU 37.1 sec >>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU >>>>> 41.74 sec >>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, >>>>> Cumulative CPU 41.74 sec >>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU >>>>> 46.32 sec >>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, >>>>> Cumulative CPU 46.32 sec >>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU >>>>> 50.93 sec >>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU >>>>> 55.55 sec >>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, >>>>> Cumulative CPU 50.93 sec >>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, >>>>> Cumulative CPU 55.55 sec >>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU >>>>> 60.25 sec >>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, >>>>> Cumulative CPU 60.25 sec >>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU >>>>> 64.86 sec >>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, >>>>> Cumulative CPU 64.86 sec >>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU >>>>> 69.41 sec >>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, >>>>> Cumulative CPU 69.41 sec >>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU >>>>> 74.06 sec >>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, >>>>> Cumulative CPU 74.06 sec >>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU >>>>> 78.72 sec >>>>> INFO : 2016-05-23 00:28:21,134 
Stage-1 map = 77%, reduce = 0%, >>>>> Cumulative CPU 78.72 sec >>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU >>>>> 83.32 sec >>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, >>>>> Cumulative CPU 83.32 sec >>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU >>>>> 87.9 sec >>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, >>>>> Cumulative CPU 87.9 sec >>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU >>>>> 92.52 sec >>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, >>>>> Cumulative CPU 92.52 sec >>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU >>>>> 97.35 sec >>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, >>>>> Cumulative CPU 97.35 sec >>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU >>>>> 99.6 sec >>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, >>>>> Cumulative CPU 99.6 sec >>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative >>>>> CPU 101.4 sec >>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec >>>>> Ended Job = job_1463956731753_0005 >>>>> MapReduce Jobs Launched: >>>>> Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS >>>>> Read: 5318569 HDFS Write: 46 SUCCESS >>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec >>>>> OK >>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, >>>>> Cumulative CPU 101.4 sec >>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec >>>>> INFO : Ended Job = job_1463956731753_0005 >>>>> INFO : MapReduce Jobs Launched: >>>>> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec >>>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS >>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec >>>>> INFO : Completed executing >>>>> 
command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>> Time taken: 142.525 seconds >>>>> INFO : OK >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | c0 | c1 | c2 | c3 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> 1 row selected (142.744 seconds) >>>>> >>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with >>>>> Hive on Spark, so you can obviously gain quite a bit by using Hive on >>>>> Spark. >>>>> >>>>> Please also note that I did not use any vendor's build for this purpose. >>>>> I compiled Spark 1.3.1 myself. >>>>> >>>>> HTH >>>>> >>>>> >>>>> Dr Mich Talebzadeh >>>>> >>>>> LinkedIn >>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>> >>>>> http://talebzadehmich.wordpress.com/ >>>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Ayan Guha >>>> >>>> >>> >> > >
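For what it's worth, the end-to-end timings reported by beeline in the transcripts above (142.744 s on MR versus 58.529 s on Spark) work out to roughly a 2.4x difference, and the Hive-on-Spark progress lines can be decoded mechanically. A small sketch follows; the regex and field names are my own, inferred from the "Job Progress Format" header printed in the log, not from any Hive API:

```python
import re

# End-to-end timings reported by beeline in the thread above.
mr_seconds = 142.744
spark_seconds = 58.529
print(f"Hive on Spark was ~{mr_seconds / spark_seconds:.2f}x faster on this query")

# Decode a Hive-on-Spark progress line. The format, per the log's own
# "Job Progress Format" header, is:
#   CurrentTime StageId_StageAttemptId:
#   SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
line = "2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1"
stage_re = re.compile(r"Stage-(\d+)_(\d+):\s+(\d+)(?:\(\+(\d+)(?:-(\d+))?\))?/(\d+)")
for stage, attempt, done, running, failed, total in stage_re.findall(line):
    print(f"stage {stage} (attempt {attempt}): {done}/{total} done, "
          f"{running or 0} running, {failed or 0} failed")
```

Note that the comparison is a single cold-cache run on one query, so, as Jörn says above, it should be read with the usual disclaimers about use-case dependence.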