I don’t think that would be a good comparison. If memory serves, Tez with LLAP runs a separate engine that is constantly up, no?
Spark? That runs under Hive… Unless you’re suggesting that the Spark context is constantly running as part of HiveServer2?

> On May 23, 2016, at 6:51 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Hi Mich,
>
> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, as well as different types of computation. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows. They are also not good for frequently writing small amounts of data (e.g. sensor data). Here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
> Finally, even if you have a lot of data, you need to think about whether you always have to process everything. For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the full data set.
>
> As always, it depends :)
>
> Best regards
>
> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage to bring the two together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring both together.
>
> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I have done a number of extensive tests using spark-shell with a Hive database and ORC tables.
>>
>> Now, one issue that we typically face is, and I quote:
>>
>> "Spark is fast as it uses memory and a DAG. Great, but when we save data it is not fast enough."
>>
>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
>>
>> I have made some comparisons on this set-up and I am sure some of you will find them useful.
>>
>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>> The version of Hive I use is Hive 2.
>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
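If anyone wants to try this set-up: as far as I understand it, the switch Mich describes is just the per-session engine property plus whatever spark.* properties your cluster needs. A rough sketch follows; hive.execution.engine is the standard Hive property, but the master and memory values below are placeholders for illustration, not Mich's actual settings. The same property also accepts tez or mr, which is how the comparison further down is done.

    -- switch the execution engine for this Hive session (spark, tez or mr)
    set hive.execution.engine=spark;
    -- Spark-side properties can be set from the same session; example values only
    set spark.master=yarn-client;
    set spark.executor.memory=2g;
    -- the SQL itself is unchanged whichever engine runs it
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;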
>>
>> An example: I am using the Hive on Spark engine to find the min and max of IDs for a table with 1 billion rows:
>>
>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>
>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>
>> INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>> INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>> INFO : Total jobs = 1
>> INFO : Launching Job 1 out of 1
>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>
>> Query Hive on Spark job[0] stages:
>> 0
>> 1
>> Status: Running (Hive on Spark job[0])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>> INFO : Query Hive on Spark job[0] stages:
>> INFO : 0
>> INFO : 1
>> INFO : Status: Running (Hive on Spark job[0])
>> INFO : Job Progress Format
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>> Status: Finished successfully in 53.25 seconds
>> OK
>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>> INFO : Status: Finished successfully in 53.25 seconds
>> INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>> INFO : OK
>> +-----+------------+---------------+-----------------------+--+
>> | c0  | c1         | c2            | c3                    |
>> +-----+------------+---------------+-----------------------+--+
>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>> +-----+------------+---------------+-----------------------+--+
>> 1 row selected (58.529 seconds)
>>
>> 58 seconds for the first run with a cold cache is pretty good.
>>
>> And let us compare it with running the same query on the map-reduce engine:
>>
>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> No rows affected (0.007 seconds)
>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>> Total jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks determined at compile time: 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapreduce.job.reduces=<number>
>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>> INFO : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO : Semantic Analysis Completed
>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>> INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>> INFO : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> INFO : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>> INFO : Total jobs = 1
>> INFO : Launching Job 1 out of 1
>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>> INFO : Number of reduce tasks determined at compile time: 1
>> INFO : In order to change the average load for a reducer (in bytes):
>> INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
>> INFO : In order to limit the maximum number of reducers:
>> INFO :   set hive.exec.reducers.max=<number>
>> INFO : In order to set a constant number of reducers:
>> INFO :   set mapreduce.job.reduces=<number>
>> WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>> INFO : number of splits:22
>> INFO : Submitting tokens for job: job_1463956731753_0005
>> INFO : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>> INFO : 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>> Ended Job = job_1463956731753_0005
>> MapReduce Jobs Launched:
>> Stage-Stage-1: Map: 22  Reduce: 1  Cumulative CPU: 101.4 sec  HDFS Read: 5318569  HDFS Write: 46  SUCCESS
>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>> OK
>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>> INFO : Ended Job = job_1463956731753_0005
>> INFO : MapReduce Jobs Launched:
>> INFO : Stage-Stage-1: Map: 22  Reduce: 1  Cumulative CPU: 101.4 sec  HDFS Read: 5318569  HDFS Write: 46  SUCCESS
>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>> INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>> INFO : OK
>> +-----+------------+---------------+-----------------------+--+
>> | c0  | c1         | c2            | c3                    |
>> +-----+------------+---------------+-----------------------+--+
>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>> +-----+------------+---------------+-----------------------+--+
>> 1 row selected (142.744 seconds)
>>
>> OK, Hive on the MapReduce engine took 142 seconds compared with 58 seconds for Hive on Spark, so you clearly gain a good deal by using Hive on Spark.
>>
>> Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com/
>>
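On Jörn's point about evaluating models on a sample before touching the full table: in Hive that idea can be sketched with TABLESAMPLE. The sample table name and the 1-in-100 fraction below are made-up placeholders for illustration, not anything from Mich's test.

    -- materialise a small random sample of the big table (name and fraction are illustrative)
    create table oraclehadoop.dummy_sample stored as orc as
    select * from oraclehadoop.dummy tablesample(bucket 1 out of 100 on rand()) s;

    -- iterate cheaply against the sample ...
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy_sample;

    -- ... and run only the chosen query or model on the full table
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;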