Hi,
I have done a number of extensive tests using spark-shell against a Hive database and ORC tables.

One issue that we typically face is, and I quote: "Spark is fast as it uses memory and DAGs. Great, but when we save data it is not fast enough." OK, but there is a solution for that now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.

I have made some comparisons on this set-up and I am sure some of you will find them useful.

The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
The version of Hive I use is Hive 2.
The version of Spark I use as the Hive execution engine is 1.3.1.

It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).

As an example, here is Hive on the Spark engine finding the min, max, average and standard deviation of IDs for a table with 1 billion rows:

0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
Query Hive on Spark job[0] stages:
0
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
Status: Finished successfully in 53.25 seconds
INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
OK
+-----+------------+---------------+-----------------------+--+
| c0  | c1         | c2            | c3                    |
+-----+------------+---------------+-----------------------+--+
| 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
+-----+------------+---------------+-----------------------+--+
1 row selected (58.529 seconds)

58 seconds for the first run with a cold cache is pretty good.
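For reference, switching between engines is just a matter of setting hive.execution.engine at the session level (or in hive-site.xml). Below is a minimal sketch of the kind of beeline settings involved for Hive on Spark; the master URL and executor memory shown are illustrative placeholders rather than my exact configuration, so adjust them for your own cluster:

-- use Spark instead of MapReduce as the Hive execution engine
set hive.execution.engine=spark;
-- illustrative: point Hive at your Spark master (YARN, standalone, etc.)
set spark.master=yarn-client;
-- illustrative: size the Spark executors for your cluster
set spark.executor.memory=2g;

That is how the run above was executed. For the comparison that follows, the only change is to flip hive.execution.engine back to mr.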
Now let us compare it with running the same query on the MapReduce engine:

jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (0.007 seconds)
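(As an aside, if you lose track of which engine a session is on, issuing set with just the property name prints its current value:

set hive.execution.engine;

I have not pasted that output here, but it is a handy sanity check before timing anything.)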
0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
INFO : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
INFO : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
Total jobs = 1
Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:22
INFO : Submitting tokens for job: job_1463956731753_0005
INFO : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
Ended Job = job_1463956731753_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569   HDFS Write: 46   SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
OK
+-----+------------+---------------+-----------------------+--+
| c0  | c1         | c2            | c3                    |
+-----+------------+---------------+-----------------------+--+
| 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
+-----+------------+---------------+-----------------------+--+
1 row selected (142.744 seconds)

Hive on the MapReduce engine took 142 seconds, compared with 58 seconds for Hive on Spark (142.744 s / 58.529 s, roughly a 2.4x speed-up for the same query on the same table). So you can clearly gain a fair amount by using Hive on Spark. Please also note that I did not use any vendor's build for this purpose; I compiled Spark 1.3.1 myself.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com/