Hi,
Thanks a lot for such an interesting comparison. But some important questions
remain to be addressed:

1) How do you make two versions of Spark live together on the same cluster
(library clashes, paths, etc.)?
Most Spark users run ETL and ML workloads on Spark as well, so we may end up
with three Spark installations simultaneously.
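
For instance, is it enough to keep each build in its own directory and export
SPARK_HOME per job, something like the following (hypothetical paths):

    /usr/local/spark-1.6.1    # spark-shell, ETL and ML jobs
    /usr/local/spark-1.3.1    # reserved for Hive
    export SPARK_HOME=/usr/local/spark-1.6.1    # set per shell / per job

or do the jars still end up clashing on the classpath?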

2) How stable is such a setup under INSERT / UPDATE / CTAS operations?
Are there problems writing into specific tables or directories, ORC / Parquet
peculiarities, or memory / timeout parameters that need tuning?
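
By way of example, here is the kind of statement I would want to see proven
stable, together with the knobs that typically need tuning (table names are
made up; the timeout property is from the Hive docs, if I read them right):

    set hive.execution.engine=spark;
    set spark.executor.memory=4g;                   -- per-executor heap
    set hive.spark.client.connect.timeout=30000ms;  -- driver handshake timeout
    create table sales_orc stored as orc as
    select * from staging.sales where dt = '2016-05-22';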

3) How stable is such a setup in a multi-user / multi-tenant production
environment, where several people run different queries simultaneously?

It's impossible to restart Spark masters and workers several times a day and
to tune them constantly.
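
If I understand correctly, every HiveServer2 session spawns its own Spark
application, so at a minimum each one has to be capped, e.g. on a standalone
cluster:

    set spark.cores.max=8;           -- cap cores per user session
    set spark.executor.memory=2g;    -- keep sessions from starving each other

Is that the whole tuning story, or does it turn into constant firefighting?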


On Mon, May 23, 2016 at 2:42 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
>
>
> I have done a number of extensive tests using Spark-shell with Hive DB and
> ORC tables.
>
>
>
> Now, one issue that we typically face is, and I quote:
>
>
>
> Spark is fast as it uses memory and a DAG. Great, but when we save data it
> is not fast enough.
>
> OK, but there is a solution now. If you use Spark with Hive and you are on
> a decent version of Hive (>= 0.14), then you can also deploy Spark as the
> execution engine for Hive. That will make your application run pretty fast,
> as you no longer rely on the old MapReduce engine for Hive. In a nutshell,
> you gain speed in both querying and storage.
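>
> Switching engines is then a one-liner per beeline session (the master URL
> below is a placeholder; point it at your own standalone master or at YARN):
>
> set hive.execution.engine=spark;
> set spark.master=spark://rhes564:7077;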
>
>
>
> I have made some comparisons on this set-up, and I am sure some of you will
> find them useful.
>
>
>
> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>
> The version of Hive I use is Hive 2.
>
> The version of Spark I use as the Hive execution engine is 1.3.1. It works,
> and frankly Spark 1.3.1 as an execution engine is adequate (until we sort
> out the Hadoop libraries mismatch).
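>
> To run the 1.6 and 1.3.1 builds side by side, one approach (assuming each
> build sits in its own directory) is to point Hive at its own copy via the
> spark.home property, for example:
>
> set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;  -- illustrative path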
>
>
>
> As an example, here is Hive on the Spark engine finding the min and max of
> IDs for a table with 100 million rows:
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
> stddev(id) from oraclehadoop.dummy;
>
> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>
>
>
>
>
> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>
>
>
> INFO  : Completed compiling
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
> Time taken: 1.911 seconds
>
> INFO  : Executing
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>
> INFO  : Total jobs = 1
>
> INFO  : Launching Job 1 out of 1
>
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>
>
>
> Query Hive on Spark job[0] stages:
>
> 0
>
> 1
>
> Status: Running (Hive on Spark job[0])
>
> Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
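>
> (Reading the 00:21:26 line below as an example: "13(+9)/22" means 13 tasks
> succeeded, 9 are running, none have failed, out of 22 tasks in Stage-0.)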
>
> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>
> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>
> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>
> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>
> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>
> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1
> Finished
>
> Status: Finished successfully in 53.25 seconds
>
> OK
>
> INFO  : Completed executing
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
> Time taken: 56.337 seconds
>
> +-----+------------+---------------+-----------------------+--+
>
> | c0  |     c1     |      c2       |          c3           |
>
> +-----+------------+---------------+-----------------------+--+
>
> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>
> +-----+------------+---------------+-----------------------+--+
>
> 1 row selected (58.529 seconds)
>
>
>
> 58 seconds for the first run with a cold cache is pretty good.
>
>
>
> And let us compare it with running the same query on the MapReduce engine:
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
> No rows affected (0.007 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
> stddev(id) from oraclehadoop.dummy;
>
> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>
> Total jobs = 1
>
> Launching Job 1 out of 1
>
> Number of reduce tasks determined at compile time: 1
>
> In order to change the average load for a reducer (in bytes):
>
>   set hive.exec.reducers.bytes.per.reducer=<number>
>
> In order to limit the maximum number of reducers:
>
>   set hive.exec.reducers.max=<number>
>
> In order to set a constant number of reducers:
>
>   set mapreduce.job.reduces=<number>
>
> Starting Job = job_1463956731753_0005, Tracking URL =
> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1463956731753_0005
>
> Hadoop job information for Stage-1: number of mappers: 22; number of
> reducers: 1
>
> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>
> INFO  : Compiling
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>
> INFO  : Semantic Analysis Completed
>
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0,
> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null),
> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3,
> type:double, comment:null)], properties:null)
>
> INFO  : Completed compiling
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
> Time taken: 0.144 seconds
>
> INFO  : Executing
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>
> WARN  : Hadoop command-line option parsing not performed. Implement the
> Tool interface and execute your application with ToolRunner to remedy this.
>
> INFO  : number of splits:22
>
> INFO  : Submitting tokens for job: job_1463956731753_0005
>
> INFO  : The url to track the job:
> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>
> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
>
> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
>
> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 14.04 sec
>
> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 18.64 sec
>
> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 23.25 sec
>
> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 27.84 sec
>
> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 32.56 sec
>
> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 37.1 sec
>
> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 41.74 sec
>
> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 46.32 sec
>
> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 50.93 sec
>
> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 55.55 sec
>
> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 60.25 sec
>
> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 64.86 sec
>
> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 69.41 sec
>
> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 74.06 sec
>
> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 78.72 sec
>
> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 83.32 sec
>
> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 87.9 sec
>
> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 92.52 sec
>
> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 97.35 sec
>
> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 99.6 sec
>
> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 101.4 sec
>
> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>
> Ended Job = job_1463956731753_0005
>
> MapReduce Jobs Launched:
>
> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read:
> 5318569 HDFS Write: 46 SUCCESS
>
> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>
> OK
>
> INFO  : Completed executing
> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
> Time taken: 142.525 seconds
>
> +-----+------------+---------------+-----------------------+--+
>
> | c0  |     c1     |      c2       |          c3           |
>
> +-----+------------+---------------+-----------------------+--+
>
> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>
> +-----+------------+---------------+-----------------------+--+
>
> 1 row selected (142.744 seconds)
>
>
>
> OK, Hive on the MapReduce engine took 142 seconds compared with 58 seconds
> for Hive on Spark. So you can obviously gain quite a lot by using Hive on Spark.
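>
> That works out to roughly a 2.4x wall-clock improvement on the same
> cold-cache query (142.744 s / 58.529 s = 2.44).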
>
>
>
> Please also note that I did not use any vendor's build for this purpose; I
> compiled Spark 1.3.1 myself.
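>
> For anyone who wants to do the same, the command is roughly as follows (a
> sketch, not the exact invocation; profiles depend on your Hadoop version,
> and the Hive on Spark guide recommends building Spark without the Hive
> profile to avoid the library clashes mentioned above):
>
> ./make-distribution.sh --name hadoop2-without-hive --tgz \
>     -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0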
>
>
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com/
>
>
>
