Hi,

Thanks a lot for such an interesting comparison. But some important questions remain to be addressed:
1) How do you make two versions of Spark live together on the same cluster (library clashes, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well, so we may end up with three Spark installations simultaneously.
2) How stable is such a construction on INSERT / UPDATE / CTAS operations? Are there any problems with writing into specific tables / directories, ORC / Parquet peculiarities, or memory / timeout parameter tuning?
3) How stable is such a construction in a multi-user / multi-tenant production environment when several people run different queries simultaneously? It's impossible to restart Spark masters and workers several times a day and tune them constantly.

On Mon, May 23, 2016 at 2:42 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Hi,
>
> I have done a number of extensive tests using Spark-shell with Hive DB and
> ORC tables.
>
> Now one issue that we typically face is, and I quote:
>
> "Spark is fast as it uses memory and DAGs. Great, but when we save data it
> is not fast enough."
>
> OK, but there is a solution now. If you use Spark with Hive and you are on
> a decent version of Hive (>= 0.14), then you can also deploy Spark as the
> execution engine for Hive. That will make your application run pretty fast,
> as you no longer rely on the old MapReduce engine for Hive. In a nutshell,
> you gain speed in both querying and storage.
>
> I have made some comparisons on this set-up and I am sure some of you will
> find it useful.
>
> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
> The version of Hive I use is Hive 2.
> The version of Spark I use as the Hive execution engine is 1.3.1. It works,
> and frankly Spark 1.3.1 as an execution engine is adequate (until we sort
> out the Hadoop libraries mismatch).
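[For readers following the thread: deploying Spark as Hive's execution engine comes down to a handful of Hive session/configuration properties. A minimal sketch below; the property names are from the Hive-on-Spark setup documentation, while the path and sizes are illustrative values, not taken from this thread:]

```sql
-- Switch the execution engine for this session (or set it in hive-site.xml):
set hive.execution.engine=spark;
-- Point Hive at a specific Spark installation; this is also what lets a
-- separate Spark build coexist with the one used for ad-hoc Spark queries:
set spark.home=/usr/local/spark-1.3.1-bin-hadoop2.6;
set spark.master=yarn-client;
-- Typical knobs worth tuning before benchmarking:
set spark.executor.memory=2g;
set spark.executor.instances=4;
```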
>
> An example: I am using the Hive on Spark engine to find the min and max of
> IDs for a table with 1 billion rows:
>
> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>
> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>
> INFO  : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
> INFO  : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
> INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>
> Query Hive on Spark job[0] stages:
> 0
> 1
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
> INFO  : Query Hive on Spark job[0] stages:
> INFO  : 0
> INFO  : 1
> INFO  : Status: Running (Hive on Spark job[0])
> INFO  : Job Progress Format
> INFO  : CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
> Status: Finished successfully in 53.25 seconds
> OK
> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
> INFO  : Status: Finished successfully in 53.25 seconds
> INFO  : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
> INFO  : OK
> +-----+------------+---------------+-----------------------+--+
> | c0  |     c1     |      c2       |          c3           |
> +-----+------------+---------------+-----------------------+--+
> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
> +-----+------------+---------------+-----------------------+--+
> 1 row selected (58.529 seconds)
>
> 58 seconds for a first run with a cold cache is pretty good.
>
> Now let us compare it with running the same query on the map-reduce engine:
>
> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (0.007 seconds)
> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
> INFO  : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
> INFO  : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
> INFO  : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> INFO  : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Number of reduce tasks determined at compile time: 1
> INFO  : In order to change the average load for a reducer (in bytes):
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO  : In order to limit the maximum number of reducers:
> INFO  :   set hive.exec.reducers.max=<number>
> INFO  : In order to set a constant number of reducers:
> INFO  :   set mapreduce.job.reduces=<number>
> WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
> INFO  : number of splits:22
> INFO  : Submitting tokens for job: job_1463956731753_0005
> INFO  : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
> INFO  : Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
> INFO  : Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
> INFO  : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
> Ended Job = job_1463956731753_0005
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
> OK
> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
> INFO  : Ended Job = job_1463956731753_0005
> INFO  : MapReduce Jobs Launched:
> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
> INFO  : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
> INFO  : OK
> +-----+------------+---------------+-----------------------+--+
> | c0  |     c1     |      c2       |          c3           |
> +-----+------------+---------------+-----------------------+--+
> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
> +-----+------------+---------------+-----------------------+--+
> 1 row selected (142.744 seconds)
>
> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds
> with Hive on Spark, so you can obviously gain pretty well by using Hive on
> Spark.
>
> Please also note that I did not use any vendor's build for this purpose; I
> compiled Spark 1.3.1 myself.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com/
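[A quick back-of-the-envelope check of the headline comparison, using the two wall-clock timings reported by beeline in the quoted runs above:]

```python
# Speedup of Hive on Spark over Hive on MapReduce for this query,
# taken from the "1 row selected (...)" lines in the two runs.
mr_seconds = 142.744     # Hive on MapReduce
spark_seconds = 58.529   # Hive on Spark

speedup = mr_seconds / spark_seconds
print(f"Hive on Spark is {speedup:.2f}x faster on this query")  # roughly 2.4x
```

So the quoted numbers correspond to roughly a 2.4x speedup for a single cold-cache run of this one aggregation query; whether that holds under the multi-tenant load described in the questions above is exactly what remains to be seen.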