I don’t think that would be a good comparison. If memory serves, Tez with LLAP runs a separate engine that is constantly up, no?
Spark? That runs under Hive… Unless you’re suggesting that the Spark context is constantly running as part of HiveServer2?

> On May 23, 2016, at 6:51 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Hi Mich,
>
> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, as well as different types of computation. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows. They are also not good for frequently writing small amounts of data (e.g. sensor data). Here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
> Finally, even if you have a lot of data, you need to think about whether you always have to process everything. For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the full data set.
>
> As always, it depends :)
>
> Best regards
>
> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage to bring the two together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring both together.
>
> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I have done a number of extensive tests using spark-shell with a Hive database and ORC tables.
>>
>> Now, one issue that we typically face is, and I quote:
>>
>> "Spark is fast as it uses memory and a DAG. Great, but when we save data it is not fast enough."
>>
>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
>>
>> I have made some comparisons on this set-up and I am sure some of you will find them useful.
>>
>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>> The version of Hive I use is Hive 2.
>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
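If anyone wants to try this set-up: as far as I understand it, the switch Mich describes is just the per-session engine property plus whatever spark.* properties your cluster needs. A rough sketch follows; hive.execution.engine is the standard Hive property, but the master and memory values below are placeholders for illustration, not Mich's actual settings. The same property also accepts tez or mr, which is how the comparison further down is done.

    -- switch the execution engine for this Hive session (spark, tez or mr)
    set hive.execution.engine=spark;
    -- Spark-side properties can be set from the same session; example values only
    set spark.master=yarn-client;
    set spark.executor.memory=2g;
    -- the SQL itself is unchanged whichever engine runs it
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;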
>>
>> An example: I am using the Hive on Spark engine to find the min and max of IDs for a table with 1 billion rows:
>>
>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>
>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>
>> INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>> INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>> INFO : Total jobs = 1
>> INFO : Launching Job 1 out of 1
>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>
>> Query Hive on Spark job[0] stages:
>> 0
>> 1
>> Status: Running (Hive on Spark job[0])
>> Job Progress Format
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>> INFO : Query Hive on Spark job[0] stages:
>> INFO : 0
>> INFO : 1
>> INFO : Status: Running (Hive on Spark job[0])
>> INFO : Job Progress Format
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>> Status: Finished successfully in 53.25 seconds
>> OK
>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
>> INFO : Status: Finished successfully in 53.25 seconds
>> INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>> INFO : OK
>> +-----+------------+---------------+-----------------------+--+
>> | c0  | c1         | c2            | c3                    |
>> +-----+------------+---------------+-----------------------+--+
>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>> +-----+------------+---------------+-----------------------+--+
>> 1 row selected (58.529 seconds)
>>
>> 58 seconds for the first run with a cold cache is pretty good.
>>
>> And let us compare it with running the same query on the map-reduce engine:
>>
>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> No rows affected (0.007 seconds)
>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>> Total jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks determined at compile time: 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapreduce.job.reduces=<number>
>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>> INFO : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> INFO : Semantic Analysis Completed
>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>> INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>> INFO : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>> INFO : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>> INFO : Total jobs = 1
>> INFO : Launching Job 1 out of 1
>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>> INFO : Number of reduce tasks determined at compile time: 1
>> INFO : In order to change the average load for a reducer (in bytes):
>> INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
>> INFO : In order to limit the maximum number of reducers:
>> INFO :   set hive.exec.reducers.max=<number>
>> INFO : In order to set a constant number of reducers:
>> INFO :   set mapreduce.job.reduces=<number>
>> WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>> INFO : number of splits:22
>> INFO : Submitting tokens for job: job_1463956731753_0005
>> INFO : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>> INFO : 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>> Ended Job = job_1463956731753_0005
>> MapReduce Jobs Launched:
>> Stage-Stage-1: Map: 22  Reduce: 1  Cumulative CPU: 101.4 sec  HDFS Read: 5318569  HDFS Write: 46  SUCCESS
>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>> OK
>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>> INFO : Ended Job = job_1463956731753_0005
>> INFO : MapReduce Jobs Launched:
>> INFO : Stage-Stage-1: Map: 22  Reduce: 1  Cumulative CPU: 101.4 sec  HDFS Read: 5318569  HDFS Write: 46  SUCCESS
>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>> INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>> INFO : OK
>> +-----+------------+---------------+-----------------------+--+
>> | c0  | c1         | c2            | c3                    |
>> +-----+------------+---------------+-----------------------+--+
>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>> +-----+------------+---------------+-----------------------+--+
>> 1 row selected (142.744 seconds)
>>
>> OK, Hive on the MapReduce engine took 142 seconds compared with 58 seconds for Hive on Spark, so you clearly gain a good deal by using Hive on Spark.
>>
>> Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com/
>>
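On Jörn's point about evaluating models on a sample before touching the full table: in Hive that idea can be sketched with TABLESAMPLE. The sample table name and the 1-in-100 fraction below are made-up placeholders for illustration, not anything from Mich's test.

    -- materialise a small random sample of the big table (name and fraction are illustrative)
    create table oraclehadoop.dummy_sample stored as orc as
    select * from oraclehadoop.dummy tablesample(bucket 1 out of 100 on rand()) s;

    -- iterate cheaply against the sample ...
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy_sample;

    -- ... and run only the chosen query or model on the full table
    select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;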