Spark in relation to Tez could be like a Flink runner for Apache Beam? The use case for Tez, however, may be interesting (though the current implementation is YARN-based only?)
Spark is efficient (or faster) for a number of reasons, including its ‘in-memory’ execution (from my understanding and experiments). If one really cares to dive in, it is enough to read their papers, which explain very well the optimization framework (graph-specific, MPP db, Catalyst, ML pipelines etc.) that Spark became after the initial RDD implementation. What Spark is missing is a way of reaching its users with a good ‘production’ level, good documentation and feedback from the masters of this unique piece. Just an opinion. Best, Ovidiu > On 30 May 2016, at 21:49, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Yep, Hortonworks supports Tez for one reason or another, which I am > hopefully going to test as the query engine for Hive, though I think Spark will > be faster because of its in-memory support. > > Also, if you are independent then you are better off dealing with Spark and Hive > without the need to support another stack like Tez. > > Cloudera supports Impala instead of Hive, but it is not something I have used. > > HTH > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > > On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote: > Mich, > > Most people use vendor releases because they need to have the support. > Hortonworks is the vendor with the most skin in the game when it comes to > Tez. > > If memory serves, Tez isn’t going to be M/R but a local execution engine? > Then LLAP is the in-memory piece to speed up Tez? > > HTH > > -Mike > >> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >> >> Thanks. I think the problem is that the Tez user group is exceptionally >> quiet. 
Just sent an email to the Hive user group to see if anyone has managed to >> build a vendor-independent version. >> >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> http://talebzadehmich.wordpress.com >> >> >> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote: >> Well, I think it is different from MR. It has some optimizations which you do >> not find in MR. Especially the LLAP option in Hive 2 makes it interesting. >> >> I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is >> integrated in the Hortonworks distribution. >> >> >> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >> >>> Hi Jörn, >>> >>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from >>> the Tez user group kindly gave a hand, but I could not get very far (or maybe I >>> did not make enough effort) making it work. >>> >>> That Tez user group is very quiet as well. >>> >>> My understanding is that Tez is MR with DAG, but of course Spark has both plus >>> in-memory capability. >>> >>> It would be interesting to see which version of Tez works as the execution >>> engine with Hive. >>> >>> Vendors are divided on this (use Hive with Tez, or use Impala instead of >>> Hive, etc.), as I am sure you already know. 
>>> >>> Cheers, >>> >>> >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote: >>> Very interesting. Do you also plan a test with Tez? >>> >>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I did another study of Hive using the Spark engine compared to Hive with MR. >>>> >>>> Basically, I took the original table imported using Sqoop and created and >>>> populated a new ORC table partitioned by year and month into 48 partitions >>>> as follows: >>>> >>>> <sales_partition.PNG> >>>> >>>> Connections use JDBC via beeline. Now, with MR it takes >>>> an average of 17 minutes per partition, as seen below. That is >>>> just an individual partition, and there are 48 partitions. >>>> >>>> In contrast, doing the same operation with the Spark engine took 10 minutes all >>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime >>>> below: >>>> >>>> <image.png> >>>> >>>> This by no means indicates that Spark is much better than MR, but it shows >>>> that some very good results can be achieved using the Spark engine. 
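A quick back-of-the-envelope check of the figures quoted above (roughly 17 minutes per partition under MR across 48 partitions, versus about 10 minutes all-inclusive with the Spark engine). The numbers come from the message; the arithmetic is only an illustration, since the MR runs were not actually carried out for all 48 partitions:

```python
# Rough arithmetic on the timings quoted in the message above
# (~17 min per partition under MR, 48 partitions, vs ~10 min
# all-inclusive with the Spark engine). Illustration only.
mr_minutes_per_partition = 17
partitions = 48
spark_total_minutes = 10

mr_total_minutes = mr_minutes_per_partition * partitions
print(f"MR estimate: {mr_total_minutes} min (~{mr_total_minutes / 60:.1f} hours)")
print(f"Spark engine: {spark_total_minutes} min all-inclusive")
print(f"Rough ratio: ~{mr_total_minutes / spark_total_minutes:.0f}x")
```

This extrapolation assumes each partition load takes about the same time under MR, which the message suggests but does not verify.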
>>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>>> Hi, >>>> >>>> We use Hive as the database and use Spark as an all-purpose query tool. >>>> >>>> Whether Hive is the right database for the purpose, or one is better off with >>>> something like Phoenix on HBase, well, the answer is it depends and your >>>> mileage varies. >>>> >>>> So, fit for purpose. >>>> >>>> Ideally what one wants is to use the fastest method to get the results. How >>>> fast is confined by our SLA agreements in production, and that keeps us >>>> from unnecessary further work, as we technologists like to play around. >>>> >>>> So in short, we use Spark most of the time and use Hive as the backend >>>> engine for data storage, mainly ORC tables. >>>> >>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1, for now, we have a >>>> combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but >>>> at the moment that is one of my projects. >>>> >>>> We do not use any vendor's products, as that enables us to move away from >>>> being tied down, after years of SAP, Oracle and MS dependency, to yet >>>> another vendor. Besides, there is some politics going on, with one promoting >>>> Tez and another Spark as a backend. That is fine, but obviously we prefer >>>> an independent assessment ourselves. >>>> >>>> My gut feeling is that one needs to look at the use case. Recently we had >>>> to import a very large table from Oracle to Hive and decided to use Spark >>>> 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC >>>> connection with a temp table and it was good. 
We could have used Sqoop but >>>> decided to settle on Spark, so it all depends on the use case. >>>> >>>> HTH >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote: >>>> Hi >>>> >>>> Thanks for the very useful stats. >>>> >>>> Did you have any benchmark for using Spark as the backend engine for Hive vs >>>> using the Spark thrift server (and running Spark code for Hive queries)? We are >>>> using the latter, but it would be very useful to remove the thrift server, if we can. >>>> >>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> wrote: >>>> >>>> Hi Mich, >>>> >>>> I think these comparisons are useful. One interesting aspect could be >>>> hardware scalability in this context. Additionally, different types of >>>> computations. Furthermore, one could compare Spark and Tez+LLAP as >>>> execution engines. I have the gut feeling that each one can be justified >>>> by different use cases. >>>> Nevertheless, there should always be a disclaimer for such comparisons, >>>> because Spark and Hive are not good for a lot of concurrent lookups of >>>> single rows. They are not good for frequently writing small amounts of data >>>> (e.g. sensor data). Here HBase could be more interesting. Other use cases >>>> can justify graph databases, such as Titan, or text analytics/data >>>> matching using Solr on Hadoop. >>>> Finally, even if you have a lot of data, you need to think about whether you always >>>> have to process everything. 
For instance, I have found valid use cases in >>>> practice where we decided to evaluate 10 machine learning models in >>>> parallel on only a sample of the data and then evaluate only the "winning" model on >>>> the total data. >>>> >>>> As always, it depends :) >>>> >>>> Best regards >>>> >>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive >>>> 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how >>>> to manage bringing both together. You may also check Apache Bigtop (a vendor-neutral >>>> distribution) for how they managed to bring both together. >>>> >>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I have done a number of extensive tests using spark-shell with the Hive DB >>>>> and ORC tables. >>>>> >>>>> Now, one issue that we typically face is, and I quote: >>>>> >>>>> "Spark is fast as it uses memory and DAG. Great, but when we save data it >>>>> is not fast enough." >>>>> >>>>> OK, but there is a solution now. If you use Spark with Hive and you are on >>>>> a decent version of Hive (>= 0.14), then you can also deploy Spark as the >>>>> execution engine for Hive. That will make your application run pretty >>>>> fast, as you no longer rely on the old MapReduce engine for Hive. In a >>>>> nutshell, you gain speed in both querying and storage. >>>>> >>>>> I have made some comparisons on this set-up and I am sure some of you >>>>> will find it useful. >>>>> >>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6. >>>>> The version of Hive I use is Hive 2. >>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and >>>>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out >>>>> the Hadoop libraries mismatch). 
>>>>> >>>>> An example: I am using Hive on the Spark engine to find the min and max of IDs >>>>> for a table with 1 billion rows: >>>>> >>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), >>>>> stddev(id) from oraclehadoop.dummy; >>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>> >>>>> >>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151 >>>>> >>>>> INFO : Completed compiling >>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>> Time taken: 1.911 seconds >>>>> INFO : Executing >>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): >>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>> INFO : Query ID = >>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>> INFO : Total jobs = 1 >>>>> INFO : Launching Job 1 out of 1 >>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>> >>>>> Query Hive on Spark job[0] stages: >>>>> 0 >>>>> 1 >>>>> Status: Running (Hive on Spark job[0]) >>>>> Job Progress Format >>>>> CurrentTime StageId_StageAttemptId: >>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>> [StageCost] >>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>> INFO : >>>>> Query Hive on Spark job[0] stages: >>>>> INFO : 0 >>>>> INFO : 1 >>>>> INFO : >>>>> Status: Running (Hive on Spark job[0]) >>>>> INFO : Job Progress Format >>>>> CurrentTime StageId_StageAttemptId: >>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>> [StageCost] >>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 
Stage-1_0: 0/1 >>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: 0(+1)/1 >>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 >>>>> Finished >>>>> Status: Finished successfully in 53.25 seconds >>>>> OK >>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished >>>>> Stage-1_0: 0(+1)/1 >>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished >>>>> Stage-1_0: 1/1 Finished >>>>> INFO : Status: Finished successfully in 53.25 seconds >>>>> INFO : Completed executing >>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>> Time taken: 56.337 seconds >>>>> INFO : OK >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | c0 | c1 | c2 | c3 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> 1 row selected (58.529 seconds) >>>>> >>>>> 58 seconds first run with cold cache is pretty good >>>>> >>>>> And let us compare it with running the same query on map-reduce engine >>>>> >>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr; >>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future >>>>> versions. Consider using a different execution engine (i.e. spark, tez) >>>>> or using Hive 1.X releases. >>>>> No rows affected (0.007 seconds) >>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), >>>>> stddev(id) from oraclehadoop.dummy; >>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>> the future versions. Consider using a different execution engine (i.e. >>>>> spark, tez) or using Hive 1.X releases. 
>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>> Total jobs = 1 >>>>> Launching Job 1 out of 1 >>>>> Number of reduce tasks determined at compile time: 1 >>>>> In order to change the average load for a reducer (in bytes): >>>>> set hive.exec.reducers.bytes.per.reducer=<number> >>>>> In order to limit the maximum number of reducers: >>>>> set hive.exec.reducers.max=<number> >>>>> In order to set a constant number of reducers: >>>>> set mapreduce.job.reduces=<number> >>>>> Starting Job = job_1463956731753_0005, Tracking URL = >>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>> <http://localhost.localdomain:8088/proxy/application_1463956731753_0005/> >>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>> job_1463956731753_0005 >>>>> Hadoop job information for Stage-1: number of mappers: 22; number of >>>>> reducers: 1 >>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>> INFO : Compiling >>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>> INFO : Semantic Analysis Completed >>>>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, >>>>> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), >>>>> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, >>>>> type:double, comment:null)], properties:null) >>>>> INFO : Completed compiling >>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>> Time taken: 0.144 seconds >>>>> INFO : Executing >>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>> the future versions. Consider using a different execution engine (i.e. 
>>>>> spark, tez) or using Hive 1.X releases. >>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be >>>>> available in the future versions. Consider using a different execution >>>>> engine (i.e. spark, tez) or using Hive 1.X releases. >>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>> the future versions. Consider using a different execution engine (i.e. >>>>> spark, tez) or using Hive 1.X releases. >>>>> INFO : Query ID = >>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>> INFO : Total jobs = 1 >>>>> INFO : Launching Job 1 out of 1 >>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>> INFO : Number of reduce tasks determined at compile time: 1 >>>>> INFO : In order to change the average load for a reducer (in bytes): >>>>> INFO : set hive.exec.reducers.bytes.per.reducer=<number> >>>>> INFO : In order to limit the maximum number of reducers: >>>>> INFO : set hive.exec.reducers.max=<number> >>>>> INFO : In order to set a constant number of reducers: >>>>> INFO : set mapreduce.job.reduces=<number> >>>>> WARN : Hadoop command-line option parsing not performed. Implement the >>>>> Tool interface and execute your application with ToolRunner to remedy >>>>> this. 
>>>>> INFO : number of splits:22 >>>>> INFO : Submitting tokens for job: job_1463956731753_0005 >>>>> INFO : The url to track the job: >>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>> <http://localhost.localdomain:8088/proxy/application_1463956731753_0005/> >>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = >>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>> <http://localhost.localdomain:8088/proxy/application_1463956731753_0005/> >>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>> job_1463956731753_0005 >>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number >>>>> of reducers: 1 >>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU >>>>> 4.56 sec >>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, >>>>> Cumulative CPU 4.56 sec >>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU >>>>> 9.17 sec >>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, >>>>> Cumulative CPU 9.17 sec >>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU >>>>> 14.04 sec >>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, >>>>> Cumulative CPU 14.04 sec >>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU >>>>> 18.64 sec >>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, >>>>> Cumulative CPU 18.64 sec >>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU >>>>> 23.25 sec >>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, >>>>> Cumulative CPU 23.25 sec >>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU >>>>> 27.84 sec >>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, >>>>> Cumulative CPU 27.84 sec >>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 
0%, Cumulative CPU >>>>> 32.56 sec >>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, >>>>> Cumulative CPU 32.56 sec >>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU >>>>> 37.1 sec >>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, >>>>> Cumulative CPU 37.1 sec >>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU >>>>> 41.74 sec >>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, >>>>> Cumulative CPU 41.74 sec >>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU >>>>> 46.32 sec >>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, >>>>> Cumulative CPU 46.32 sec >>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU >>>>> 50.93 sec >>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU >>>>> 55.55 sec >>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, >>>>> Cumulative CPU 50.93 sec >>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, >>>>> Cumulative CPU 55.55 sec >>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU >>>>> 60.25 sec >>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, >>>>> Cumulative CPU 60.25 sec >>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU >>>>> 64.86 sec >>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, >>>>> Cumulative CPU 64.86 sec >>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU >>>>> 69.41 sec >>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, >>>>> Cumulative CPU 69.41 sec >>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU >>>>> 74.06 sec >>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, >>>>> Cumulative CPU 74.06 sec >>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU >>>>> 78.72 sec >>>>> INFO : 2016-05-23 00:28:21,134 
Stage-1 map = 77%, reduce = 0%, >>>>> Cumulative CPU 78.72 sec >>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU >>>>> 83.32 sec >>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, >>>>> Cumulative CPU 83.32 sec >>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU >>>>> 87.9 sec >>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, >>>>> Cumulative CPU 87.9 sec >>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU >>>>> 92.52 sec >>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, >>>>> Cumulative CPU 92.52 sec >>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU >>>>> 97.35 sec >>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, >>>>> Cumulative CPU 97.35 sec >>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU >>>>> 99.6 sec >>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, >>>>> Cumulative CPU 99.6 sec >>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative >>>>> CPU 101.4 sec >>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec >>>>> Ended Job = job_1463956731753_0005 >>>>> MapReduce Jobs Launched: >>>>> Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS >>>>> Read: 5318569 HDFS Write: 46 SUCCESS >>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec >>>>> OK >>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, >>>>> Cumulative CPU 101.4 sec >>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec >>>>> INFO : Ended Job = job_1463956731753_0005 >>>>> INFO : MapReduce Jobs Launched: >>>>> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec >>>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS >>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec >>>>> INFO : Completed executing >>>>> 
command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>> Time taken: 142.525 seconds >>>>> INFO : OK >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | c0 | c1 | c2 | c3 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>> +-----+------------+---------------+-----------------------+--+ >>>>> 1 row selected (142.744 seconds) >>>>> >>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with >>>>> Hive on Spark, so you can obviously gain quite a bit by using Hive on >>>>> Spark. >>>>> >>>>> Please also note that I did not use any vendor's build for this purpose. >>>>> I compiled Spark 1.3.1 myself. >>>>> >>>>> HTH >>>>> >>>>> >>>>> Dr Mich Talebzadeh >>>>> >>>>> LinkedIn >>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>> >>>>> http://talebzadehmich.wordpress.com/ >>>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Ayan Guha >>>> >>>> >>> >> > >
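For what it's worth, the end-to-end timings reported by beeline in the transcripts above (142.744 s on MR versus 58.529 s on Spark) work out to roughly a 2.4x difference, and the Hive-on-Spark progress lines can be decoded mechanically. A small sketch follows; the regex and field names are my own, inferred from the "Job Progress Format" header printed in the log, not from any Hive API:

```python
import re

# End-to-end timings reported by beeline in the thread above.
mr_seconds = 142.744
spark_seconds = 58.529
print(f"Hive on Spark was ~{mr_seconds / spark_seconds:.2f}x faster on this query")

# Decode a Hive-on-Spark progress line. The format, per the log's own
# "Job Progress Format" header, is:
#   CurrentTime StageId_StageAttemptId:
#   SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
line = "2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1"
stage_re = re.compile(r"Stage-(\d+)_(\d+):\s+(\d+)(?:\(\+(\d+)(?:-(\d+))?\))?/(\d+)")
for stage, attempt, done, running, failed, total in stage_re.findall(line):
    print(f"stage {stage} (attempt {attempt}): {done}/{total} done, "
          f"{running or 0} running, {failed or 0} failed")
```

Note that the comparison is a single cold-cache run on one query, so, as Jörn says above, it should be read with the usual disclaimers about use-case dependence.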