Thanks. I think the problem is that the TEZ user group is exceptionally quiet. I have just sent an email to the Hive user group to see whether anyone has managed to build a vendor-independent version.
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:

> Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
>
> I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution.
>
> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi Jörn,
>
> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the TEZ user group kindly gave a hand, but I could not get very far (or maybe I did not make enough of an effort) making it work.
>
> That TEZ user group is very quiet as well.
>
> My understanding is that TEZ is MR with DAG execution, but of course Spark has both, plus in-memory capability.
>
> It would be interesting to see which version of TEZ works as the execution engine with Hive.
>
> Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, etc.), as I am sure you already know.
>
> Cheers,
>
> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Very interesting. Do you also plan a test with TEZ?
>>
>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I did another study of Hive using the Spark engine compared to Hive with MR.
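[Editorial note: the inline screenshot placeholders in this thread (<sales_partition.PNG>, <image.png>) no longer render. Purely as a hedged sketch of the kind of DDL described below (an ORC table partitioned by year and month, populated from a Sqoop-imported source), something like the following could be used; all table and column names here are hypothetical, not taken from the thread.]

```sql
-- Hypothetical sketch only: an ORC table partitioned by year and month,
-- of the kind described in the thread. Names are invented for illustration.
CREATE TABLE sales_orc (
    id     BIGINT,
    amount DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
STORED AS ORC;

-- Populate one of the 48 partitions from the Sqoop-imported staging table
-- (staging table name and columns are also hypothetical).
INSERT OVERWRITE TABLE sales_orc PARTITION (year = 2016, month = 5)
SELECT id, amount
FROM   sales_staging
WHERE  year(sale_date) = 2016 AND month(sale_date) = 5;
```

With hive.exec.dynamic.partition enabled, all 48 partitions can be loaded in a single INSERT rather than one statement per partition.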
>> Basically, I took the original table imported using Sqoop and created and populated a new ORC table, partitioned by year and month into 48 partitions, as follows:
>>
>> <sales_partition.PNG>
>>
>> Connections use JDBC via beeline. With MR, each partition takes an average of 17 minutes, as seen below. And that is just one individual partition; there are 48 of them.
>>
>> In contrast, doing the same operation with the Spark engine took 10 minutes all-inclusive. I just gave up on MR. You can see the StartTime and FinishTime below:
>>
>> <image.png>
>>
>> This by no means indicates that Spark is much better than MR, but it shows that very good results can be achieved using the Spark engine.
>>
>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We use Hive as the database and Spark as an all-purpose query tool.
>>>
>>> Whether Hive is the right database for the purpose, or whether one is better off with something like Phoenix on HBase, well, the answer is that it depends and your mileage may vary.
>>>
>>> So, fit for purpose.
>>>
>>> Ideally, one wants to use the fastest method that gets the results. How fast is bounded by our SLA agreements in production, and that saves us from unnecessary further work, as we technologists like to play around.
>>>
>>> So in short, we use Spark most of the time and use Hive as the backend engine for data storage, mainly ORC tables.
>>>
>>> We use Hive on Spark; with Hive 2 on Spark 1.3.1, for now, we have a combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but at the moment that is one of my ongoing projects.
>>> We do not use any vendor's products, as that lets us avoid being tied down to yet another vendor after years of SAP, Oracle, and MS dependency. Besides, there is some politics going on, with one vendor promoting Tez and another Spark as a backend. That is fine, but obviously we prefer to make an independent assessment ourselves.
>>>
>>> My gut feeling is that one needs to look at the use case. Recently we had to import a very large table from Oracle to Hive and decided to use Spark 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC connection with a temporary table, and it was good. We could have used Sqoop, but decided to settle on Spark, so it all depends on the use case.
>>>
>>> HTH
>>>
>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for the very useful stats.
>>>>
>>>> Do you have any benchmark for using Spark as the backend engine for Hive vs. using the Spark Thrift Server (and running Spark code for Hive queries)? We are using the latter, but it would be very useful to remove the Thrift Server, if we can.
>>>>
>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, as well as different types of computation. Furthermore, one could compare Spark and Tez+LLAP as execution engines. My gut feeling is that each one can be justified by different use cases.
>>>>> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent single-row lookups. They are also not good for frequently writing small amounts of data (e.g. sensor data); here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
>>>>> Finally, even if you have a lot of data, you need to consider whether you always have to process all of it. For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the total data.
>>>>>
>>>>> As always, it depends :)
>>>>>
>>>>> Best regards
>>>>>
>>>>> P.S.: At least Hortonworks has in their distribution Spark 1.5 with Hive 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage bringing the two together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring the two together.
>>>>>
>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have done a number of extensive tests using spark-shell with the Hive DB and ORC tables.
>>>>>
>>>>> Now, one issue that we typically face is, and I quote:
>>>>>
>>>>> "Spark is fast as it uses memory and DAG. Great, but when we save data it is not fast enough."
>>>>>
>>>>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old MapReduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
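[Editorial note: the engine switch described above is a per-session Hive setting; the MR form of the command appears verbatim in the beeline transcript later in the thread. A minimal sketch:]

```sql
-- Run this session's Hive queries on the Spark engine
-- (assumes a Hive >= 0.14 build with compatible Spark libraries available).
SET hive.execution.engine=spark;

-- Switch back to the (deprecated) MapReduce engine for comparison.
SET hive.execution.engine=mr;
```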
>>>>> I have made some comparisons on this set-up and I am sure some of you will find them useful.
>>>>>
>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>>> The version of Hive I use is Hive 2.
>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
>>>>>
>>>>> As an example, I am using Hive on the Spark engine to find the min and max of IDs for a table with 1 billion rows:
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>> INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>>>>> INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
>>>>> INFO : Total jobs = 1
>>>>> INFO : Launching Job 1 out of 1
>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>>
>>>>> Query Hive on Spark job[0] stages: 0, 1
>>>>> Status: Running (Hive on Spark job[0])
>>>>> Job Progress Format:
>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22          Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22     Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22     Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22     Stage-1_0: 0/1
>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished    Stage-1_0: 0(+1)/1
>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished    Stage-1_0: 1/1 Finished
>>>>> Status: Finished successfully in 53.25 seconds
>>>>> INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>>>>> OK
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> 1 row selected (58.529 seconds)
>>>>>
>>>>> 58 seconds on a first run with a cold cache is pretty good.
>>>>>
>>>>> And let us compare it with running the same query on the map-reduce engine:
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> No rows affected (0.007 seconds)
>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>> Total jobs = 1
>>>>> Launching Job 1 out of 1
>>>>> Number of reduce tasks determined at compile time: 1
>>>>> In order to change the average load for a reducer (in bytes):
>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> In order to limit the maximum number of reducers:
>>>>>   set hive.exec.reducers.max=<number>
>>>>> In order to set a constant number of reducers:
>>>>>   set mapreduce.job.reduces=<number>
>>>>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>>>>> INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>>>>> WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>>>>> INFO : number of splits:22
>>>>> INFO : Submitting tokens for job: job_1463956731753_0005
>>>>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>> Ended Job = job_1463956731753_0005
>>>>> MapReduce Jobs Launched:
>>>>> Stage-Stage-1: Map: 22  Reduce: 1  Cumulative CPU: 101.4 sec  HDFS Read: 5318569  HDFS Write: 46  SUCCESS
>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>> INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>>>>> OK
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | c0  |     c1     |      c2       |          c3           |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>> +-----+------------+---------------+-----------------------+--+
>>>>> 1 row selected (142.744 seconds)
>>>>>
>>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark, so you can obviously gain a great deal by using Hive on Spark.
>>>>>
>>>>> Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com/
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
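[Editorial note: as a closing sanity check on the result row that both engines returned above: for an id column whose values uniformly cover 1..10^8 (each value equally represented), the closed-form mean and population standard deviation, which is what Hive's stddev computes, reproduce the reported figures:]

```latex
\text{avg}(id) = \frac{10^{8}+1}{2} = 5.00000005 \times 10^{7},
\qquad
\text{stddev}(id) = \sqrt{\frac{(10^{8})^{2}-1}{12}} \approx 2.8867513 \times 10^{7}
```

These match the reported c2 = 5.00000005E7 and c3 = 2.8867513459481288E7, confirming that the Spark and MR engines computed the same aggregates and differed only in elapsed time.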