Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution. > On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi Jorn, > > I started building apache-tez-0.8.2 but got a few errors. A couple of guys from > the TEZ user group kindly gave a hand but I could not get very far (or maybe I > did not make enough of an effort) making it work. > > That TEZ user group is very quiet as well. > > My understanding is TEZ is MR with DAG, but of course Spark has both plus > in-memory capability. > > It would be interesting to see what version of TEZ works as an execution engine > with Hive. > > Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, > etc.), as I am sure you already know. > > Cheers, > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > >> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote: >> Very interesting. Do you also plan a test with TEZ? >> >>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>> >>> Hi, >>> >>> I did another study of Hive using the Spark engine compared to Hive with MR. >>> >>> Basically I took the original table imported using Sqoop and created and >>> populated a new ORC table partitioned by year and month into 48 partitions >>> as follows: >>> >>> <sales_partition.PNG> >>> >>> Connections use JDBC via beeline. Now for each partition using MR it takes >>> an average of 17 minutes, as seen below for each PARTITION. Now that is >>> just an individual partition and there are 48 partitions. >>> >>> In contrast, doing the same operation with the Spark engine took 10 minutes all >>> inclusive. I just gave up on MR. 
You can see the StartTime and FinishTime >>> from below >>> >>> <image.png> >>> >>> This by no means indicates that Spark is much better than MR, but it shows >>> that some very good results can be achieved using the Spark engine. >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: >>>> Hi, >>>> >>>> We use Hive as the database and use Spark as an all-purpose query tool. >>>> >>>> Whether Hive is the right database for the purpose, or whether one is better off with >>>> something like Phoenix on HBase, well, the answer is it depends and your >>>> mileage varies. >>>> >>>> So fit for purpose. >>>> >>>> Ideally what one wants is to use the fastest method to get the results. How >>>> fast is confined by our SLA agreements in production, and that saves us >>>> from unnecessary further work, as we technologists like to play around. >>>> >>>> So in short, we use Spark most of the time and use Hive as the backend >>>> engine for data storage, mainly ORC tables. >>>> >>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 for now we have a >>>> combination that works. Granted, it would help to use Hive 2 on Spark 1.6.1, but >>>> at the moment that is one of my projects. >>>> >>>> We do not use any vendor's products, as that enables us to avoid being tied down >>>> to yet another vendor after years of SAP, Oracle and MS dependency. >>>> Besides, there is some politics going on with one vendor promoting >>>> Tez and another Spark as a backend. That is fine, but obviously we prefer >>>> an independent assessment ourselves. >>>> >>>> My gut feeling is that one needs to look at the use case. Recently we had >>>> to import a very large table from Oracle to Hive and decided to use Spark >>>> 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. 
We just used a JDBC >>>> connection with a temp table and it was good. We could have used Sqoop but >>>> decided to settle for Spark, so it all depends on the use case. >>>> >>>> HTH >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote: >>>>> Hi >>>>> >>>>> Thanks for the very useful stats. >>>>> >>>>> Did you have any benchmark for using Spark as the backend engine for Hive vs >>>>> using the Spark thrift server (and running Spark code for Hive queries)? We are >>>>> using the latter, but it would be very useful to remove the thrift server, if we can. >>>>> >>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> Hi Mich, >>>>>> >>>>>> I think these comparisons are useful. One interesting aspect could be >>>>>> hardware scalability in this context. Additionally, different types of >>>>>> computations. Furthermore, one could compare Spark and Tez+LLAP as >>>>>> execution engines. I have the gut feeling that each one can be >>>>>> justified by different use cases. >>>>>> Nevertheless, there should always be a disclaimer for such comparisons, >>>>>> because Spark and Hive are not good for a lot of concurrent lookups of >>>>>> single rows. They are not good for frequent writes of small amounts of >>>>>> data (e.g. sensor data). Here HBase could be more interesting. Other use >>>>>> cases can justify graph databases, such as Titan, or text analytics/ >>>>>> data matching using Solr on Hadoop. >>>>>> Finally, even if you have a lot of data, you need to think about whether you always >>>>>> have to process everything. For instance, I have found valid use cases >>>>>> in practice where we decided to evaluate 10 machine learning models in >>>>>> parallel on only a sample of the data and only evaluate the "winning" model >>>>>> on the total of the data. 
>>>>>> >>>>>> As always, it depends :) >>>>>> >>>>>> Best regards >>>>>> >>>>>> P.S.: at least Hortonworks has in their distribution Spark 1.5 with Hive >>>>>> 1.2 and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how >>>>>> to manage bringing both together. You may also check Apache Bigtop >>>>>> (a vendor-neutral distribution) for how they managed to bring both together. >>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I have done a number of extensive tests using Spark-shell with Hive DB >>>>>>> and ORC tables. >>>>>>> >>>>>>> Now one issue that we typically face is, and I quote: >>>>>>> >>>>>>> Spark is fast as it uses Memory and DAG. Great, but when we save data it >>>>>>> is not fast enough. >>>>>>> >>>>>>> OK, but there is a solution now. If you use Spark with Hive and you are >>>>>>> on a decent version of Hive (>= 0.14), then you can also deploy Spark as the >>>>>>> execution engine for Hive. That will make your application run pretty >>>>>>> fast, as you no longer rely on the old Map-Reduce engine for Hive. In a >>>>>>> nutshell, you gain speed in both querying and storage. >>>>>>> >>>>>>> I have made some comparisons on this set-up and I am sure some of you >>>>>>> will find it useful. >>>>>>> >>>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is >>>>>>> 1.6. >>>>>>> The version of Hive I use is Hive 2. >>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, >>>>>>> and frankly Spark 1.3.1 as an execution engine is adequate (until we >>>>>>> sort out the Hadoop libraries mismatch). 
>>>>>>> >>>>>>> An example: I am using the Hive on Spark engine to find the min and max of >>>>>>> IDs for a table with 100 million rows: >>>>>>> >>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), >>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy; >>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>>>> >>>>>>> >>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151 >>>>>>> >>>>>>> INFO : Completed compiling >>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>>>> Time taken: 1.911 seconds >>>>>>> INFO : Executing >>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): >>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>>>> INFO : Query ID = >>>>>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006 >>>>>>> INFO : Total jobs = 1 >>>>>>> INFO : Launching Job 1 out of 1 >>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>>>> >>>>>>> Query Hive on Spark job[0] stages: >>>>>>> 0 >>>>>>> 1 >>>>>>> Status: Running (Hive on Spark job[0]) >>>>>>> Job Progress Format >>>>>>> CurrentTime StageId_StageAttemptId: >>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>>>> [StageCost] >>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>>>> INFO : >>>>>>> Query Hive on Spark job[0] stages: >>>>>>> INFO : 0 >>>>>>> INFO : 1 >>>>>>> INFO : >>>>>>> Status: Running (Hive on Spark job[0]) >>>>>>> INFO : Job Progress Format >>>>>>> CurrentTime StageId_StageAttemptId: >>>>>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>>>>>> [StageCost] >>>>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1 >>>>>>> INFO : 2016-05-23 00:21:20,070 
Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1 >>>>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1 >>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: >>>>>>> 0(+1)/1 >>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 >>>>>>> Finished >>>>>>> Status: Finished successfully in 53.25 seconds >>>>>>> OK >>>>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished >>>>>>> Stage-1_0: 0(+1)/1 >>>>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished >>>>>>> Stage-1_0: 1/1 Finished >>>>>>> INFO : Status: Finished successfully in 53.25 seconds >>>>>>> INFO : Completed executing >>>>>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); >>>>>>> Time taken: 56.337 seconds >>>>>>> INFO : OK >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | c0 | c1 | c2 | c3 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> 1 row selected (58.529 seconds) >>>>>>> >>>>>>> 58 seconds first run with cold cache is pretty good >>>>>>> >>>>>>> And let us compare it with running the same query on map-reduce engine >>>>>>> >>>>>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr; >>>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the >>>>>>> future versions. Consider using a different execution engine (i.e. >>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> No rows affected (0.007 seconds) >>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), >>>>>>> max(id),avg(id), stddev(id) from oraclehadoop.dummy; >>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>>>> the future versions. Consider using a different execution engine (i.e. 
>>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>>>> Total jobs = 1 >>>>>>> Launching Job 1 out of 1 >>>>>>> Number of reduce tasks determined at compile time: 1 >>>>>>> In order to change the average load for a reducer (in bytes): >>>>>>> set hive.exec.reducers.bytes.per.reducer=<number> >>>>>>> In order to limit the maximum number of reducers: >>>>>>> set hive.exec.reducers.max=<number> >>>>>>> In order to set a constant number of reducers: >>>>>>> set mapreduce.job.reduces=<number> >>>>>>> Starting Job = job_1463956731753_0005, Tracking URL = >>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>>>> job_1463956731753_0005 >>>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of >>>>>>> reducers: 1 >>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>>>> INFO : Compiling >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>>>> INFO : Semantic Analysis Completed >>>>>>> INFO : Returning Hive schema: >>>>>>> Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), >>>>>>> FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, >>>>>>> type:double, comment:null), FieldSchema(name:c3, type:double, >>>>>>> comment:null)], properties:null) >>>>>>> INFO : Completed compiling >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>>>> Time taken: 0.144 seconds >>>>>>> INFO : Executing >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): >>>>>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy >>>>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>>>> the future versions. Consider using a different execution engine (i.e. 
>>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be >>>>>>> available in the future versions. Consider using a different execution >>>>>>> engine (i.e. spark, tez) or using Hive 1.X releases. >>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>>>>>> the future versions. Consider using a different execution engine (i.e. >>>>>>> spark, tez) or using Hive 1.X releases. >>>>>>> INFO : Query ID = >>>>>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc >>>>>>> INFO : Total jobs = 1 >>>>>>> INFO : Launching Job 1 out of 1 >>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode >>>>>>> INFO : Number of reduce tasks determined at compile time: 1 >>>>>>> INFO : In order to change the average load for a reducer (in bytes): >>>>>>> INFO : set hive.exec.reducers.bytes.per.reducer=<number> >>>>>>> INFO : In order to limit the maximum number of reducers: >>>>>>> INFO : set hive.exec.reducers.max=<number> >>>>>>> INFO : In order to set a constant number of reducers: >>>>>>> INFO : set mapreduce.job.reduces=<number> >>>>>>> WARN : Hadoop command-line option parsing not performed. Implement the >>>>>>> Tool interface and execute your application with ToolRunner to remedy >>>>>>> this. 
>>>>>>> INFO : number of splits:22 >>>>>>> INFO : Submitting tokens for job: job_1463956731753_0005 >>>>>>> INFO : The url to track the job: >>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = >>>>>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/ >>>>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>>>>>> job_1463956731753_0005 >>>>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; >>>>>>> number of reducers: 1 >>>>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0% >>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU >>>>>>> 4.56 sec >>>>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, >>>>>>> Cumulative CPU 4.56 sec >>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU >>>>>>> 9.17 sec >>>>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, >>>>>>> Cumulative CPU 9.17 sec >>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU >>>>>>> 14.04 sec >>>>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, >>>>>>> Cumulative CPU 14.04 sec >>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU >>>>>>> 18.64 sec >>>>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, >>>>>>> Cumulative CPU 18.64 sec >>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU >>>>>>> 23.25 sec >>>>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, >>>>>>> Cumulative CPU 23.25 sec >>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU >>>>>>> 27.84 sec >>>>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, >>>>>>> Cumulative CPU 27.84 sec >>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU >>>>>>> 32.56 sec >>>>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 
32%, reduce = 0%, >>>>>>> Cumulative CPU 32.56 sec >>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU >>>>>>> 37.1 sec >>>>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, >>>>>>> Cumulative CPU 37.1 sec >>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU >>>>>>> 41.74 sec >>>>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, >>>>>>> Cumulative CPU 41.74 sec >>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU >>>>>>> 46.32 sec >>>>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, >>>>>>> Cumulative CPU 46.32 sec >>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU >>>>>>> 50.93 sec >>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU >>>>>>> 55.55 sec >>>>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, >>>>>>> Cumulative CPU 50.93 sec >>>>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, >>>>>>> Cumulative CPU 55.55 sec >>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU >>>>>>> 60.25 sec >>>>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, >>>>>>> Cumulative CPU 60.25 sec >>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU >>>>>>> 64.86 sec >>>>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, >>>>>>> Cumulative CPU 64.86 sec >>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU >>>>>>> 69.41 sec >>>>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, >>>>>>> Cumulative CPU 69.41 sec >>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU >>>>>>> 74.06 sec >>>>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, >>>>>>> Cumulative CPU 74.06 sec >>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU >>>>>>> 78.72 sec >>>>>>> INFO : 2016-05-23 00:28:21,134 
Stage-1 map = 77%, reduce = 0%, >>>>>>> Cumulative CPU 78.72 sec >>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU >>>>>>> 83.32 sec >>>>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, >>>>>>> Cumulative CPU 83.32 sec >>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU >>>>>>> 87.9 sec >>>>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, >>>>>>> Cumulative CPU 87.9 sec >>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU >>>>>>> 92.52 sec >>>>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, >>>>>>> Cumulative CPU 92.52 sec >>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU >>>>>>> 97.35 sec >>>>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, >>>>>>> Cumulative CPU 97.35 sec >>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative >>>>>>> CPU 99.6 sec >>>>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, >>>>>>> Cumulative CPU 99.6 sec >>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative >>>>>>> CPU 101.4 sec >>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec >>>>>>> Ended Job = job_1463956731753_0005 >>>>>>> MapReduce Jobs Launched: >>>>>>> Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS >>>>>>> Read: 5318569 HDFS Write: 46 SUCCESS >>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec >>>>>>> OK >>>>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, >>>>>>> Cumulative CPU 101.4 sec >>>>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 >>>>>>> msec >>>>>>> INFO : Ended Job = job_1463956731753_0005 >>>>>>> INFO : MapReduce Jobs Launched: >>>>>>> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec >>>>>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS >>>>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 
msec >>>>>>> INFO : Completed executing >>>>>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); >>>>>>> Time taken: 142.525 seconds >>>>>>> INFO : OK >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | c0 | c1 | c2 | c3 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> | 1 | 100000000 | 5.00000005E7 | 2.8867513459481288E7 | >>>>>>> +-----+------------+---------------+-----------------------+--+ >>>>>>> 1 row selected (142.744 seconds) >>>>>>> >>>>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds >>>>>>> with Hive on Spark. So you can obviously gain considerably by using Hive >>>>>>> on Spark. >>>>>>> >>>>>>> Please also note that I did not use any vendor's build for this >>>>>>> purpose. I compiled Spark 1.3.1 myself. >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> >>>>>>> Dr Mich Talebzadeh >>>>>>> >>>>>>> LinkedIn >>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>>>> >>>>>>> http://talebzadehmich.wordpress.com/ >>>>> >>>>> >>>>> >>>>> -- >>>>> Best Regards, >>>>> Ayan Guha >
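[Editor's note: the aggregates reported in both runs above can be sanity-checked analytically. The identical results (min = 1, max = 100000000, avg = 5.00000005E7, stddev = 2.8867513459481288E7) are exactly what one gets for an id column holding the integers 1..N with N = 100,000,000: the mean of 1..N is (N+1)/2 and the population standard deviation (which Hive's stddev computes) is sqrt((N^2-1)/12). A minimal sketch in plain Python, assuming the ids are the integers 1..N as the query output suggests:]

```python
import math

# Assumption: oraclehadoop.dummy holds ids 1..N, inferred from the query output
N = 100_000_000

min_id = 1
max_id = N
mean = (N + 1) / 2                    # average of 1..N
stddev = math.sqrt((N * N - 1) / 12)  # population std dev of 1..N

print(min_id, max_id, mean, stddev)
# mean = 5.00000005E7 and stddev ~ 2.8867513E7, matching both engines' output
```

[This confirms that Hive on Spark and Hive on MR produced identical, correct results, so the 58 s vs 142 s comparison is over the same work.]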