Have a look at this thread.

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 23 May 2016 at 09:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Timur and everyone,

I will answer your first question, as it is very relevant.

1) How do you make two versions of Spark live together on the same cluster (library clashes, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well, so we may have three Spark installations simultaneously.

There are two distinct points here.

First, using Spark as a query engine. That is BAU, and most forum members use it every day. You run Spark with Standalone, Yarn or Mesos as the cluster manager. You start the master, which manages resources, and you start the slaves to create workers.

You work with Spark through spark-shell or spark-sql, or submit jobs through spark-submit etc. You may or may not use Hive as your database; you may use HBase via Phoenix etc.

If you choose to use Hive as your database, ensure that the Hive APIs are installed (meaning Hive is installed) on every host of the cluster, including your master host. In $SPARK_HOME/conf, you create a soft link to the Hive configuration:

cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml
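For a fresh host, a minimal sketch of creating that soft link (this assumes the Hive configuration lives under /usr/lib/hive/conf, as it does on my hosts; adjust the paths for your own layout):

# run on every host that runs Spark, including the master
# assumes $SPARK_HOME is set and Hive is installed under /usr/lib/hive
ln -s /usr/lib/hive/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml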
Second, using Spark as the execution engine for Hive. In hive-site.xml you can define all the parameters needed for Spark connectivity. Remember, we are making Hive use the Spark 1.3.1 engine. WE ARE NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or workers for Spark 1.3.1! It is just an execution engine, like mr etc.

Let us look at how we do that in hive-site.xml. Note the settings for hive.execution.engine=spark and spark.home below. They tell Hive to use Spark 1.3.1 as the execution engine. You just install the Spark 1.3.1 binary download on the host, in /usr/lib/spark-1.3.1-bin-hadoop2.6.

In hive-site.xml, you set the properties:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Expects one of [mr, tez, spark].
    Chooses execution engine. Options are: mr (Map Reduce, default), tez, spark. While MR
    remains the default engine for historical reasons, it is itself a historical engine
    and is deprecated in the Hive 2 line. It may be removed without further warning.
  </description>
</property>

<property>
  <name>spark.home</name>
  <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
  <description>something</description>
</property>

<property>
  <name>hive.merge.sparkfiles</name>
  <value>false</value>
  <description>Merge small files at the end of a Spark DAG Transformation</description>
</property>

<property>
  <name>hive.spark.client.future.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for requests from Hive client to remote Spark driver.
  </description>
</property>

<property>
  <name>hive.spark.job.monitor.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for job monitor to get Spark job state.
  </description>
</property>

<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>1000ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for remote Spark driver in connecting back to Hive client.
  </description>
</property>

<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>90000ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
  </description>
</property>

<property>
  <name>hive.spark.client.secret.bits</name>
  <value>256</value>
  <description>Number of bits of randomness in the generated secret for communication between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
</property>

<property>
  <name>hive.spark.client.rpc.threads</name>
  <value>8</value>
  <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
</property>

And other settings as well.
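Once the rest of the setup below is complete, a quick hedged check that Hive has picked up the engine is to echo the property back from beeline (the JDBC URL and user are from my environment; substitute your own):

beeline -u jdbc:hive2://rhes564:10010/default -n hduser \
        -e "set hive.execution.engine;"
# the output should include: hive.execution.engine=spark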
That was the Hive stuff for your Spark BAU; so there are two distinct things. Now, going to Hive itself, you will need to add the correct Spark assembly jar file for your Hadoop build. These are called

spark-assembly-x.y.z-hadoop2.4.0.jar

where x.y.z in this case is 1.3.1, so the assembly file is

spark-assembly-1.3.1-hadoop2.4.0.jar

You add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib:

ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar

To get that assembly, you need to compile Spark from source, excluding the bundled Hadoop and Hive dependencies:

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
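For completeness, a sketch of the whole build-and-deploy sequence (the source download URL and the dist/lib output path are my assumptions for a 1.3.1 source build; verify both against your mirror and your build output):

# fetch and unpack the Spark 1.3.1 source
wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1.tgz
tar -xzf spark-1.3.1.tgz && cd spark-1.3.1
# build a distribution with Hadoop provided and without Hive
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
# place the resulting assembly where Hive can load it
cp dist/lib/spark-assembly-1.3.1-hadoop2.4.0.jar $HIVE_HOME/lib/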
With that, Hive uses the Spark engine by default. If you want to use mr in Hive, you just do:

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (0.007 seconds)

With regard to the second question:

2) How stable is such a construction on INSERT / UPDATE / CTAS operations? Any problems with writing into specific tables / directories, ORC / Parquet peculiarities, memory / timeout parameter tuning?

With this setup, that is Hive using Spark as the execution engine, my tests look OK. Basically I can do whatever I do with Hive on the map-reduce engine. The caveat, as usual, is the amount of memory used by Spark for in-memory work. I am afraid that resource constraint will be there no matter how you deploy Spark.

3) How stable is such a construction in a multi-user / multi-tenant production environment, when several people make different queries simultaneously?

This is subjective to how you are going to deploy it and how scalable it is. Your mileage varies and you really need to test it for yourself to find out.

Also worth noting that with a Spark app using Hive ORC tables you may have issues with ORC tables defined as transactional. You do not have that issue with Hive on the Spark engine. There are certainly limitations with the HiveQL constructs Spark supports; for example, some properties are not implemented. Case in point with spark-sql:

spark-sql> CREATE TEMPORARY TABLE tmp as select * from oraclehadoop.sales limit 10;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
You are likely trying to use an unsupported Hive feature.

However, there is no issue with Hive on the Spark engine:

set hive.execution.engine=spark;
0: jdbc:hive2://rhes564:10010/default> CREATE TEMPORARY TABLE tmp as select * from oraclehadoop.sales limit 10;
Starting Spark Job = d87e6c68-03f1-4c37-a9d4-f77e117039a4
Query Hive on Spark job[0] stages:
INFO  : Completed executing command(queryId=hduser_20160523090757_a474efb8-cea8-473e-8899-60bc7934a887); Time taken: 43.894 seconds
INFO  : OK
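If you want to repeat that check non-interactively, a hedged one-liner from the shell (the JDBC URL, user and table are from my setup, and it assumes your beeline runs the successive -e statements in one session; substitute your own values):

beeline -u jdbc:hive2://rhes564:10010/default -n hduser \
  -e "set hive.execution.engine=spark;" \
  -e "CREATE TEMPORARY TABLE tmp AS SELECT * FROM oraclehadoop.sales LIMIT 10;" \
  -e "SELECT COUNT(1) FROM tmp;"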
HTH

Dr Mich Talebzadeh


On 23 May 2016 at 05:57, Mohanraj Ragupathiraj <mohanaug...@gmail.com> wrote:

Great comparison!! Thanks.

On Mon, May 23, 2016 at 7:42 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

I have done a number of extensive tests using spark-shell with a Hive DB and ORC tables.

Now, one issue that we typically face is, and I quote: "Spark is fast as it uses memory and DAGs. Great, but when we save data it is not fast enough."

OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old map-reduce engine for Hive. In a nutshell, you gain speed in both querying and storage.

I have made some comparisons on this setup and I am sure some of you will find them useful.

The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
The version of Hive I use is Hive 2.
The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).

As an example, I am using Hive on the Spark engine to find the min, max, average and standard deviation of the IDs for a table with 100 million rows (the IDs run from 1 to 100,000,000):

0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
INFO  : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
INFO  : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode

Query Hive on Spark job[0] stages: 0, 1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
Status: Finished successfully in 53.25 seconds
OK
INFO  : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
INFO  : OK
+-----+------------+---------------+-----------------------+--+
| c0  |     c1     |      c2       |          c3           |
+-----+------------+---------------+-----------------------+--+
| 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
+-----+------------+---------------+-----------------------+--+
1 row selected (58.529 seconds)

58 seconds for a first run with a cold cache is pretty good.
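As a sanity check on those figures: assuming the IDs are exactly the integers 1 through 100,000,000 (which the min/max/avg output indicates), all four statistics can be derived analytically, since a discrete uniform sequence 1..N has mean (N+1)/2 and population variance (N^2-1)/12. A quick awk check:

awk 'BEGIN {
  N = 100000000                              # IDs 1..N, sequential
  printf "min=%d max=%d avg=%.1f stddev=%.6f\n",
         1, N, (N + 1) / 2, sqrt((N * N - 1) / 12)
}'
# prints: min=1 max=100000000 avg=50000000.5 stddev=28867513.459481

This agrees with the c2 and c3 columns above (5.00000005E7 and 2.8867513459481288E7), so the query returns exactly what the arithmetic predicts.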
And let us compare it with running the same query on the map-reduce engine:

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (0.007 seconds)
0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:22
INFO  : Submitting tokens for job: job_1463956731753_0005
Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
Ended Job = job_1463956731753_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 22  Reduce: 1  Cumulative CPU: 101.4 sec  HDFS Read: 5318569  HDFS Write: 46  SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
OK
INFO  : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
INFO  : OK
+-----+------------+---------------+-----------------------+--+
| c0  |     c1     |      c2       |          c3           |
+-----+------------+---------------+-----------------------+--+
| 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
+-----+------------+---------------+-----------------------+--+
1 row selected (142.744 seconds)

OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark, roughly a 2.4x speed-up for the same query and the same result. So you can obviously gain pretty well by using Hive on Spark.

Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com/

--
Thanks and Regards
Mohan
VISA Pte Limited, Singapore.
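If you want to reproduce this comparison on your own data, a hedged sketch that times the same aggregate under both engines from the shell (the JDBC URL, user, and table are from my environment; substitute yours):

QUERY="select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy"
for ENGINE in spark mr; do
  echo "=== engine: $ENGINE ==="
  time beeline -u jdbc:hive2://rhes564:10010/default -n hduser \
       -e "set hive.execution.engine=$ENGINE;" -e "$QUERY"
done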