Thank you very much. This is very informative. Do you know how to set these in hive-site.xml?
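[Editor's sketch] Settings like spark.master, spark.eventLog.dir and spark.executor.memory can be made permanent by adding them to hive-site.xml as ordinary property entries instead of issuing them per session. A minimal, unverified sketch — the master URL and event-log directory below are placeholders for your own values:

```xml
<!-- hive-site.xml fragment: persist the Spark-related session settings.
     spark://rhes564:7077 and /tmp/spark-events are placeholder values;
     the event-log directory must already exist. -->
<property>
  <name>spark.master</name>
  <value>spark://rhes564:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>/tmp/spark-events</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
```

Hive must be restarted for the file-level values to take effect; session-level `set` commands still override them.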
hive> set spark.master=<Spark Master URL>;
hive> set spark.eventLog.enabled=true;
hive> set spark.eventLog.dir=<Spark event log folder (must exist)>;
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;

If we set these in hive-site.xml I think we will be able to get through.

On Mon, Nov 23, 2015 at 3:05 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

I am looking at the set-up here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

First, this is about configuring Hive to work with Spark. This is my understanding:

1. Hive uses YARN as its resource manager regardless.

2. Hive uses MapReduce as its execution engine by default.

3. The execution engine can be changed to Spark at the configuration level. If you look at the Hive configuration file $HIVE_HOME/conf/hive-site.xml, you will see that the default is mr (MapReduce):

<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
  <description>
    Expects one of [mr, tez].
    Chooses execution engine. Options are: mr (Map reduce, default) or tez (hadoop 2 only)
  </description>
</property>

4. If you change that to spark and restart Hive, you will force Hive to use Spark as its engine. So the choice is to do it either at the configuration level or at the session level (i.e. set hive.execution.engine=spark;). You can do the same for the rest of the parameters, i.e. in hive-site.xml or at session level. Personally I would still want Hive to use the MR engine, so I will create spark-defaults.conf as mentioned.

5.
I then start Spark standalone, and that works fine:

hduser@rhes564::/usr/lib/spark> ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out

hduser@rhes564::/usr/lib/spark> more /usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
Spark Command: /usr/java/latest/bin/java -cp /usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip rhes564 --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/21 21:41:58 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/11/21 21:41:58 WARN Utils: Your hostname, rhes564 resolves to a loopback address: 127.0.0.1; using 50.140.197.217 instead (on interface eth0)
15/11/21 21:41:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/21 21:41:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
15/11/21 21:41:59 INFO SecurityManager: Changing view acls to: hduser
15/11/21 21:41:59 INFO SecurityManager: Changing modify acls to: hduser
15/11/21 21:41:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser); users with modify permissions: Set(hduser)
15/11/21 21:41:59 INFO Slf4jLogger: Slf4jLogger started
15/11/21 21:42:00 INFO Remoting: Starting remoting
15/11/21 21:42:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@rhes564:7077]
15/11/21 21:42:00 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/11/21 21:42:00 INFO Master: Starting Spark master at spark://rhes564:7077
15/11/21 21:42:00 INFO Master: Running Spark version 1.5.2
15/11/21 21:42:00 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/11/21 21:42:00 INFO MasterWebUI: Started MasterWebUI at http://50.140.197.217:8080
15/11/21 21:42:00 INFO Utils: Successfully started service on port 6066.
15/11/21 21:42:00 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/11/21 21:42:00 INFO Master: I have been elected leader! New state: ALIVE

6. Then I try to start the interactive spark-shell, and it fails with an error that I reported before:

hduser@rhes564::/usr/lib/spark/bin> ./spark-shell --master spark://rhes564:7077
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/23 09:33:56 WARN Utils: Your hostname, rhes564 resolves to a loopback address: 127.0.0.1; using 50.140.197.217 instead (on interface eth0)
15/11/23 09:33:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/23 09:33:57 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.server2.thrift.http.min.worker.threads does not exist
15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.mapjoin.optimized.keys does not exist
15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.mapjoin.lazy.hashtable does not exist
15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.server2.thrift.http.max.worker.threads does not exist
15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.server2.logging.operation.verbose does not exist
15/11/23 09:34:00 WARN HiveConf: HiveConf of name hive.optimize.multigroupby.common.distincts does not exist

java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------

That is where I am now. I have reported this to the Spark user group, but no luck yet.
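[Editor's sketch] The RuntimeException above states the cause directly: the HDFS scratch directory /tmp/hive is mode 700 and Hive needs it writable by other users. A hedged command sketch, not verified against this cluster — run as a user with HDFS superuser rights, and tighten the mode to taste (733 is often enough):

```shell
# /tmp/hive on HDFS currently has rwx------ (700) per the error message;
# widen it so Hive/Spark sessions of other users can write their scratch dirs.
hdfs dfs -chmod -R 777 /tmp/hive

# Inspect the resulting permissions:
hdfs dfs -ls /tmp
```

After this, the spark-shell startup above should get past the scratch-dir check; the earlier "HiveConf of name ... does not exist" warnings are harmless version-mismatch noise, not the failure.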
Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

From: Dasun Hegoda [mailto:dasunheg...@gmail.com]
Sent: 23 November 2015 07:05
To: user@hive.apache.org
Subject: Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

Anyone????

On Sat, Nov 21, 2015 at 1:32 PM, Dasun Hegoda <dasunheg...@gmail.com> wrote:

Thank you very much, but I would like to do the integration of these components myself rather than using a packaged distribution. I think I have come to the right place. Can you please kindly tell me the configuration steps to run Hive on Spark?

At least someone please elaborate these steps:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Because at the latter part of the guide, configurations are set in the Hive runtime shell, which is not permanent as far as I know.

Please help me to get this done. I am also planning to write a detailed guide with the configuration steps to run Hive on Spark, so that others can benefit from it and not be troubled like me.

Can someone please kindly tell me the configuration steps to run Hive on Spark?

On Sat, Nov 21, 2015 at 12:28 PM, Sai Gopalakrishnan <sai.gopalakrish...@aspiresys.com> wrote:

Hi everyone,

Thank you for your responses. I think Mich's suggestion is a great one; I will go with it. As Alan suggested, using the compactor in Hive should help out with managing the delta files.

@Dasun, pardon me for deviating from the topic. Regarding configuration, you could try a packaged distribution (Hortonworks, Cloudera or MapR) as Jörn Franke said. I use Hortonworks; it is open-source and compatible with Linux and Windows, provides detailed documentation for installation, and can be installed in less than a day provided you are all set with the hardware. http://hortonworks.com/hdp/downloads/

Regards,
Sai

From: Dasun Hegoda <dasunheg...@gmail.com>
Sent: Saturday, November 21, 2015 8:00 AM
To: user@hive.apache.org
Subject: Re: Hive on Spark - Hadoop 2 - Installation - Ubuntu

Hi Mich, Hi Sai, Hi Jörn,

Thank you very much for the information, but I think we are deviating from the original question: Hive on Spark on Ubuntu. Can you please kindly tell me the configuration steps?
On Fri, Nov 20, 2015 at 11:10 PM, Jörn Franke <jornfra...@gmail.com> wrote:

I think the most recent versions of Cloudera or Hortonworks should include all these components - try their sandboxes.

On 20 Nov 2015, at 12:54, Dasun Hegoda <dasunheg...@gmail.com> wrote:

Where can I get a Hadoop distribution containing these technologies? Link?

On Fri, Nov 20, 2015 at 5:22 PM, Jörn Franke <jornfra...@gmail.com> wrote:

I recommend using a Hadoop distribution containing these technologies. I think you also get other useful tools for your scenario, such as auditing using Sentry or Ranger.

On 20 Nov 2015, at 10:48, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Well,

"I'm planning to deploy Hive on Spark but I can't find the installation steps. I tried to read the official 'Hive on Spark' guide but it has problems. As an example, it says under 'Configuring Yarn' yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler but does not say where I should do it. Also, as per the guide, configurations are set in the Hive runtime shell, which is not permanent according to my knowledge."

You can do that in the yarn-site.xml file, which is normally under $HADOOP_HOME/etc/hadoop.

HTH

Mich Talebzadeh

From: Dasun Hegoda [mailto:dasunheg...@gmail.com]
Sent: 20 November 2015 09:36
To: user@hive.apache.org
Subject: Hive on Spark - Hadoop 2 - Installation - Ubuntu

Hi,

What I'm planning to do is develop a reporting platform using existing data. I have an existing RDBMS which has a large number of records, so I'm using (http://stackoverflow.com/questions/33635234/hadoop-2-7-spark-hive-jasperreports-scoop-architecuture):

- Sqoop - extract data from the RDBMS into Hadoop
- Hadoop - storage platform -> Deployment completed
- Hive - data warehouse
- Spark - real-time processing -> Deployment completed

I'm planning to deploy Hive on Spark but I can't find the installation steps. I tried to read the official 'Hive on Spark' guide [1] but it has problems. As an example, it says under 'Configuring Yarn' yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler but does not say where I should do it.
Also, as per the guide, configurations are set in the Hive runtime shell, which is not permanent according to my knowledge.

Given that, I read [2], but it does not have any steps.

Please provide me the steps to run Hive on Spark on Ubuntu as a production system.

[1]: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
[2]: http://stackoverflow.com/questions/26018306/how-to-configure-hive-to-use-spark

[Aspire Systems footer from Sai's e-mail:] This e-mail message and any attachments are for the sole use of the intended recipient(s) and may contain proprietary, confidential, trade secret or privileged information. Any unauthorized review, use, disclosure or distribution is prohibited and may be a violation of law. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

--
Regards,
Dasun Hegoda, Software Engineer
www.dasunhegoda.com | dasunheg...@gmail.com
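[Editor's sketch] On Mich's yarn-site.xml pointer earlier in the thread: the FairScheduler line the guide quotes would be placed in $HADOOP_HOME/etc/hadoop/yarn-site.xml as a property entry, along the lines of:

```xml
<!-- yarn-site.xml fragment: switch the ResourceManager to the Fair Scheduler,
     as the Hive-on-Spark getting-started guide suggests.
     Restart the ResourceManager after editing for this to take effect. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```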