I confirm: Christopher was very kind in helping me out here. The solution presented in the linked doc worked perfectly. IMO, it should be linked in the official Spark documentation.
Thanks again,
Roberto

> On 20 Jun 2015, at 19:25, Bozeman, Christopher <bozem...@amazon.com> wrote:
>
> We worked it out. There were multiple items (like the location of the remote
> metastore and db user auth) to sort out to make HiveContext happy in
> yarn-cluster mode.
>
> For reference:
> https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/using-hivecontext-yarn-cluster.md
>
> -Christopher Bozeman
>
>
> On Jun 20, 2015, at 7:24 AM, Andrew Lee <alee...@hotmail.com> wrote:
>
>> Hi Roberto,
>>
>> I'm not an EMR person, but it looks like the -h option is deploying the
>> necessary datanucleus JARs for you. The requirements for HiveContext are
>> hive-site.xml and the datanucleus JARs: as long as these two are present,
>> and Spark is compiled with -Phive, it should work.
>>
>> spark-shell runs in yarn-client mode. I'm not sure whether your other
>> application runs under the same mode or a different one. Try specifying
>> yarn-client mode and see if you get the same result as with spark-shell.
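>>
>> For instance, something along these lines (just a rough sketch, not a
>> definitive recipe: the main class and application JAR are placeholders,
>> the datanucleus paths are the ones your EMR install puts under
>> /home/hadoop/spark/classpath/hive, and I'm assuming hive-site.xml lives
>> under /home/hadoop/hive/conf, which may differ on your AMI):
>>
>>   spark-submit --master yarn-client \
>>     --files /home/hadoop/hive/conf/hive-site.xml \
>>     --jars /home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar,/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar,/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar \
>>     --class <your.driver.MainClass> your-driver.jar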
>>
>> From: roberto.coluc...@gmail.com
>> Date: Wed, 10 Jun 2015 14:32:04 +0200
>> Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>> To: user@spark.apache.org
>>
>> Hi!
>>
>> I'm struggling with an issue with Spark 1.3.1 running on YARN, on an AWS EMR
>> cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, with
>> Hive 0.13 already installed and configured on the cluster, Hadoop 2.4,
>> etc.). I use the AWS emr-bootstrap-action "install-spark"
>> (https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with
>> the option/version "-v1.3.1e" so as to get the latest Spark for EMR
>> installed and available.
>>
>> I also have a simple Spark Streaming driver in my project. The driver is
>> part of a larger Maven project; in the pom.xml I'm currently using:
>>
>> [...]
>>
>> <scala.binary.version>2.10</scala.binary.version>
>> <scala.version>2.10.4</scala.version>
>> <java.version>1.7</java.version>
>> <spark.version>1.3.1</spark.version>
>> <hadoop.version>2.4.1</hadoop.version>
>>
>> [....]
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-streaming_${scala.binary.version}</artifactId>
>>   <version>${spark.version}</version>
>>   <scope>provided</scope>
>>   <exclusions>
>>     <exclusion>
>>       <groupId>org.apache.hadoop</groupId>
>>       <artifactId>hadoop-client</artifactId>
>>     </exclusion>
>>   </exclusions>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.apache.hadoop</groupId>
>>   <artifactId>hadoop-client</artifactId>
>>   <version>${hadoop.version}</version>
>>   <scope>provided</scope>
>> </dependency>
>>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-hive_${scala.binary.version}</artifactId>
>>   <version>${spark.version}</version>
>>   <scope>provided</scope>
>> </dependency>
>>
>> In fact, at compile and build time everything works just fine if, in my
>> driver, I have:
>>
>> -------------
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.streaming.StreamingContext
>>
>> val sparkConf = new SparkConf()
>>   .setAppName(appName)
>>   .set("spark.local.dir", "/tmp/" + appName)
>>   .set("spark.streaming.unpersist", "true")
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))
>>
>> val sc = new SparkContext(sparkConf)
>>
>> val ssc = new StreamingContext(sc, config.batchDuration)
>> import org.apache.spark.streaming.StreamingContext._
>>
>> ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)
>>
>> < some input reading actions >
>>
>> < some input transformation actions >
>>
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> import sqlContext.implicits._
>> sqlContext.sql(<an-HiveQL-query>)
>>
>> ssc.start()
>> ssc.awaitTerminationOrTimeout(config.timeout)
>>
>> ---------------
>>
>> What happens is that, right after launch, the driver fails with the
>> exception:
>>
>> 15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>>   at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
>>   at org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
>>   at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
>>   at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
>>   at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
>>   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
>>   at .... myDriver.scala: < line of the sqlContext.sql(query) >
>> Caused by: < some stuff >
>> Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>> NestedThrowables:
>> java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
>> ...
>> Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
>>
>> Thinking about a wrong Hive installation/configuration or a libs/classpath
>> problem, I SSHed into the cluster and launched a spark-shell. Excluding the
>> app configuration and the StreamingContext usage/definition, I then carried
>> out all the actions listed in the driver implementation, in particular all
>> the Hive-related ones, and they all went through smoothly!
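>>
>> (Just to illustrate the kind of check that succeeds in the shell; this is a
>> minimal sketch, and "SHOW TABLES" merely stands in for my actual queries:
>>
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> sqlContext.sql("SHOW TABLES").collect().foreach(println)
>>
>> From the shell, this reaches the metastore without any error.)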
>>
>> I also tried to use the optional "-h" argument
>> (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
>> in the install-spark emr-bootstrap-action, but the driver failed the very
>> same way. Furthermore, when launching a spark-shell (on the EMR cluster with
>> Spark installed with the "-h" option), I also got:
>>
>> 15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
>> 15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectStore, initialize called
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
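>>
>> (To see where the duplicate registrations come from, a quick listing along
>> these lines would show all the copies on the box; just an illustrative
>> command, not something from the install docs:
>>
>> find /home/hadoop -name 'datanucleus-*.jar'
>>
>> Indeed, the warnings above point at copies under both /home/hadoop/spark/...
>> and /home/hadoop/.versions/spark-1.3.1.e/....)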
>>
>> Hence, since the queries with the HiveContext worked from the spark-shell on
>> an EMR cluster with Spark installed without the "-h" option, I can assume
>> the required libs were actually placed where they were supposed to be.
>>
>> Finally, the question: why does my driver fail when executing a query using
>> the HiveContext, whereas the spark-shell doesn't?
>>
>> Sorry for the length of this mail, but I tried to describe the environment
>> and the actions I carried out so as to better explain the problem.
>>
>> Thanks to everyone who will try to help me fix this.
>>
>> Roberto