Hi! I'm struggling with an issue with Spark 1.3.1 running on YARN on an AWS EMR cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, with Hive 0.13 already installed and configured, Hadoop 2.4, etc.). I install Spark via the AWS emr-bootstrap-action *install-spark* ( https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark ) with the version option *-v1.3.1e*, so as to get the latest Spark build for EMR installed and available.
I also have a simple Spark Streaming driver in my project. The driver is part of a larger Maven project; in the *pom.xml* I'm currently using:

-------------
[...]
<scala.binary.version>2.10</scala.binary.version>
<scala.version>2.10.4</scala.version>
<java.version>1.7</java.version>
<spark.version>1.3.1</spark.version>
<hadoop.version>2.4.1</hadoop.version>
[...]
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
-------------

In fact, at compile and build time everything works just fine if, in my driver, I have:

-------------
val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._
ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)

< some input reading actions >
< some input transformation actions >

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
sqlContext.sql(<an-HiveQL-query>)

ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)
-------------

What happens is that, right after being launched, the driver fails with the exception:
-------------
15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
    at org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
    at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
    at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
    at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
    at .... myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
-------------

Suspecting a wrong Hive installation/configuration, or a problem with the libs/classpath definition, I SSHed into the cluster and launched a *spark-shell*. Leaving out the app configuration and the StreamingContext usage/definition, I then carried out all the actions listed in the driver implementation, in particular all the Hive-related ones, and they all went through smoothly! I also tried the optional *-h* argument ( https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional ) of the install-spark emr-bootstrap-action, but the driver failed in the very same way.
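One difference I can think of between the two scenarios: the spark-shell driver runs on the master node, where install-spark laid out the DataNucleus jars, while in yarn-cluster mode my driver runs inside the ApplicationMaster, which only sees what gets shipped with the application. As a next step I was planning to try shipping those jars explicitly via spark-submit, roughly like this (the jar path is copied from the DataNucleus warnings quoted below; the hive-site.xml location, the main class and the assembly jar name are placeholders, not my real ones, and the command below is only printed, not run):

```shell
# Sketch only: collect the DataNucleus jars that install-spark placed on the
# master (path copied from the DataNucleus warnings below) into a
# comma-separated list, then print the spark-submit command that would ship
# them to the YARN ApplicationMaster. Nothing is actually submitted here.
DN_JARS=$(echo /home/hadoop/spark/classpath/hive/datanucleus-*.jar | tr ' ' ',')

# com.example.MyDriver, myDriver-assembly.jar and the hive-site.xml path
# are placeholders/assumptions, not the real names from my project.
echo spark-submit \
  --master yarn-cluster \
  --jars "$DN_JARS" \
  --files /home/hadoop/hive/conf/hive-site.xml \
  --class com.example.MyDriver \
  myDriver-assembly.jar
```

Dropping the leading echo would actually submit the job; I haven't verified yet whether this actually fixes the metastore instantiation problem, so I'd be happy to hear if it is the right direction.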
Furthermore, when launching a spark-shell (on an EMR cluster with Spark installed with the "-h" option), I also got:

-------------
15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectStore, initialize called
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
-------------

Hence, since the queries with the HiveContext worked from the spark-shell on an EMR cluster with Spark installed *without* the "-h" option, I can assume the required libs were actually placed where they were supposed to be.

Finally, the question: *why does my driver fail when executing a query using the HiveContext, whereas the spark-shell doesn't?*

Sorry for the length of the mail, but I tried to describe the environment and the actions I carried out so as to better explain the problem. Thanks to everyone who will try to help me fix this.

*Roberto*
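P.S. Another thing I was considering, in case it's useful: instead of relying on the jars laid out on the cluster, bundling the DataNucleus artifacts into my assembly by declaring them explicitly, along these lines (the versions are copied from the jar names in the warnings above; I haven't verified this is a correct fix, so treat it as a sketch):

```xml
<!-- Sketch: declare the DataNucleus artifacts explicitly so they end up in
     the assembly jar instead of being expected on the cluster classpath.
     Versions are taken from the jar names in the warnings above. -->
<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-core</artifactId>
  <version>3.2.10</version>
</dependency>
<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-api-jdo</artifactId>
  <version>3.2.6</version>
</dependency>
<dependency>
  <groupId>org.datanucleus</groupId>
  <artifactId>datanucleus-rdbms</artifactId>
  <version>3.2.9</version>
</dependency>
```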