Hi Roberto,
I'm not an EMR person, but it looks like option -h is deploying the necessary
datanucleus JARs for you. The requirements for HiveContext are hive-site.xml and
the datanucleus JARs. As long as those two are in place, and Spark is compiled
with -Phive, it should work.
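A quick sanity check (just a sketch; the class name is taken from your stack
trace below) is to probe the classpath from wherever your code runs:

    // Classpath probe: succeeds only if the DataNucleus JDO binding needed
    // by the Hive metastore is visible to the current class loader.
    try {
      Class.forName("org.datanucleus.api.jdo.JDOPersistenceManagerFactory")
      println("datanucleus-api-jdo found on the classpath")
    } catch {
      case _: ClassNotFoundException =>
        println("datanucleus-api-jdo MISSING from the classpath")
    }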
spark-shell runs in yarn-client mode. I'm not sure whether your other
application runs in the same mode or a different one. Try specifying yarn-client
mode explicitly and see if you get the same result as spark-shell.
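For example (a minimal sketch; appName is a placeholder), the mode can be
forced from the driver itself; note that values set programmatically on
SparkConf take precedence over spark-submit flags:

    import org.apache.spark.SparkConf

    // Force the same deploy mode spark-shell uses (yarn-client).
    val conf = new SparkConf()
      .setAppName(appName)      // placeholder
      .setMaster("yarn-client")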
From: roberto.coluc...@gmail.com
Date: Wed, 10 Jun 2015 14:32:04 +0200
Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
To: user@spark.apache.org

Hi!
I'm struggling with an issue with Spark 1.3.1 running on YARN on an AWS EMR
cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, with
Hive 0.13 already installed and configured, Hadoop 2.4, etc.). I use the AWS
emr-bootstrap-action "install-spark"
(https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with the
option/version "-v1.3.1e" so as to get the latest Spark for EMR installed and
available.
I also have a simple Spark Streaming driver in my project. The driver is part
of a larger Maven project; in the pom.xml I'm currently using:
[...]
    <scala.binary.version>2.10</scala.binary.version>
    <scala.version>2.10.4</scala.version>
    <java.version>1.7</java.version>
    <spark.version>1.3.1</spark.version>
    <hadoop.version>2.4.1</hadoop.version>
[...]
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>


    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>


    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>

In fact, at compile and build time everything works just fine if, in my driver, 
I have:
-------------
val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)

< some input reading actions >
< some input transformation actions >

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

sqlContext.sql(<an-HiveQL-query>)

ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)

--------------- 
What happens is that, right after launch, the driver fails with the exception:
15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
    at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
    at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
    at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
    at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
        at .... myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Suspecting a wrong Hive installation/configuration or a libs/classpath problem,
I SSHed into the cluster and launched a spark-shell. Leaving aside the app
configuration and the StreamingContext usage/definition, I then carried out all
the actions listed in the driver implementation, in particular all the
Hive-related ones, and they all went through smoothly!
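(Roughly, the shell test amounted to the following; the query here is just a
placeholder for my driver's actual HiveQL:)

    // sc is provided by spark-shell
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import sqlContext.implicits._
    sqlContext.sql("SHOW TABLES").collect().foreach(println)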

I also tried the optional "-h" argument
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
of the install-spark emr-bootstrap-action, but the driver failed in the very
same way. Furthermore, when launching a spark-shell (on the EMR cluster with
Spark installed with the "-h" option), I also got:
15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectStore, initialize called
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-rdbms-3.2.9.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-api-jdo-3.2.6.jar" 
is already registered, and you are trying to register an identical plugin 
located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
is already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-core-3.2.10.jar"
 is already registered, and you are trying to register an identical plugin 
located at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
"org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL 
"file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."

Hence, since the queries with the HiveContext worked from the spark-shell on an 
EMR cluster with Spark installed without the "-h" option, I can assume the 
required libs were actually placed where they were supposed to be.
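(For completeness, here is a quick check one could run inside the driver to
compare its effective classpath with the spark-shell's; this is a diagnostic
sketch, not part of my actual code:)

    // Print every URL visible to the context class loader.
    import java.net.URLClassLoader
    Thread.currentThread.getContextClassLoader match {
      case ucl: URLClassLoader => ucl.getURLs.foreach(println)
      case other               => println("non-URL class loader: " + other)
    }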
Finally, the question: why does my driver fail, whereas the spark-shell
doesn't, when executing a query using the HiveContext?
Sorry for the length of this mail, but I tried to describe the environment and
the actions I carried out so as to better explain the problem.
Thanks to everyone who will try to help me fix this.
Roberto
