Hi!

I'm struggling with an issue with Spark 1.3.1 running on YARN on an AWS
EMR cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop
2.4, etc.). I use the AWS emr-bootstrap-action *install-spark* (
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with
the option/version *-v1.3.1e* so as to get the latest Spark for EMR
installed and available.

I also have a simple Spark Streaming driver in my project. This driver is
part of a larger Maven project; in its *pom.xml* I'm currently using:

[...]

    <scala.binary.version>2.10</scala.binary.version>
    <scala.version>2.10.4</scala.version>
    <java.version>1.7</java.version>
    <spark.version>1.3.1</spark.version>
    <hadoop.version>2.4.1</hadoop.version>

[...]

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>



In fact, at compile and build time everything works just fine when, in my
driver, I have:


-------------

val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)

val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._
ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)

< some input reading actions >

< some input transformation actions >

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
sqlContext.sql(<an-HiveQL-query>)

ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)

---------------


What happens is that, right after being launched, the driver fails with
the following exception:


15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
    at org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
    at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
    at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
    at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
    at .... myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory


Suspecting a wrong Hive installation/configuration or a broken
libs/classpath definition, I SSHed into the cluster and launched a
*spark-shell*. Excluding the app configuration and the StreamingContext
usage/definition, I then carried out all the actions listed in the driver
implementation, in particular all the Hive-related ones, and they all went
through smoothly!
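
The Hive-related part of the shell session was essentially the following
(a minimal sketch; "show tables" here is just a stand-in for my actual
HiveQL query):

// Run from spark-shell, which already provides `sc`.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

// Any HiveQL statement serves as a smoke test for the metastore client.
sqlContext.sql("show tables").collect().foreach(println)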


I also tried the optional *-h* argument (
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
of the install-spark emr-bootstrap-action, but the driver failed in the
very same way. Furthermore, when launching a spark-shell (on an EMR cluster
with Spark installed with the "-h" option), I also got:


15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectStore, initialize called
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."



Hence, since the queries with the HiveContext worked from the spark-shell
on an EMR cluster with Spark installed without the "-h" option, I can
assume the required libs were actually placed where they were supposed to
be.
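
For completeness, the quickest probe for the class the stack trace
complains about would be something like the following (again just a
sketch, suitable for pasting into the shell or temporarily into the
driver):

// Check whether the JDO implementation class is visible at all from the
// current classloader; the driver's stack trace says it is not there.
try {
  Class.forName("org.datanucleus.api.jdo.JDOPersistenceManagerFactory")
  println("JDOPersistenceManagerFactory found on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println("JDOPersistenceManagerFactory NOT found on the classpath")
}

From the shell this presumably prints "found" (the HiveContext works
there); inside the YARN application master it evidently does not.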


Finally, the question: *why does my driver fail, whereas the spark-shell
doesn't, when executing a query using the HiveContext?*


Sorry for the length of this mail, but I tried to describe the environment
and the actions I carried out so as to better explain the problem.


Thanks to everyone who will try to help me fix this.


*Roberto*
