I confirm,

Christopher was very kind in helping me out here. The solution presented in the 
linked doc worked perfectly. IMO it should be linked in the official Spark 
documentation.

Thanks again,

Roberto


> On 20 Jun 2015, at 19:25, Bozeman, Christopher <bozem...@amazon.com> wrote:
> 
> We worked it out.  There were multiple items (like the location of the remote 
> metastore and DB user auth) to get right to make HiveContext happy in 
> yarn-cluster mode. 
> 
> For reference 
> https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/using-hivecontext-yarn-cluster.md
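> 
> As a rough sketch of what that ends up looking like (the hive-site.xml path, 
> driver class and jar name below are illustrative; the datanucleus JAR paths 
> are the ones the install-spark bootstrap action lays down), the job ships the 
> Hive config and the datanucleus JARs along with it:
> 
>   spark-submit --master yarn-cluster \
>     --files /home/hadoop/hive/conf/hive-site.xml \
>     --jars /home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar,/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar,/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar \
>     --class com.example.MyStreamingDriver my-driver-assembly.jar
> 
> with the remote metastore location and the DB user credentials set in the 
> hive-site.xml that gets shipped.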
> 
> -Christopher Bozeman 
> 
> 
> On Jun 20, 2015, at 7:24 AM, Andrew Lee <alee...@hotmail.com> wrote:
> 
>> Hi Roberto,
>> 
>> I'm not an EMR person, but it looks like the -h option is deploying the 
>> necessary datanucleus JARs for you.
>> The requirements for HiveContext are hive-site.xml and the datanucleus JARs. 
>> As long as these two are there, and Spark is compiled with -Phive, it should work.
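>> 
>> A quick way to confirm both are on the box (the datanucleus path below is the 
>> one install-spark uses on this AMI; the hive-site.xml location is a guess, 
>> adjust to your layout):
>> 
>>   ls /home/hadoop/spark/classpath/hive/datanucleus-*.jar  # metastore client deps
>>   ls /home/hadoop/spark/conf/hive-site.xml                # Hive config Spark should pick up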
>> 
>> spark-shell runs in yarn-client mode. Not sure whether your other 
>> application is running under the same mode or a different one. Try 
>> specifying yarn-client mode and see if you get the same result as 
>> spark-shell.
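>> 
>> For instance, something along these lines (the driver class and jar name are 
>> just placeholders):
>> 
>>   spark-submit --master yarn-client \
>>     --class com.example.MyStreamingDriver \
>>     my-driver-assembly.jar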
>> 
>> From: roberto.coluc...@gmail.com
>> Date: Wed, 10 Jun 2015 14:32:04 +0200
>> Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>> To: user@spark.apache.org
>> 
>> Hi!
>> 
>> I'm struggling with an issue with Spark 1.3.1 running on YARN on an AWS EMR 
>> cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, with 
>> Hive 0.13 already installed and configured, Hadoop 2.4, etc.). I use the AWS 
>> emr-bootstrap-action "install-spark" 
>> (https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with 
>> the option/version "-v1.3.1e" so as to get the latest Spark for EMR installed 
>> and available.
>> 
>> I also have a simple Spark Streaming driver, which is part of a larger Maven 
>> project; in its pom.xml I'm currently using:
>> 
>> [...]
>> 
>>     <scala.binary.version>2.10</scala.binary.version>
>>     <scala.version>2.10.4</scala.version>
>>     <java.version>1.7</java.version>
>>     <spark.version>1.3.1</spark.version>
>>     <hadoop.version>2.4.1</hadoop.version>
>> 
>> [...]
>> 
>>     <dependency>
>>       <groupId>org.apache.spark</groupId>
>>       <artifactId>spark-streaming_${scala.binary.version}</artifactId>
>>       <version>${spark.version}</version>
>>       <scope>provided</scope>
>>       <exclusions>
>>         <exclusion>
>>           <groupId>org.apache.hadoop</groupId>
>>           <artifactId>hadoop-client</artifactId>
>>         </exclusion>
>>       </exclusions>
>>     </dependency>
>> 
>>     <dependency>
>>       <groupId>org.apache.hadoop</groupId>
>>       <artifactId>hadoop-client</artifactId>
>>       <version>${hadoop.version}</version>
>>       <scope>provided</scope>
>>     </dependency>
>> 
>>     <dependency>
>>       <groupId>org.apache.spark</groupId>
>>       <artifactId>spark-hive_${scala.binary.version}</artifactId>
>>       <version>${spark.version}</version>
>>       <scope>provided</scope>
>>     </dependency>
>> 
>> In fact, at compile and build time everything works just fine when, in my 
>> driver, I have:
>> 
>> 
>> 
>> -------------
>> 
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.streaming.StreamingContext
>> 
>> val sparkConf = new SparkConf()
>>   .setAppName(appName)
>>   .set("spark.local.dir", "/tmp/" + appName)
>>   .set("spark.streaming.unpersist", "true")
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))
>> 
>> val sc = new SparkContext(sparkConf)
>> val ssc = new StreamingContext(sc, config.batchDuration)
>> import org.apache.spark.streaming.StreamingContext._
>> 
>> ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)
>> 
>> < some input reading actions >
>> 
>> < some input transformation actions >
>> 
>> // this is the part that later fails in yarn-cluster mode
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> import sqlContext.implicits._
>> 
>> sqlContext.sql(<an-HiveQL-query>)
>> 
>> ssc.start()
>> ssc.awaitTerminationOrTimeout(config.timeout)
>> 
>> ---------------
>> 
>> 
>> 
>> What happens is that, right after being launched, the driver fails with the 
>> following exception:
>> 
>> 
>> 
>> 15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: 
>> java.lang.RuntimeException: Unable to instantiate 
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
>> instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>>     at 
>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
>>     at 
>> org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
>>     at 
>> org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
>>     at 
>> org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
>>     at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
>>     at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
>>         at .... myDriver.scala: < line of the sqlContext.sql(query) >
>> Caused by < some stuff >
>> Caused by: javax.jdo.JDOFatalUserException: Class 
>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>> NestedThrowables:
>> java.lang.ClassNotFoundException: 
>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory
>> ...
>> Caused by: java.lang.ClassNotFoundException: 
>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory
>> 
>> Suspecting a wrong Hive installation/configuration or a libs/classpath 
>> problem, I SSHed into the cluster and launched a spark-shell. Leaving out 
>> the app configuration and the StreamingContext usage/definition, I then 
>> carried out all the actions listed in the driver implementation, in 
>> particular the Hive-related ones, and they all went through smoothly!
>> 
>> 
>> I also tried to use the optional "-h" argument 
>> (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
>>  in the install-spark emr-bootstrap-action, but the driver failed the very 
>> same way. Furthermore, when launching a spark-shell (on the EMR cluster with 
>> Spark installed with the "-h" option), I also got:
>> 
>> 
>> 
>> 15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
>> 15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store with 
>> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectStore, initialize called
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus" is already registered. Ensure you dont have multiple JAR 
>> versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/spark/classpath/hive/datanucleus-core-3.2.10.jar" is 
>> already registered, and you are trying to register an identical plugin 
>> located at URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
>> multiple JAR versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-rdbms-3.2.9.jar" 
>> is already registered, and you are trying to register an identical plugin 
>> located at URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
>> multiple JAR versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/spark/classpath/hive/datanucleus-rdbms-3.2.9.jar" is 
>> already registered, and you are trying to register an identical plugin 
>> located at URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus" is already registered. Ensure you dont have multiple JAR 
>> versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar" is already 
>> registered, and you are trying to register an identical plugin located at 
>> URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
>> multiple JAR versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-api-jdo-3.2.6.jar"
>>  is already registered, and you are trying to register an identical plugin 
>> located at URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
>> multiple JAR versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar" is already 
>> registered, and you are trying to register an identical plugin located at 
>> URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-rdbms-3.2.9.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
>> multiple JAR versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar" is already 
>> registered, and you are trying to register an identical plugin located at 
>> URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus" is already registered. Ensure you dont have multiple JAR 
>> versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-core-3.2.10.jar"
>>  is already registered, and you are trying to register an identical plugin 
>> located at URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/lib/datanucleus-core-3.2.10.jar."
>> 15/06/09 14:20:52 WARN DataNucleus.General: Plugin (Bundle) 
>> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
>> multiple JAR versions of the same plugin in the classpath. The URL 
>> "file:/home/hadoop/spark/classpath/hive/datanucleus-api-jdo-3.2.6.jar" is 
>> already registered, and you are trying to register an identical plugin 
>> located at URL 
>> "file:/home/hadoop/.versions/spark-1.3.1.e/classpath/hive/datanucleus-api-jdo-3.2.6.jar."
>> 
>> 
>> 
>> 
>> Hence, since the HiveContext queries worked from the spark-shell on an EMR 
>> cluster with Spark installed without the "-h" option, I can assume the 
>> required libs were actually placed where they were supposed to be.
>> 
>> 
>> 
>> Finally, the question: why does my driver fail, whereas the spark-shell 
>> doesn't, when executing a query using the HiveContext?
>> 
>> 
>> 
>> Sorry for the length of this mail, but I tried to describe the environment 
>> and the actions I carried out so as to better explain the problem.
>> 
>> 
>> 
>> Thanks to everyone who will try to help me fix this.
>> 
>> 
>> Roberto
