A follow-up on hive-site.xml:
1. If you specify it in spark/conf, you can NOT also apply it via the --driver-class-path option; otherwise you will get the following exception when initializing the SparkContext:

org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
2. If you use --driver-class-path, you need to unset SPARK_CLASSPATH. The flip side is that you must then provide all the related JARs (hadoop-yarn, hadoop-common, hdfs, etc.) that the "hadoop-provided" profile would normally supply, if you built your JARs with -Phadoop-provided, plus any other common libraries that are required; see the sketch below.
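
A minimal sketch of option 2, assuming a YARN deployment; the paths, class, and JAR names below are illustrative placeholders, not taken from any particular setup:

# SPARK_CLASSPATH conflicts with --driver-class-path, so unset it first
unset SPARK_CLASSPATH

# Put the directory holding hive-site.xml plus the "hadoop-provided" JARs
# on the driver classpath (example locations; adjust to your installation)
spark-submit \
  --master yarn-cluster \
  --driver-class-path "/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*" \
  --class com.example.HiveApp \
  hive-app.jar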

From: alee...@hotmail.com
To: user@spark.apache.org
CC: lian.cs....@gmail.com; linlin200...@gmail.com; huaiyin....@gmail.com
Subject: RE: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
Date: Mon, 29 Dec 2014 16:01:26 -0800

Hi All,
I have tried to pass the properties via SparkContext.setLocalProperty and HiveContext.setConf; both failed. Based on the results (I haven't had a chance to look into the code yet), HiveContext appears to initiate the JDBC connection to the metastore right away, so I couldn't set these properties dynamically prior to running any SQL statement.
The only way I could get it to work is to put these properties in hive-site.xml, which did work for me. I'm wondering if there is a better way to specify these Hive configurations dynamically, such as a --hiveconf option or a user-supplied hive-site.xml.
On a shared cluster, hive-site.xml is shared and cannot be managed per user on the same edge server, especially when it contains a personal password for metastore access. What would be the best way to pass these 3 properties to spark-shell?

javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
javax.jdo.option.ConnectionURL
According to the HiveContext documentation, hive-site.xml is picked up from the classpath. Is there any way to specify this dynamically for each spark-shell session? (A possible workaround is sketched after the quote below.)
"An instance of the Spark SQL execution engine that integrates with data stored 
in Hive. Configuration for Hive is read from hive-site.xml on the classpath."
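
One workaround worth sketching here, under the assumption that entries passed via --driver-class-path take precedence over spark/conf on the driver classpath (worth verifying against your Spark build): keep a private, per-user copy of hive-site.xml in your own directory and put that directory on the classpath when launching each spark-shell session. Paths are hypothetical:

mkdir -p ~/hive-conf
cp my-hive-site.xml ~/hive-conf/hive-site.xml   # private copy with personal metastore credentials
chmod 700 ~/hive-conf                           # the file contains a password, so restrict access

unset SPARK_CLASSPATH    # avoid the SparkException quoted at the top of this thread
spark-shell --driver-class-path ~/hive-conf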

Here are the test cases I ran, on Spark 1.2.0.

Test Case 1

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

sc.setLocalProperty("javax.jdo.option.ConnectionUserName","foo")
sc.setLocalProperty("javax.jdo.option.ConnectionPassword","xxxxxxxxxx")
sc.setLocalProperty("javax.jdo.option.ConnectionURL","jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

import hiveContext._

// Create table and clean up data
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, value STRING)")
// Encounter error here: it picks up the default user 'APP'@'localhost' and creates
// metastore_db in the current local directory, not honoring the JDBC settings for the
// metastore on MySQL.

Test Case 2

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionUserName","foo")
// Encounter error right here: it looks like HiveContext tries to initiate the
// JDBC connection prior to any settings from setConf.
hiveContext.setConf("javax.jdo.option.ConnectionPassword","xxxxxxxxxxx")
hiveContext.setConf("javax.jdo.option.ConnectionURL","jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")

From: huaiyin....@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
To: linlin200...@gmail.com
CC: lian.cs....@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, the Spark driver runs inside the application master, so the Hive configuration is not accessible to the driver. Can you try to set those confs by using hiveContext.setConf(...)? Or, maybe you can copy hive-site.xml to spark/conf on the node running the application master.

On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao <linlin200...@gmail.com> wrote:

Hi Yin,

hive-site.xml was copied to spark/conf and is the same as the one under $HIVE_HOME/conf.

Through the hive cli, I don't see any problem. But for Spark in yarn-cluster mode, I am not able to switch to a database other than the default one; in yarn-client mode, it works fine.

Thanks!

Jenny

On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai <huaiyin....@gmail.com> wrote:

Hi Jenny,
Have you copied hive-site.xml to the spark/conf directory? If not, can you put it in conf/ and try again?
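
For reference, a minimal sketch of that copy step, assuming the usual HIVE_HOME and SPARK_HOME environment variables are set:

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/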

Thanks,
Yin

On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao <linlin200...@gmail.com> wrote:

Thanks Yin!

Here is my hive-site.xml, which I copied from $HIVE_HOME/conf. I didn't experience any problem connecting to the metastore through hive, which uses DB2 as the metastore database.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
 <property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
 </property>
 <property>
  <name>hive.querylog.location</name>
  <value>/var/ibm/biginsights/hive/query/${user.name}</value>
 </property>
 <property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/biginsights/hive/warehouse</value>
 </property>
 <property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.12.0.war</value>
 </property>
 <property>
  <name>hive.metastore.metrics.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.ibm.db2.jcc.DB2Driver</value>
 </property>
 <property>
  <name>hive.stats.autogather</name>
  <value>false</value>
 </property>
 <property>
  <name>javax.jdo.mapping.Schema</name>
  <value>HIVE</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>catalog</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>V2pJNWMxbFlVbWhaZHowOQ==</value>
 </property>
 <property>
  <name>hive.metastore.password.encrypt</name>
  <value>true</value>
 </property>
 <property>
  <name>org.jpox.autoCreateSchema</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.server2.thrift.min.worker.threads</name>
  <value>5</value>
 </property>
 <property>
  <name>hive.server2.thrift.max.worker.threads</name>
  <value>100</value>
 </property>
 <property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
 </property>
 <property>
  <name>hive.server2.thrift.bind.host</name>
  <value>hdtest022.svl.ibm.com</value>
 </property>
 <property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
 </property>
 <property>
  <name>hive.server2.custom.authentication.class</name>
  <value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value>
 </property>
 <property>
  <name>hive.server2.enable.impersonation</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.webconsole.url</name>
  <value>http://hdtest022.svl.ibm.com:8080</value>
 </property>
 <property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
 </property>
</configuration>

On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai <huaiyin....@gmail.com> wrote:

Hi Jenny,

How's your metastore configured for both Hive and Spark SQL? Which metastore 
mode are you using (based on 
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin)?

Thanks,

Yin


On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao <linlin200...@gmail.com> wrote:

You can reproduce this issue with the following steps (assuming you have a Yarn cluster + Hive 0.12):

1) Using the hive shell, create a database, e.g.: create database ttt

2) Write a simple Spark SQL program:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext

object HiveSpark {
  case class Record(key: Int, value: String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HiveSpark")
    val sc = new SparkContext(sparkConf)

    // A hive context creates an instance of the Hive Metastore in process
    val hiveContext = new HiveContext(sc)
    import hiveContext._

    hql("use ttt")
    hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hql("LOAD DATA INPATH '/user/biadmin/kv1.txt' INTO TABLE src")

    // Queries are expressed in HiveQL
    println("Result of 'SELECT *': ")
    hql("SELECT * FROM src").collect.foreach(println)

    sc.stop()
  }
}
3) Run it in yarn-cluster mode, e.g. with spark-submit as sketched below.
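
A hedged example of the submit command for step 3; the JAR name is a placeholder, and the class name matches the listing above:

spark-submit \
  --class HiveSpark \
  --master yarn-cluster \
  hive-spark-repro.jar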


On Mon, Aug 11, 2014 at 9:44 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

Since you were using hql(...), it's probably not related to the JDBC driver. But I failed to reproduce this issue locally with a single-node pseudo-distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks

On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

Hi Jenny, does this issue only happen when running Spark SQL with YARN in your 
environment?

On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao <linlin200...@gmail.com> wrote:

Hi,

I am able to run my hql query in yarn-cluster mode when connecting to the default hive metastore defined in hive-site.xml.

However, if I want to switch to a different database, like:

  hql("use other-database")

it only works in yarn-client mode, but fails in yarn-cluster mode with the following stack:

14/08/08 12:09:11 INFO HiveMetaStore: 0: get_database: tt
14/08/08 12:09:11 INFO audit: ugi=biadmin       ip=unknown-ip-addr      cmd=get_database: tt
14/08/08 12:09:11 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named tt)
        at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
        at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
        at $Proxy15.getDatabase(Unknown Source)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
        at $Proxy17.get_database(Unknown Source)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
        at $Proxy18.getDatabase(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
        at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
        at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
        at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
        at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:272)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:269)
        at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:86)
        at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:91)
        at org.apache.spark.examples.sql.hive.HiveSpark$.main(HiveSpark.scala:35)
        at org.apache.spark.examples.sql.hive.HiveSpark.main(HiveSpark.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)

14/08/08 12:09:11 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: tt
        at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
        at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:208)
        at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:182)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:272)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:269)
        at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:86)
        at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:91)
        at org.apache.spark.examples.sql.hive.HiveSpark$.main(HiveSpark.scala:35)
        at org.apache.spark.examples.sql.hive.HiveSpark.main(HiveSpark.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
Why is that? I'm not sure if this has something to do with the hive JDBC driver?

Thank you!


Jenny
