Udit Mehrotra created HUDI-281:
----------------------------------

             Summary: HiveSync failure through Spark when useJdbc is set to false
                 Key: HUDI-281
                 URL: https://issues.apache.org/jira/browse/HUDI-281
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Hive Integration, Spark datasource
            Reporter: Udit Mehrotra
Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code. Here is the failure:

{noformat}
java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
	at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
	at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
{noformat}

I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. The *SessionState* class is coming from the spark-hive jar, and it obviously does not accept the relocated *HiveConf*.

We on *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this, we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate metastore implementations can be created. However, because *hive-exec* is not shaded while *HiveConf* is relocated, we run into the same issue there. Shading *hive-exec* would not be recommended either, because it is itself an uber jar that shades a lot of things, all of which would end up in the *hudi-spark-bundle* jar.
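For context, the relocated class name in the stack trace (*org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf*) suggests the bundle pom carries a maven-shade-plugin relocation roughly like the sketch below. The exact pom contents are not quoted here, so treat the pattern values as assumptions inferred from the trace:

{code:xml}
<!-- Hypothetical sketch of the hudi-spark-bundle relocation that triggers the
     NoSuchMethodError above: HiveConf gets rewritten into the shaded package,
     but the unshaded SessionState.start(...) from spark-hive still expects the
     original org.apache.hadoop.hive.conf.HiveConf in its method signature. -->
<relocation>
  <pattern>org.apache.hadoop.hive.</pattern>
  <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.</shadedPattern>
</relocation>
{code}

Because JVM method resolution matches on the exact parameter types, the relocated *HiveConf* and the original *HiveConf* are different classes, hence the *NoSuchMethodError*.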
We would not want to go down that route. That is why we suggest considering removing any shading of the Hive libraries. We could add a *Maven Profile* to do the shading, but that means it would have to be activated by default; otherwise the default build will fail when *useJdbc* is set to *false*, and later when we commit the *Glue Catalog* changes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)