Udit Mehrotra created HUDI-281:
----------------------------------

             Summary: HiveSync failure through Spark when useJdbc is set to false
                 Key: HUDI-281
                 URL: https://issues.apache.org/jira/browse/HUDI-281
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Hive Integration, Spark datasource
            Reporter: Udit Mehrotra
Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code. Here is the failure:

{noformat}
java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
	at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
	at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
{noformat}

I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. The *SessionState* class is coming from the spark-hive jar, and it obviously does not accept the relocated *HiveConf*.

We on *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this, we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate metastore implementations can be created. However, because *hive-exec* is not shaded while *HiveConf* is relocated, we run into the same issue there. Shading *hive-exec* would not be recommended either, because it is itself an uber jar that shades a lot of things, all of which would end up in the *hudi-spark-bundle* jar.
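For context, the relocated class name in the stack trace (*org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf*) suggests the bundle pom carries a maven-shade-plugin relocation roughly like the sketch below. The exact pom contents are not quoted here, so treat the pattern values as assumptions inferred from the trace:

{code:xml}
<!-- Hypothetical sketch of the hudi-spark-bundle relocation that triggers the
     NoSuchMethodError above: HiveConf gets rewritten into the shaded package,
     but the unshaded SessionState.start(...) from spark-hive still expects the
     original org.apache.hadoop.hive.conf.HiveConf in its method signature. -->
<relocation>
  <pattern>org.apache.hadoop.hive.</pattern>
  <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.</shadedPattern>
</relocation>
{code}

Because JVM method resolution matches on the exact parameter types, the relocated *HiveConf* and the original *HiveConf* are different classes, hence the *NoSuchMethodError*.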
We would not want to go down that route. That is why we suggest considering removing any shading of the Hive libraries. We could add a *Maven Profile* to do the shading, but that means it would have to be activated by default; otherwise the default build will fail when *useJdbc* is set to *false*, and later when we commit the *Glue Catalog* changes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)