[ https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu reassigned HUDI-281:
-------------------------------

    Assignee:     (was: Raymond Xu)

> HiveSync failure through Spark when useJdbc is set to false
> -----------------------------------------------------------
>
>                 Key: HUDI-281
>                 URL: https://issues.apache.org/jira/browse/HUDI-281
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Hive Integration, Spark Integration, Usability
>            Reporter: Udit Mehrotra
>            Priority: Major
>              Labels: query-eng, user-support-issues
>             Fix For: 0.11.0, 0.10.1
>
>
> Table creation with Hive sync through Spark fails when I set *useJdbc* to
> *false*. Currently I had to modify the code to set *useJdbc* to *false*,
> because there is no *DataSourceOption* through which I can specify this
> field when running Hudi code (a sketch of what such an option could look
> like follows the stack trace below).
> Here is the failure:
> {noformat}
> java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
>   at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
>   at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
>   at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> {noformat}
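> As referenced above, here is a minimal sketch of what a write with such an
> option could look like. The existing hive-sync options are real; the
> *hoodie.datasource.hive_sync.use_jdbc* key is an assumption, named only to
> show the shape of the *DataSourceOption* this issue is asking for, and
> *inputDF* stands for any source DataFrame:
> {code:scala}
> import org.apache.spark.sql.{SaveMode, SparkSession}
>
> val spark = SparkSession.builder().appName("hudi-hive-sync-example").getOrCreate()
> val inputDF = spark.read.json("s3://bucket/input") // any source data
>
> inputDF.write
>   .format("org.apache.hudi")
>   .option("hoodie.table.name", "my_table")
>   .option("hoodie.datasource.write.recordkey.field", "id")
>   .option("hoodie.datasource.write.precombine.field", "ts")
>   .option("hoodie.datasource.hive_sync.enable", "true")
>   .option("hoodie.datasource.hive_sync.table", "my_table")
>   // Hypothetical option: expose useJdbc here instead of requiring a code change.
>   .option("hoodie.datasource.hive_sync.use_jdbc", "false")
>   .mode(SaveMode.Append)
>   .save("s3://bucket/hudi/my_table")
> {code}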
> As for the failure itself: I was expecting this to fail through Spark,
> because *hive-exec* is not shaded inside *hudi-spark-bundle*, while
> *HiveConf* is shaded and relocated. This *SessionState* comes from the
> spark-hive jar, and it obviously does not accept the relocated *HiveConf*.
> We in *EMR* are running into the same problem when trying to integrate with
> the Glue Catalog. For this we have to create the Hive metastore client
> through *Hive.get(conf).getMSC()* instead of how it is done now, so that
> alternate metastore implementations can be created (see the sketch at the
> end of this description). However, because hive-exec is not shaded while
> HiveConf is relocated, we run into the same issue there.
> Shading *hive-exec* would not be recommended either, because it is itself
> an uber jar that shades a lot of dependencies, all of which would end up in
> the *hudi-spark-bundle* jar. We would not want to go down that route. That
> is why we suggest considering removing the shading of Hive libraries
> altogether.
> We could add a *Maven profile* to keep the shading available, but the
> unshaded build would have to be the default; otherwise the default build
> will fail when *useJdbc* is set to false, and again later when we commit
> the *Glue Catalog* changes.
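> To make the Glue Catalog point above concrete, a minimal sketch of the
> suggested client-creation path, assuming the stock Hive API (the helper
> name is ours; the key detail is that *Hive.get(conf)* lets Hive pick the
> metastore client implementation, which only works if the *HiveConf* passed
> in is the original, unrelocated class):
> {code:scala}
> import org.apache.hadoop.hive.conf.HiveConf
> import org.apache.hadoop.hive.metastore.IMetaStoreClient
> import org.apache.hadoop.hive.ql.metadata.Hive
>
> // Sketch: let Hive decide which metastore client to build from the conf,
> // instead of instantiating HiveMetaStoreClient directly. On EMR this is
> // what would allow the Glue Catalog client to be plugged in. It breaks
> // today because the bundle hands Hive a relocated HiveConf.
> def createMetaStoreClient(conf: HiveConf): IMetaStoreClient =
>   Hive.get(conf).getMSC()
> {code}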