compains opened a new issue, #10057: URL: https://github.com/apache/hudi/issues/10057
**Describe the problem you faced**

Until a few days ago I was able to read JSON files from S3, do some operations, and save the result as Hudi. Hudi was configured to sync metadata with the Hive metastore, so I was able to query the data using Trino. For some reason we had to move to another server, and since then it has stopped working.

**To Reproduce**

Steps to reproduce the behavior:

1. Read some JSON from S3.
2. Write that DataFrame in Hudi format to S3 (a minimal sketch of this write is included after the configuration blocks below).

**Expected behavior**

The new files are present in S3 (this still happens) and the Hive metastore is updated (this is where it fails; if I set `hoodie.datasource.hive_sync.enable` to `False`, the task finishes properly).

**Environment Description**

* Hudi version : org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0
* Spark version : 3.4.1
* Hive version : bitsondatadev/hive-metastore:latest
* Hadoop version : org.apache.hadoop:hadoop-aws:3.3.2
* Storage (HDFS/S3/GCS..) : S3; the same application is reading and writing on that bucket
* Running on Docker? (yes/no) : Hive yes, Spark no

**Additional context**

Spark configuration:

```
import pyspark
from pyspark import SparkConf

conf = SparkConf()
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.InstanceProfileCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', 'the access id')
conf.set('spark.hadoop.fs.s3a.secret.key', 'the secret')
conf.set('spark.hadoop.fs.s3a.awsAccessKeyId', 'the access id')
conf.set('spark.hadoop.fs.s3a.awsSecretAccessKey', 'the secret')
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

spark = pyspark.sql.SparkSession.builder.appName(
    "hive_sync_test"
).config(
    "spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
).config(
    "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
).config(
    "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).config(
    'spark.jars.packages', 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2'
).config(
    "spark.hadoop.datanucleus.schema.autoCreateTables", "true"
).config(
    "spark.sql.legacy.parquet.nanosAsLong", "false"
).config(
    "spark.sql.parquet.binaryAsString", "false"
).config(
    "spark.sql.parquet.int96AsTimestamp", "true"
).config(
    "spark.sql.caseSensitive", "false"
).config(
    "spark.worker.cleanup.enabled", True
).config(
    "spark.worker.cleanup.interval", 60
).config(
    "spark.worker.cleanup.appDataTtl", 108000
).config(
    "spark.cores.max", 2
).config(
    conf=conf
).master(
    'spark://master:7077'
).getOrCreate()

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "the key again")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "the secret again")
spark.sparkContext.setLogLevel("WARN")

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("spark.files.overwrite", "true")
```

Hudi options:

```
s3_bucket_name = 's3a://the-bucket-name'
table_name = 'some_test'
hive_meta_store_url: str = 'thrift://hive-metastore:9083'

hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': "COPY_ON_WRITE",
    'hoodie.datasource.write.recordkey.field': 'guid',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.precombine.field': 'sent_to_dl',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.enable': True,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': False,
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.metastore.uris': f'{hive_meta_store_url}/{table_name}',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hive.input.format': 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
}
```

The repeated configurations are there because I have tried every possibility I found out there.
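For reference, a minimal sketch of the write that triggers the failure; the input and target paths below are placeholders for the real job, which first reads the JSON files from S3:

```
# Sketch of the reproduction steps; paths are placeholders, hudiOptions and
# s3_bucket_name/table_name are the values defined above.
df = spark.read.json(f'{s3_bucket_name}/raw/{table_name}/')  # step 1: read some JSON from S3

(
    df.write.format('hudi')
    .options(**hudiOptions)
    .mode('append')
    .save(f'{s3_bucket_name}/{table_name}/')  # step 2: data lands in S3, but the hive sync fails
)
```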
**Stacktrace**

```
Py4JJavaError: An error occurred while calling o100.save.
: org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
	at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:993)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:991)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1089)
	at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:441)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing some_test
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:168)
	at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:79)
	... 48 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: failed to create table some_test
	at org.apache.hudi.hive.ddl.HMSDDLExecutor.createTable(HMSDDLExecutor.java:140)
	at org.apache.hudi.hive.HoodieHiveSyncClient.createTable(HoodieHiveSyncClient.java:235)
	at org.apache.hudi.hive.HiveSyncTool.syncFirstTime(HiveSyncTool.java:329)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:251)
	at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:177)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:165)
	... 49 more
Caused by: MetaException(message:Got exception: java.nio.file.AccessDeniedException the-bucket-name: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.AmazonServiceException: Unauthorized (Service: null; Status Code: 401; Error Code: null; Request ID: null))
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:42225)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:42193)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:42119)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:88)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1203)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1189)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2396)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:93)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:750)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:738)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
	at jdk.proxy2/jdk.proxy2.$Proxy67.createTable(Unknown Source)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2327)
	at jdk.proxy2/jdk.proxy2.$Proxy67.createTable(Unknown Source)
	at org.apache.hudi.hive.ddl.HMSDDLExecutor.createTable(HMSDDLExecutor.java:137)
	... 54 more
```