jonathantransb opened a new issue, #10232: URL: https://github.com/apache/hudi/issues/10232
**Describe the problem you faced**

I'm attempting to read a Hudi table registered in the Glue Catalog using Spark SQL with the metadata table enabled. The job hangs indefinitely at a certain step, and even with DEBUG logging enabled I can't find any indication of the cause. Notably, the problem only occurs on Hudi tables where `clean` is the latest action in the timeline.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table where `clean` is the latest action in the timeline:

   <img width="1061" alt="image" src="https://github.com/apache/hudi/assets/60864800/813c8190-8a47-48ed-8d8c-31dc2a29ff4b">

2. Open spark-shell:

   ```bash
   spark-shell \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.sql.parquet.filterPushdown=true" \
     --conf "spark.sql.parquet.mergeSchema=false" \
     --conf "spark.speculation=false" \
     --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
     --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
     --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
     --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
     --conf "spark.sql.catalogImplementation=hive" \
     --conf "spark.sql.catalog.spark_catalog.type=hive" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
     --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
     --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar"
   ```

3. Run the query via `spark.sql()`:

   ```scala
   scala> spark.sql("SET hoodie.metadata.enable=true")
   scala> spark.sql("SELECT * FROM <schema>.<table> LIMIT 50").show()
   ```

**Expected behavior**

The Spark job can read the table without hanging.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version : 2.3.9
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes

**Additional context**

I encountered no issues with Hudi 0.13.1; the problem appeared after upgrading to 0.14.0. For tables where `commit` is the latest action in the timeline, Hudi 0.14.0 reads the table without any hanging:

<img width="1069" alt="image" src="https://github.com/apache/hudi/assets/60864800/e8375fe0-7858-4779-b397-dc23f006c7dc">

The driver pod consistently uses a full CPU core, although I'm uncertain what it is busy doing:

<img width="554" alt="image" src="https://github.com/apache/hudi/assets/60864800/d11979f7-c4ac-4703-80dd-6e122152b8fb">

**Stacktrace**

```
23/12/03 12:21:23 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse
23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system started
23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/12/03 12:21:28 INFO DataSourceUtils: Getting table path..
23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path : s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path: s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/hoodie.properties
23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:28 INFO TableSchemaResolver: Reading schema from s3://<bucket>/<schema>/<table>/c_day=20231130/cce9afd1-46a1-4668-b8b4-0ac697f1ed57-0_3-21-2099_20231202191150326.parquet
23/12/03 12:21:29 INFO S3AInputStream: Switching to Random IO seek policy
23/12/03 12:21:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>
23/12/03 12:21:29 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/hoodie.properties
23/12/03 12:21:29 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://<bucket>/<schema>/<table>
23/12/03 12:21:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>/.hoodie/metadata
23/12/03 12:21:29 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/metadata/.hoodie/hoodie.properties
23/12/03 12:21:29 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3://<bucket>/<schema>/<table>/.hoodie/metadata
23/12/03 12:21:29 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__deltacommit__COMPLETED__20231202193207000]}
23/12/03 12:21:29 INFO AbstractTableFileSystemView: Took 2 ms to read 0 instants, 0 replaced file groups
23/12/03 12:21:30 INFO ClusteringUtils: Found 0 files in pending clustering operations
23/12/03 12:21:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:30 INFO BaseHoodieTableFileIndex: Refresh table stg_tracking_unified__click, spent: 365 ms
23/12/03 12:21:30 DEBUG HoodieFileIndex: Unable to compose relative partition path prefix from the predicates; falling back to fetching all partitions
23/12/03 12:21:30 INFO HoodieTableMetadataUtil: Loading latest merged file slices for metadata table partition files
23/12/03 12:21:30 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
23/12/03 12:21:30 INFO ClusteringUtils: Found 0 files in pending clustering operations
23/12/03 12:21:30 INFO AbstractTableFileSystemView: Building file system view for partition (files)
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: #files found in partition (files) =15, Time taken =26
23/12/03 12:21:30 DEBUG HoodieTableFileSystemView: Adding file-groups for partition :files, #FileGroups=1
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: addFilesToView: NumFiles=15, NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=1
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: Time to load partition (files) =40
23/12/03 12:21:30 INFO HoodieBackedTableMetadata: Opened metadata base file from s3://<bucket>/<schema>/<table>/.hoodie/metadata/files/files-0000_0-31-2210_20231201193545608001.hfile at instant 20231201193545608001 in 14 ms
23/12/03 12:21:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
[It's stuck at this point.
No further logs are printed after this]
```
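For reference, since the trigger condition is "`clean` is the latest action in the timeline", here is a rough sketch of how I check that from a listing of the `.hoodie/` folder. This is only an illustration under my assumptions about file naming (completed instants as `<instantTime>.<action>` files, pending ones carrying `.requested`/`.inflight` suffixes), not Hudi's actual API:

```python
# Sketch under an assumption about timeline file naming (not Hudi internals):
# completed instants appear as "<instantTime>.<action>", pending ones end in
# ".requested" or ".inflight". Picking the max completed instant tells us
# whether `clean` is the latest action, matching the screenshot above.

COMPLETED_ACTIONS = {"commit", "deltacommit", "clean", "replacecommit", "rollback"}

def latest_completed_action(timeline_files):
    """Return (instant_time, action) of the newest completed instant, or None."""
    completed = []
    for name in timeline_files:
        parts = name.split(".")
        # Exactly two parts and a known action => a completed instant file
        if len(parts) == 2 and parts[1] in COMPLETED_ACTIONS:
            completed.append((parts[0], parts[1]))
    # Instant times are fixed-width timestamps, so tuple max is chronological
    return max(completed) if completed else None

# Example listing mirroring the timeline shown in the first screenshot:
listing = [
    "20231202191150326.commit",
    "20231202193157845.clean.requested",
    "20231202193157845.clean.inflight",
    "20231202193157845.clean",
]
print(latest_completed_action(listing))  # ('20231202193157845', 'clean')
```

On my table this check agrees with the timeline screenshot: the hang occurs exactly when the tuple returned ends in `clean`.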