jonathantransb opened a new issue, #10232:
URL: https://github.com/apache/hudi/issues/10232
**Describe the problem you faced**
I'm attempting to read a Hudi table on Glue Catalog using SparkSQL with
metadata enabled. However, my job appears to hang indefinitely at a certain
step. Despite enabling DEBUG logs, I'm unable to find any indications of what
may be causing this issue. Notably, this problem only occurs with Hudi tables
where `clean` is the latest action in the timeline.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a Hudi table where `clean` is the latest action in the timeline
https://github.com/apache/hudi/assets/60864800/813c8190-8a47-48ed-8d8c-31dc2a29ff4b
2. Open spark-shell
```bash
spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.parquet.filterPushdown=true" \
  --conf "spark.sql.parquet.mergeSchema=false" \
  --conf "spark.speculation=false" \
  --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
  --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
  --conf "spark.sql.catalogImplementation=hive" \
  --conf "spark.sql.catalog.spark_catalog.type=hive" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar"
```
3. Run spark.sql():
```scala
scala> spark.sql("SET hoodie.metadata.enable=true")
scala> spark.sql("SELECT * FROM . LIMIT 50").show()
```
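To confirm that `clean` is the latest action, one can list the completed instant files under the table's `.hoodie` directory and take the lexicographically last one. This is a sketch: the file names below are hypothetical samples, and in practice the listing would come from something like `aws s3 ls s3://<bucket>/<table>/.hoodie/`:

```bash
# Hypothetical instant files; in practice, list them from the table's .hoodie directory.
printf '%s\n' \
  20231202190000123.commit \
  20231202190000123.commit.requested \
  20231202193157845.clean \
  20231202193157845.clean.requested |
  grep -Ev '\.(requested|inflight)$' |  # keep only completed instants
  sort |                                # instant times sort lexicographically
  tail -n 1                             # prints 20231202193157845.clean
```

The suffix of the last line (`clean` here) is the latest completed action in the timeline.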
**Expected behavior**
The Spark job reads the table without hanging.
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version : 2.3.9
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes
**Additional context**
I encountered no issues with Hudi 0.13.1; the problem appeared only after upgrading to Hudi 0.14.0.
With 0.14.0, tables where `commit` is the latest action in the timeline can still be read without hanging:
https://github.com/apache/hudi/assets/60864800/e8375fe0-7858-4779-b397-dc23f006c7dc
While hanging, the driver pod consistently uses a full CPU core, though I'm not sure what it is busy doing:
https://github.com/apache/hudi/assets/60864800/d11979f7-c4ac-4703-80dd-6e122152b8fb
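One way to see what the driver is spending that CPU on is to take a few thread dumps of the driver JVM. This is a sketch: the pod name and pid are placeholders, and it assumes `jps`/`jstack` are available in the container image:

```bash
# Find the driver JVM pid inside the pod, then dump its threads.
kubectl exec <driver-pod> -- jps                               # note the SparkSubmit pid
kubectl exec <driver-pod> -- jstack <pid> > driver-threads.txt
# Repeat a few times; a thread stuck in the same Hudi timeline/metadata
# frame across dumps points at where the hang is.
```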
**Stacktrace**
```
23/12/03 12:21:23 INFO HiveConf: Found configuration file
file:/opt/spark/conf/hive-site.xml
23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client
(version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse
23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata
: ap-southeast-1
23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata
: ap-southeast-1
23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period
at 10 second(s).
23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system
started
23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1
credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS
SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find
HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file
file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/12/03 12:21:28 INFO DataSourceUtils: Getting table path..
23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path :
s3:
23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path:
s3:
23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from s3:
23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from
s3:/.hoodie/hoodie.properties
23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from
s3:
23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false,
tableType is: COPY_ON_WRITE, queryType is: snapshot
23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto :
Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:28 INFO