jonathantransb opened a new issue, #10232: URL: https://github.com/apache/hudi/issues/10232
**Describe the problem you faced**

I'm attempting to read a Hudi table registered in the Glue Catalog using Spark SQL with the metadata table enabled. The job hangs indefinitely at a certain step, and even with DEBUG logging enabled I can't find any indication of the cause. Notably, the problem only occurs on Hudi tables where `clean` is the latest action in the timeline.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table where `clean` is the latest action in the timeline:

   <img width="1061" alt="image" src="https://github.com/apache/hudi/assets/60864800/813c8190-8a47-48ed-8d8c-31dc2a29ff4b">

2. Open spark-shell:

   ```bash
   spark-shell \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.sql.parquet.filterPushdown=true" \
     --conf "spark.sql.parquet.mergeSchema=false" \
     --conf "spark.speculation=false" \
     --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
     --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
     --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
     --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
     --conf "spark.sql.catalogImplementation=hive" \
     --conf "spark.sql.catalog.spark_catalog.type=hive" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
     --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
     --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar"
   ```

3. Run the query via `spark.sql()`:

   ```scala
   scala> spark.sql("SET hoodie.metadata.enable=true")
   scala> spark.sql("SELECT * FROM <schema>.<table> LIMIT 50").show()
   ```

**Expected behavior**

The Spark job can read the table without hanging.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version : 2.3.9
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes

**Additional context**

I encountered no issues with Hudi 0.13.1; the problem appeared after upgrading to 0.14.0. For tables where `commit` is the latest action in the timeline, Hudi 0.14.0 reads the table without any hanging:

<img width="1069" alt="image" src="https://github.com/apache/hudi/assets/60864800/e8375fe0-7858-4779-b397-dc23f006c7dc">

The driver pod consistently uses a full CPU core, although I'm uncertain what it is busy doing:

<img width="554" alt="image" src="https://github.com/apache/hudi/assets/60864800/d11979f7-c4ac-4703-80dd-6e122152b8fb">

**Stacktrace**

```
23/12/03 12:21:23 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse
23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system started
23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/12/03 12:21:28 INFO DataSourceUtils: Getting table path..
23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path : s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path: s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/hoodie.properties
23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://<bucket>/<schema>/<table>
23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:28 INFO TableSchemaResolver: Reading schema from s3://<bucket>/<schema>/<table>/c_day=20231130/cce9afd1-46a1-4668-b8b4-0ac697f1ed57-0_3-21-2099_20231202191150326.parquet
23/12/03 12:21:29 INFO S3AInputStream: Switching to Random IO seek policy
23/12/03 12:21:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>
23/12/03 12:21:29 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/hoodie.properties
23/12/03 12:21:29 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://<bucket>/<schema>/<table>
23/12/03 12:21:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/<schema>/<table>/.hoodie/metadata
23/12/03 12:21:29 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/<schema>/<table>/.hoodie/metadata/.hoodie/hoodie.properties
23/12/03 12:21:29 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3://<bucket>/<schema>/<table>/.hoodie/metadata
23/12/03 12:21:29 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__deltacommit__COMPLETED__20231202193207000]}
23/12/03 12:21:29 INFO AbstractTableFileSystemView: Took 2 ms to read 0 instants, 0 replaced file groups
23/12/03 12:21:30 INFO ClusteringUtils: Found 0 files in pending clustering operations
23/12/03 12:21:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
23/12/03 12:21:30 INFO BaseHoodieTableFileIndex: Refresh table stg_tracking_unified__click, spent: 365 ms
23/12/03 12:21:30 DEBUG HoodieFileIndex: Unable to compose relative partition path prefix from the predicates; falling back to fetching all partitions
23/12/03 12:21:30 INFO HoodieTableMetadataUtil: Loading latest merged file slices for metadata table partition files
23/12/03 12:21:30 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
23/12/03 12:21:30 INFO ClusteringUtils: Found 0 files in pending clustering operations
23/12/03 12:21:30 INFO AbstractTableFileSystemView: Building file system view for partition (files)
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: #files found in partition (files) =15, Time taken =26
23/12/03 12:21:30 DEBUG HoodieTableFileSystemView: Adding file-groups for partition :files, #FileGroups=1
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: addFilesToView: NumFiles=15, NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=1
23/12/03 12:21:30 DEBUG AbstractTableFileSystemView: Time to load partition (files) =40
23/12/03 12:21:30 INFO HoodieBackedTableMetadata: Opened metadata base file from s3://<bucket>/<schema>/<table>/.hoodie/metadata/files/files-0000_0-31-2210_20231201193545608001.hfile at instant 20231201193545608001 in 14 ms
23/12/03 12:21:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
[It's stuck at this point.
No further logs are printed after this]
```
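For reference, since the trigger condition is "`clean` is the latest action in the timeline", here is a rough sketch of how I check that from a listing of the `.hoodie/` folder. This is only an illustration under my assumptions about file naming (completed instants as `<instantTime>.<action>` files, pending ones carrying `.requested`/`.inflight` suffixes), not Hudi's actual API:

```python
# Sketch under an assumption about timeline file naming (not Hudi internals):
# completed instants appear as "<instantTime>.<action>", pending ones end in
# ".requested" or ".inflight". Picking the max completed instant tells us
# whether `clean` is the latest action, matching the screenshot above.

COMPLETED_ACTIONS = {"commit", "deltacommit", "clean", "replacecommit", "rollback"}

def latest_completed_action(timeline_files):
    """Return (instant_time, action) of the newest completed instant, or None."""
    completed = []
    for name in timeline_files:
        parts = name.split(".")
        # Exactly two parts and a known action => a completed instant file
        if len(parts) == 2 and parts[1] in COMPLETED_ACTIONS:
            completed.append((parts[0], parts[1]))
    # Instant times are fixed-width timestamps, so tuple max is chronological
    return max(completed) if completed else None

# Example listing mirroring the timeline shown in the first screenshot:
listing = [
    "20231202191150326.commit",
    "20231202193157845.clean.requested",
    "20231202193157845.clean.inflight",
    "20231202193157845.clean",
]
print(latest_completed_action(listing))  # ('20231202193157845', 'clean')
```

On my table this check agrees with the timeline screenshot: the hang occurs exactly when the tuple returned ends in `clean`.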