Re: [I] [SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation [hudi]

2024-02-27 Thread via GitHub


ad1happy2go commented on issue #10232:
URL: https://github.com/apache/hudi/issues/10232#issuecomment-1966891345

   @jonathantransb Are you still facing this issue? Would it be possible to hop 
on a call so we can understand this better? I tried jobs with 0.14 but wasn't 
able to reproduce any such issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation [hudi]

2023-12-07 Thread via GitHub


jonathantransb commented on issue #10232:
URL: https://github.com/apache/hudi/issues/10232#issuecomment-1845843154

   @ad1happy2go Thank you for handling this. I haven't tried the settings 
without the Glue catalog yet.





Re: [I] [SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation [hudi]

2023-12-07 Thread via GitHub


ad1happy2go commented on issue #10232:
URL: https://github.com/apache/hudi/issues/10232#issuecomment-1845569663

   @jonathantransb Thanks for raising this, and sorry for the delay here. If 
you have tried it, are you also facing this issue with OSS Hudi and without the 
Glue catalog? I will try that setup and get back to you.





[I] [SUPPORT] SparkSQL hangs indefinitely during Hudi table read operation [hudi]

2023-12-03 Thread via GitHub


jonathantransb opened a new issue, #10232:
URL: https://github.com/apache/hudi/issues/10232

   **Describe the problem you faced**
   
   I'm attempting to read a Hudi table on Glue Catalog using SparkSQL with 
metadata enabled. However, my job appears to hang indefinitely at a certain 
step. Despite enabling DEBUG logs, I'm unable to find any indications of what 
may be causing this issue. Notably, this problem only occurs with Hudi tables 
where `clean` is the latest action in the timeline.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi table where `clean` is the latest action in the timeline 
   Screenshot: https://github.com/apache/hudi/assets/60864800/813c8190-8a47-48ed-8d8c-31dc2a29ff4b
   
   2. Open spark-shell
   ```bash
   spark-shell \
   --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
   --conf "spark.sql.parquet.filterPushdown=true" \
   --conf "spark.sql.parquet.mergeSchema=false" \
   --conf "spark.speculation=false" \
   --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
   --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
   --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
   --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
   --conf "spark.sql.catalogImplementation=hive" \
   --conf "spark.sql.catalog.spark_catalog.type=hive" \
   --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
   --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
   --conf "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar"
   ```
   
   3. Run spark.sql():
   ```bash
   scala> spark.sql("SET hoodie.metadata.enable=true")
   scala> spark.sql("SELECT * FROM . LIMIT 50").show()
   ```
   
   **Expected behavior**
   
   The Spark job should read the table without hanging.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   I encountered no issues while using Hudi version 0.13.1. However, upon 
trying the new Hudi 0.14.0 version, I experienced this problem.
   
   For tables where `commit` is the latest action in the timeline, Hudi 0.14.0 
can read the table without any hanging issues.
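
   Which action is latest can also be read from the driver log instead of the 
screenshots. The following is a minimal sketch, not part of the original 
report: it assumes the log contains `HoodieActiveTimeline` lines in the form 
shown in the stack trace below (`Loaded instants upto : 
Option{val=[<instant>__<action>__<state>__...]}`), and 
`latest_timeline_action` is a hypothetical helper name.

```bash
# Hypothetical helper: reads a Spark driver log on stdin and prints the
# action (clean, commit, ...) from the last "Loaded instants upto" line.
# Assumes each log record is on a single line, i.e. not re-wrapped by email.
latest_timeline_action() {
  grep 'Loaded instants upto' \
    | tail -n 1 \
    | sed -n 's/.*\[[0-9]*__\([a-z]*\)__[A-Z]*.*/\1/p'
}

# Usage (driver.log is a placeholder file name):
#   latest_timeline_action < driver.log
```

   For the log in this report the helper would print `clean`; it prints 
nothing if no matching line is found.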
   
   Screenshot: https://github.com/apache/hudi/assets/60864800/e8375fe0-7858-4779-b397-dc23f006c7dc
   
   The driver pod consistently uses up to 1 CPU core, although I'm uncertain 
about the processes that are running:
   
   Screenshot: https://github.com/apache/hudi/assets/60864800/d11979f7-c4ac-4703-80dd-6e122152b8fb
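
   One way to see what the driver is doing while it holds a core busy is to 
match the hottest thread against a JVM thread dump. The following is a minimal 
sketch under stated assumptions, not something that was run for this report: 
it assumes `jstack` from a JDK is available inside the driver container, that 
`top -H -p <pid>` is available to list per-thread CPU usage, and that the JDK 
prints the native thread id as hex `nid=0x...` (as JDK 8, 11, and 17 do). 
`tid_to_nid` and `hot_thread_stack` are hypothetical helper names.

```bash
# Convert a decimal thread id (as listed by `top -H -p <pid>`) into the
# hex form that appears as nid=0x... in jstack output.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the jstack stanza of a single thread.
# Args: $1 = JVM pid of the driver, $2 = decimal thread id of the busy thread.
hot_thread_stack() {
  # The trailing space in the pattern keeps 0xff from matching 0xff1.
  jstack "$1" | awk -v nid="nid=$(tid_to_nid "$2") " '
    index($0, nid) { show = 1 }   # header line of the wanted thread
    show           { print }      # print header and stack frames
    show && /^$/   { exit }       # stanza ends at the first blank line
  '
}

# Usage (both ids are placeholders):
#   top -H -p <driver-pid>                 # note the thread id using the CPU
#   hot_thread_stack <driver-pid> <thread-id>
```

   Taking the dump a few times some seconds apart shows whether the same 
frames keep appearing, which helps distinguish a busy loop from slow progress.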
   
   
   **Stacktrace**
   
   ```
   23/12/03 12:21:23 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
   23/12/03 12:21:23 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/opt/spark/work-dir/spark-warehouse
   23/12/03 12:21:23 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
   23/12/03 12:21:24 INFO AWSGlueClientFactory: Using region from ec2 metadata : ap-southeast-1
   23/12/03 12:21:26 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
   23/12/03 12:21:26 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
   23/12/03 12:21:26 INFO MetricsSystemImpl: s3a-file-system metrics system started
   23/12/03 12:21:27 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
   23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   23/12/03 12:21:28 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   23/12/03 12:21:28 INFO DataSourceUtils: Getting table path..
   23/12/03 12:21:28 INFO TablePathUtils: Getting table path from path : s3:
   23/12/03 12:21:28 INFO DefaultSource: Obtained hudi table path: s3:
   23/12/03 12:21:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3:
   23/12/03 12:21:28 INFO HoodieTableConfig: Loading table properties from s3:/.hoodie/hoodie.properties
   23/12/03 12:21:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3:
   23/12/03 12:21:28 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   23/12/03 12:21:28 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20231202193157845__clean__COMPLETED__20231202193208000]}
   23/12/03 12:21:28 INFO