tpcross opened a new issue, #8584:
URL: https://github.com/apache/hudi/issues/8584

   **Describe the problem you faced**
   
   Hello, I'm having an issue with the cleaner retention policy.
   
   A Spark SQL query fails with a FileNotFoundException when querying a Hudi table that uses the cleaner policy hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS with hoodie.cleaner.hours.retained set to 12.
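
   For reference, the cleaner-related settings in use look roughly like this when passed as Spark write options (a minimal sketch in Scala; how they are wired into the actual writer job is not shown in this report):

   ```scala
   // Cleaner settings from this report; hoodie.clean.automatic (inline cleaning)
   // is assumed based on the reproduce steps below.
   val cleanerOpts = Map(
     "hoodie.cleaner.policy" -> "KEEP_LATEST_BY_HOURS",
     "hoodie.cleaner.hours.retained" -> "12",
     "hoodie.clean.automatic" -> "true"
   )
   ```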
   
   The cleaner deleted an S3 object that was still in use by the running query (the query had been running for 2.5 hours when it failed, less than the hours-retained setting). I confirmed that it was the cleaner that deleted the object by checking for the path in the .clean.requested and .clean files in the timeline.
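
   For example, the clean that removed the file can be located from spark-shell with something like the sketch below. This is only a best-effort illustration against the Hudi 0.12 timeline classes (method names should be double-checked), and the base path is a placeholder:

   ```scala
   import org.apache.hudi.common.table.HoodieTableMetaClient
   import org.apache.hudi.common.table.timeline.TimelineMetadataUtils
   import scala.collection.JavaConverters._

   // Placeholder base path; substitute the real table location.
   val basePath = "s3://bucket/path/to/table"
   val metaClient = HoodieTableMetaClient.builder()
     .setConf(spark.sparkContext.hadoopConfiguration)
     .setBasePath(basePath)
     .build()

   val fileGroupId = "994d5334-bc27-439b-89a9-3f129f658c90-0"
   val timeline = metaClient.getActiveTimeline
   // Walk the completed clean instants and print any deleted file from the file group.
   timeline.getCleanerTimeline.filterCompletedInstants.getInstants
     .iterator().asScala.foreach { instant =>
       val meta = TimelineMetadataUtils.deserializeHoodieCleanMetadata(
         timeline.getInstantDetails(instant).get())
       meta.getPartitionMetadata.asScala.foreach { case (partition, pm) =>
         pm.getSuccessDeleteFiles.asScala
           .filter(_.contains(fileGroupId))
           .foreach(f => println(s"clean ${instant.getTimestamp} deleted $partition/$f"))
       }
     }
   ```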
   
   When the query started, the file group had one file slice:
   994d5334-bc27-439b-89a9-3f129f658c90-0_12-31-2430_20221123052731868.parquet
   
   When it failed, there were two new slices:
   994d5334-bc27-439b-89a9-3f129f658c90-0_1141-82-5080_20230421072039927.parquet
   994d5334-bc27-439b-89a9-3f129f658c90-0_401-51-3162_20230421070656147.parquet
   
   How should the cleaner policy be configured to prevent this issue and keep long-running queries from failing?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. A reader starts a long-running Spark SQL query whose plan uses a file slice older than the hoodie.cleaner.hours.retained window
   2. A writer runs an upsert on the table
   3. A new file slice is written to the file group
   4. The inline clean deletes the old slice (a hedged writer-side sketch follows this list)
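
   A rough writer-side sketch of steps 2-4, assuming the table is written through the Spark datasource (record key, precombine field, and table name below are placeholders, and cleanerOpts is the option map from the sketch above):

   ```scala
   import org.apache.spark.sql.SaveMode

   // Re-write a subset of existing rows as an upsert. This produces a new file
   // slice in each touched file group and, with inline cleaning enabled, a clean
   // runs right after the commit.
   val updates = spark.read.format("hudi").load(basePath)
     .drop("_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
           "_hoodie_partition_path", "_hoodie_file_name")
     .limit(1000)

   updates.write.format("hudi")
     .options(cleanerOpts)
     .option("hoodie.table.name", "my_table")                        // placeholder
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder
     .option("hoodie.datasource.write.precombine.field", "ts")       // placeholder
     .mode(SaveMode.Append)
     .save(basePath)
   // Meanwhile the long-running Spark SQL query from step 1 still references the
   // file paths captured in its original plan; once the clean removes them, the
   // reader fails with the FileNotFoundException shown below.
   ```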
   
   **Expected behavior**
   
   The old file slice should remain in S3 for at least 12 hours after the commit at 20230421070656147 (2023-04-21 07:06:56 UTC) that superseded it, i.e. until roughly 19:07 UTC that day, so the query can finish successfully.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-amzn-0
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I was previously using the KEEP_LATEST_COMMITS policy, but that can also cause Spark SQL queries to fail when many commits land in a short time period, so I tried switching to KEEP_LATEST_BY_HOURS.
   
   The table receives irregular commits; the old commit is from 2022-11 because this error is from a test environment, but the same problem can happen in production.
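
   For comparison, the previous configuration looked roughly like this (the commits-retained value is not stated here and is only illustrative):

   ```scala
   // With commit-count-based retention, enough commits in a short window can age
   // out a slice that a still-running query depends on, hence the switch to hours.
   val previousCleanerOpts = Map(
     "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
     "hoodie.cleaner.commits.retained" -> "10"   // illustrative only
   )
   ```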
   
   
   **Stacktrace**
   
   ```
   java.io.FileNotFoundException: No such file or directory 
's3://.../994d5334-bc27-439b-89a9-3f129f658c90-0_12-31-2430_20221123052731868.parquet'
        at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:524)
        at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:617)
        at 
org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:61)
        at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
        at 
org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.footerFileMetaData$lzycompute$1(Spark32PlusHoodieParquetFileFormat.scala:161)
        at 
org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.footerFileMetaData$1(Spark32PlusHoodieParquetFileFormat.scala:160)
        at 
org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32PlusHoodieParquetFileFormat.scala:164)
        at 
org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67)
        at 
org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$1(HoodieBaseRelation.scala:590)
        at 
org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:652)
        at 
org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:121)
   ```
   
   

