tpcross opened a new issue, #8584: URL: https://github.com/apache/hudi/issues/8584
**Describe the problem you faced**

Hello, I'm having an issue with the cleaning retention policy: a Spark SQL query fails with `FileNotFoundException` when querying a Hudi table configured with the cleaner policy `hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS` and `hoodie.cleaner.hours.retained` set to 12.

The cleaner deletes an S3 object that is still in use by the running query (the query had been running for 2.5 hours when it failed, less than the hours-retained setting). I've confirmed it was the cleaner that deleted the object by checking the path in the `.clean.requested` and `.clean` files in the timeline.

When the query started, the file group had one slice:
* `994d5334-bc27-439b-89a9-3f129f658c90-0_12-31-2430_20221123052731868.parquet`

When it failed, there were two new slices:
* `994d5334-bc27-439b-89a9-3f129f658c90-0_1141-82-5080_20230421072039927.parquet`
* `994d5334-bc27-439b-89a9-3f129f658c90-0_401-51-3162_20230421070656147.parquet`

How can the cleaner policy be set to prevent this issue and keep the query running?

**To Reproduce**

Steps to reproduce the behavior:

1. Reader starts running a Spark SQL query; the query plan uses a slice older than the `hoodie.cleaner.hours.retained` setting
2. Writer runs an upsert on the table
3. A new slice is written to the file group
4. Inline clean deletes the old slice

**Expected behavior**

The old file slice should remain in S3 for at least 12 hours after the commit at `20230421070656147` so the query can finish successfully.

**Environment Description**

* Hudi version : 0.12.1-amzn-0
* Spark version : 3.3.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

I was previously using the `KEEP_LATEST_COMMITS` policy, but that can also cause Spark SQL queries to fail when there are many commits in a short time period, so I tried changing to `KEEP_LATEST_BY_HOURS`.

The table has irregular commits. The old commit is from 2022-11 because the error is from a test environment, but the same problem can happen in production.
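For reference, a minimal PySpark sketch of the writer configuration described above. Only the two `hoodie.cleaner.*` options come from this report; the table name, key fields, and write path are hypothetical placeholders:

```python
# Sketch of the cleaner configuration from this issue (not the reporter's
# actual job). Table name, record key, precombine field, and path are
# hypothetical; the two hoodie.cleaner.* settings are the ones in question.
hudi_options = {
    "hoodie.table.name": "example_table",                # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",     # hypothetical
    "hoodie.datasource.write.precombine.field": "ts",    # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    # Cleaner settings from this issue: retain file slices for 12 hours
    # after they are superseded, rather than by commit count.
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "12",
}

# Typical usage in a Spark writer job (path is hypothetical):
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://bucket/example_table")
```

The failure reported here suggests the retention window is measured relative to commit times on the timeline rather than to when a long-running reader opened its snapshot, so a sufficiently long query can still outlive the retained slices.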
**Stacktrace**

```
java.io.FileNotFoundException: No such file or directory 's3://.../994d5334-bc27-439b-89a9-3f129f658c90-0_12-31-2430_20221123052731868.parquet'
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:524)
  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:617)
  at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:61)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
  at org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.footerFileMetaData$lzycompute$1(Spark32PlusHoodieParquetFileFormat.scala:161)
  at org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.footerFileMetaData$1(Spark32PlusHoodieParquetFileFormat.scala:160)
  at org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32PlusHoodieParquetFileFormat.scala:164)
  at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67)
  at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$1(HoodieBaseRelation.scala:590)
  at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:652)
  at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:121)
```