SabyasachiDasTR opened a new issue, #7600:
URL: https://github.com/apache/hudi/issues/7600

   **Describe the problem you faced**
   
   We are incrementally upserting data into our Hudi tables every 5 minutes. We have set CLEANER_POLICY to KEEP_LATEST_BY_HOURS with CLEANER_HOURS_RETAINED = 48.
   
   However, old delta log files from 2 months back are still present in our partitions, and the CLI shows that the last clean ran 2 months ago, in November. We do not see any clean action being performed on the old log files. The only operation we execute is upsert, with a single writer, and compaction runs every hour.
   We believe this is causing our EMR job to underperform and crash multiple times, because a very large number of delta log files pile up in the partitions and compaction has to read them all while processing the job.
   
   ![MicrosoftTeams-image 
(33)](https://user-images.githubusercontent.com/52735405/210500715-89227935-b74a-418a-9701-5b783c56a74e.png)
   
   **Options used during Upsert:**
   
![HudiOptionsLatest](https://user-images.githubusercontent.com/52735405/210503366-77d47c7c-169f-4a87-8234-0971079a9347.PNG)
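   Since the options are only shown as a screenshot, here is a minimal sketch of the cleaner-related write options described above, as a PySpark options dictionary. Only the cleaner policy and retention values come from the issue text; the table name, record key, precombine field, and partition field are placeholders, not a transcription of the screenshot:
   
   ```python
   # Hypothetical Hudi write options matching the description in this issue:
   # MOR table, upsert every 5 minutes, KEEP_LATEST_BY_HOURS with 48 hours retained.
   hudi_options = {
       "hoodie.table.name": "my_table",                      # placeholder
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.recordkey.field": "id",      # placeholder
       "hoodie.datasource.write.precombine.field": "ts",     # placeholder
       "hoodie.datasource.write.partitionpath.field": "dt",  # placeholder
       # Cleaner settings from the issue text:
       "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
       "hoodie.cleaner.hours.retained": "48",
   }
   ```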
   
   **Writing to s3**
   
![Upsertcmd](https://user-images.githubusercontent.com/52735405/210501558-28eb3712-fed8-4c93-9c85-ccb6ef3521dc.PNG)
   Partition structure: s3://bucket/table/partition/, containing parquet and .log files
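   The write call is also shown only as a screenshot; a sketch of the shape of the upsert, assuming the hypothetical options above and a placeholder base path:
   
   ```python
   # Incremental upsert of a micro-batch DataFrame `df` to the S3 base path.
   (df.write.format("hudi")
      .options(**hudi_options)
      .mode("append")              # append mode so each 5-minute batch upserts
      .save("s3://bucket/table"))  # partitions under this path hold parquet + .log files
   ```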
   
   **Expected behavior**
   As per my understanding, log files older than CLEANER_HOURS_RETAINED (48 hours, i.e. 2 days) should be deleted.
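   The last-clean timestamp mentioned above was observed via the CLI; the clean history can be listed from hudi-cli like this (a sketch; the base path is a placeholder):
   
   ```
   connect --path s3://bucket/table
   cleans show
   ```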
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.2.1
   
   * Hive version : Hive not installed on the EMR cluster (emr-6.7.0)
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : No
   

