hussein-awala opened a new pull request, #7041:
URL: https://github.com/apache/hudi/pull/7041

   ### Change Logs
   
   When the clean planner lists the files in the partitions and finds no file 
to delete, the clean operation is skipped without creating a commit. On the 
next clean, if incremental cleaning mode is enabled, the clean planner finds no 
information about the commits that were already checked, so it rechecks all the 
files. This PR creates a clean commit that contains `earliestCommitToRetain` 
regardless of whether the list of deleted files is empty. As a result, the 
clean planner checks only the partitions that have changed since the 
`earliestCommitToRetain` recorded in the last clean commit.
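
   The behaviour described above can be illustrated with a small, simplified 
model. This is only a sketch and not Hudi's actual code: `CleanCommit`, 
`partitions_to_check`, `finish_clean` and the `commit_when_empty` flag are 
hypothetical names introduced for illustration, and instants are assumed to be 
fixed-width strings that sort chronologically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class CleanCommit:
    """Hypothetical stand-in for the metadata stored in a clean commit."""
    earliest_commit_to_retain: str
    deleted_files: List[str] = field(default_factory=list)


def partitions_to_check(
    all_partitions: Set[str],
    partitions_by_commit: Dict[str, Set[str]],
    last_clean: Optional[CleanCommit],
) -> Set[str]:
    """Return the partitions the planner has to list (simplified model)."""
    if last_clean is None:
        # No previous clean commit: nothing is known, so list everything.
        return set(all_partitions)
    changed: Set[str] = set()
    for instant, partitions in partitions_by_commit.items():
        # Assumption: instants are fixed-width strings, so string
        # comparison orders them chronologically.
        if instant >= last_clean.earliest_commit_to_retain:
            changed |= partitions
    return changed


def finish_clean(
    earliest_commit_to_retain: str,
    files_to_delete: List[str],
    commit_when_empty: bool,
) -> Optional[CleanCommit]:
    """Model the end of a clean run; commit_when_empty=True mirrors this PR."""
    if not files_to_delete and not commit_when_empty:
        # Previous behaviour: the clean is skipped and nothing is recorded.
        return None
    return CleanCommit(earliest_commit_to_retain, list(files_to_delete))


all_partitions = {"2022/01", "2022/02", "2022/10"}
partitions_by_commit = {
    "001": {"2022/01"},
    "002": {"2022/02"},
    "003": {"2022/10"},
}

before = finish_clean("003", [], commit_when_empty=False)
after = finish_clean("003", [], commit_when_empty=True)

# Without a recorded clean commit, every partition is listed again.
print(sorted(partitions_to_check(all_partitions, partitions_by_commit, before)))
# With the empty clean commit, only the recently changed partition is listed.
print(sorted(partitions_to_check(all_partitions, partitions_by_commit, after)))
```

   In this toy example the first call lists all three partitions, while the 
second lists only `2022/10`.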
   
   ### Impact
   
   A new clean commit is added to the timeline even when no files were 
actually cleaned. The benefit is a large performance improvement (and a lower 
S3 listing cost) for clean operations on tables whose old partitions are seldom 
changed.
   
   ### Risk level (write none, low, medium or high below)
   
   low:
   The risk level is low because these changes affect only clean plans that 
have no files to delete. I kept the checks on empty commit files to avoid the 
Avro empty-file exception, and I improved the method that cleans up these empty 
files. If for some reason an empty Avro file exists, the planner falls back to 
a brute-force approach to prepare the clean plan.
   I will test these changes on our project within the week to make sure 
everything is fine.
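
   The fallback for an empty commit file can be sketched as follows. This is a 
minimal illustration under stated assumptions, not the code in this PR: 
`read_last_earliest_commit` and `planning_mode` are hypothetical helpers, the 
metadata is stored as JSON here instead of Avro to keep the example 
self-contained, and "brute force" is assumed to mean a full listing.

```python
import json
import os
from typing import Optional


def read_last_earliest_commit(path: str) -> Optional[str]:
    """Return the recorded earliestCommitToRetain, or None if unavailable."""
    # A missing or zero-length file carries no usable information, so it is
    # treated the same way as "no previous clean commit".
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle).get("earliestCommitToRetain")


def planning_mode(path: str) -> str:
    """Pick 'incremental' when the last clean is readable, else 'full_listing'."""
    if read_last_earliest_commit(path) is not None:
        return "incremental"
    return "full_listing"
```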
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
     ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
     changes to the website._
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   


