nsivabalan opened a new pull request, #6561:
URL: https://github.com/apache/hudi/pull/6561

   ### Change Logs
   
   Clustering was being triggered twice. The write statuses are not cached, 
and a validation step calls `isEmpty` on the `JavaRDD<WriteStatus>`, which 
recomputes the RDD and so runs clustering a second time. 
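
   The recomputation described above can be illustrated with a plain-Java 
analogy. This is only a sketch, not Hudi or Spark code: the `Lazy` class, 
`runsOfClustering`, and the status strings are hypothetical stand-ins for an 
uncached RDD, the clustering job, and its write statuses. It assumes that each 
action on an uncached lazy collection re-runs the producer.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Plain-Java analogy of a lazy, uncached pipeline; not Hudi or Spark code. */
class LazyPipelineSketch {

    /** Hypothetical stand-in for an RDD: without cache(), every action re-runs the producer. */
    static final class Lazy<T> {
        private final Supplier<List<T>> producer;
        private boolean cacheEnabled;
        private List<T> cached; // stays null until a cached result has been materialized

        Lazy(Supplier<List<T>> producer) {
            this.producer = producer;
        }

        Lazy<T> cache() {
            cacheEnabled = true;
            return this;
        }

        private List<T> compute() {
            if (!cacheEnabled) {
                return producer.get(); // recomputed on every action
            }
            if (cached == null) {
                cached = producer.get(); // computed once, then reused
            }
            return cached;
        }

        boolean isEmpty() {
            return compute().isEmpty();
        }

        List<T> collect() {
            return compute();
        }
    }

    /** Returns how many times the (simulated) clustering job ran. */
    static int runsOfClustering(boolean useCache) {
        AtomicInteger runs = new AtomicInteger();
        Lazy<String> writeStatuses = new Lazy<>(() -> {
            runs.incrementAndGet(); // simulates clustering writing its files
            return List.of("status-1", "status-2");
        });
        if (useCache) {
            writeStatuses.cache();
        }
        // Validation action, analogous to the isEmpty check on the write statuses.
        if (writeStatuses.isEmpty()) {
            throw new IllegalStateException("no write statuses produced");
        }
        // Second action, analogous to collecting statuses for the commit metadata.
        writeStatuses.collect();
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + runsOfClustering(false)); // prints 2
        System.out.println("with cache: " + runsOfClustering(true));     // prints 1
    }
}
```

   In this sketch, the validation check alone is enough to double the work 
when nothing is cached, which mirrors the behavior reported here.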
   
   ### Impact
   
   Could improve clustering performance, since the clustering job no longer 
runs twice. 
   
   **Risk level: medium**
   
   Without the fix, clustering can be triggered twice, but only one set of 
files is included in the final commit metadata. The duplicate copy is deleted 
during the marker reconciliation step. 
   
   Test/Verification:
   Manually verified that without the fix, markers are created twice (the two 
files differ only in their write token) and the later reconciliation step 
deletes one of them. With the fix, there are no such duplicates: only one file 
is created for clustering, and nothing gets deleted during reconciliation. 
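
   The reconciliation outcome observed above can be sketched as a set 
difference. This is a simplified illustration, not Hudi's implementation: it 
assumes reconciliation amounts to deleting files that have a marker but are 
absent from the commit metadata. The class, the method, and the file names 
(which differ only in a made-up write token) are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Simplified illustration of marker reconciliation; not Hudi's implementation. */
class MarkerReconciliationSketch {

    /** Files that have a marker but are not referenced by the commit metadata. */
    static Set<String> filesToDelete(Set<String> markedFiles, Set<String> committedFiles) {
        Set<String> orphans = new TreeSet<>(markedFiles);
        orphans.removeAll(committedFiles);
        return orphans;
    }

    public static void main(String[] args) {
        // Hypothetical file names: same file ID and instant, different write token.
        Set<String> marked = Set.of("f1_0-1-1_001.parquet", "f1_0-2-2_001.parquet");
        Set<String> committed = Set.of("f1_0-2-2_001.parquet");

        // Without the fix: the duplicate from the extra run is removed.
        System.out.println(filesToDelete(marked, committed));

        // With the fix: only one file has a marker, so nothing is removed.
        System.out.println(filesToDelete(committed, committed));
    }
}
```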
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

