jonvex opened a new pull request, #7413:
URL: https://github.com/apache/hudi/pull/7413

   ### Change Logs
   
   Currently, all of the Custom Bulk Insert ColumnSortPartitioner 
implementations incorrectly return "true" from the "arePartitionRecordsSorted" 
method, even though the records are not necessarily sorted by the 
partition-path columns, as this method requires. 
   
   I fixed the implementations to return true only if the list of sort column 
names starts with the partition-path column name.
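
   The check described above can be sketched roughly as follows. This is a 
minimal illustration, not the code in this PR: the class name 
`SortColumnCheck`, the helper `isPartitionPathFirst`, and the example column 
names are hypothetical, and it assumes a single partition-path column.

```java
public class SortColumnCheck {

    // Hypothetical helper (not Hudi's actual API): returns true only when the
    // first sort column is the partition-path column, so that records sorted
    // by these columns are also grouped by partition path.
    static boolean isPartitionPathFirst(String[] sortColumnNames, String partitionPathField) {
        if (sortColumnNames == null || sortColumnNames.length == 0 || partitionPathField == null) {
            return false;
        }
        return sortColumnNames[0].trim().equals(partitionPathField.trim());
    }

    public static void main(String[] args) {
        // Partition path leads the sort columns: safe to report "sorted".
        System.out.println(isPartitionPathFirst(new String[] {"region", "ts"}, "region"));
        // Partition path is not first: records of one partition may be scattered.
        System.out.println(isPartitionPathFirst(new String[] {"ts", "region"}, "region"));
    }
}
```

   A partitioner's "arePartitionRecordsSorted" could then return the result of 
such a check instead of a hard-coded "true".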
   
   
   ### Impact
   
   When these partitioners are used and the sort column names don't start 
with the partitionPath, the current implementation can close Parquet writers 
prematurely while writing files, creating a large number of small files. This 
fix prevents that. 
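
   To make the small-file effect concrete, here is a toy model, assuming a 
writer that trusts the "sorted" flag and closes its current file whenever the 
partition path of the incoming record changes. The class 
`FileHandleSimulation` and the method `countFilesOpened` are hypothetical and 
only illustrate the mechanism; they are not Hudi code.

```java
import java.util.Arrays;
import java.util.List;

public class FileHandleSimulation {

    // Counts how many files would be opened if the writer closed the current
    // file and opened a new one every time the partition path changes.
    static int countFilesOpened(List<String> partitionPathsInWriteOrder) {
        int filesOpened = 0;
        String currentPartition = null;
        for (String partitionPath : partitionPathsInWriteOrder) {
            if (!partitionPath.equals(currentPartition)) {
                filesOpened++;
                currentPartition = partitionPath;
            }
        }
        return filesOpened;
    }

    public static void main(String[] args) {
        // Records grouped by partition path: one file per partition.
        List<String> grouped = Arrays.asList("a", "a", "a", "b", "b", "b");
        // Records sorted by some other column: partitions interleave.
        List<String> interleaved = Arrays.asList("a", "b", "a", "b", "a", "b");

        System.out.println(countFilesOpened(grouped));     // prints 2
        System.out.println(countFilesOpened(interleaved)); // prints 6
    }
}
```

   In this model the same six records produce two files when grouped by 
partition path and six files when the partitions interleave.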
   
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   ### Documentation Update
   
   The documentation for "hoodie.clustering.plan.strategy.sort.columns" may 
need to be updated to explain this, along with any other configs that are used 
to set the sort ordering.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
