voonhous commented on PR #7159:
URL: https://github.com/apache/hudi/pull/7159#issuecomment-1370582777

   # Issue
   Clustering is performed even for inputGroups that contain only 1 
fileSlice, which can cause unnecessary file rewrites and write amplification 
when no column sorting is required.
   
   # Edge cases
   Correct me if I'm wrong, but the changes here do not fully fix the issue 
of clustering inputGroups with only 1 fileSlice.
   
   I may have missed some scenarios, but off the top of my head I can only 
think of these 3:
   
   1. No sorting required
   2. Sorting required; column has not been sorted (replacecommit/clustering 
not performed yet)
   3. Sorting required; column has already been sorted 
(replacecommit/clustering has been performed)
   
   While this fix resolves case (1), it cannot differentiate between 
cases (2) and (3).
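
   To make the three cases concrete, here is a minimal sketch of the decision 
a single-fileSlice check would ideally make. It is not Hudi code: the class, 
the `SortState` enum, and `shouldRewrite` are hypothetical names, and treating 
every inputGroup with more than 1 fileSlice as worth rewriting is an 
assumption made only to keep the example small.

```java
import java.util.List;

// Hypothetical sketch, not Hudi code: all names here are illustrative only.
final class SingleSliceClusteringSketch {

    // Whether a fileSlice is known to be sorted on the requested columns.
    enum SortState { SORTED, NOT_SORTED, UNKNOWN }

    static boolean shouldRewrite(int numFileSlices, List<String> sortColumns, SortState state) {
        if (numFileSlices > 1) {
            // Assumption: groups with several fileSlices are always clustered.
            return true;
        }
        if (sortColumns.isEmpty()) {
            // Case (1): no sorting required, so a rewrite gains nothing.
            return false;
        }
        // Cases (2) and (3): only distinguishable when the sort state is known.
        // With UNKNOWN we fall back to rewriting, which is the wasted work in case (3).
        return state != SortState.SORTED;
    }
}
```

   Without a reliable source for `SortState`, cases (2) and (3) both end up as 
`UNKNOWN` and are rewritten.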
   
   As such, if a parquet file is already sorted on the required columns, it 
will still be rewritten unnecessarily.
   
   I am not sure if there is any way around this issue other than reading the 
relevant replacecommit files (if they have not been archived) to check whether 
a sort operation has already been performed.
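
   A rough sketch of what such a lookup could look like, assuming the file IDs 
written by a clustering run and the columns it sorted by can be recovered from 
each unarchived replacecommit. `ReplaceCommitSketch` and `alreadySortedBy` are 
hypothetical and do not correspond to Hudi classes. The sketch also ignores 
writes that land in the file group after clustering and may break the sort 
order.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Hudi code: this type stands in for whatever
// metadata can be read from a replacecommit that has not been archived.
final class ReplaceCommitSketch {
    final String instantTime;
    final Set<String> outputFileIds;  // file IDs written by this clustering run
    final List<String> sortColumns;   // columns the run sorted by; empty if none

    ReplaceCommitSketch(String instantTime, Set<String> outputFileIds, List<String> sortColumns) {
        this.instantTime = instantTime;
        this.outputFileIds = outputFileIds;
        this.sortColumns = sortColumns;
    }

    // Returns true only if some visible replacecommit produced this file
    // with exactly the required sort columns, in the same order.
    static boolean alreadySortedBy(String fileId,
                                   List<String> requiredSortColumns,
                                   List<ReplaceCommitSketch> activeReplaceCommits) {
        for (ReplaceCommitSketch commit : activeReplaceCommits) {
            if (commit.outputFileIds.contains(fileId)
                    && commit.sortColumns.equals(requiredSortColumns)) {
                return true;
            }
        }
        // Not found, possibly because the commit was archived: treat as not sorted.
        return false;
    }
}
```

   When the replacecommit has been archived, the lookup returns `false` and 
the file is rewritten as it is today.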


