zhuanshenbsj1 commented on PR #7159:
URL: https://github.com/apache/hudi/pull/7159#issuecomment-1385053453

   > # Issue
   > Issue at hand: Clustering will be performed for inputGroups with only 1 
fileSlice, which may cause unnecessary file rewrites and write amplification 
when no column sorting is required.
   > 
   > # Edge cases
   > CMIIW, the changes here do not fully fix the issue of clustering 
inputGroups with only 1 fileSlice.
   > 
   > I am not sure if I have missed any scenarios; off the top of my head, I 
can only think of these 3 scenarios.
   > 
   > 1. No sorting required
   > 2. Sorting required; column has not been sorted (replacecommit/clustering 
not performed yet)
   > 3. Sorting required; column has already been sorted 
(replacecommit/clustering has been performed)
   > 
   > While this fix is able to fix the issue for case (1), it is not able to 
differentiate between the cases (2) and (3).
   > 
   > As such, if a parquet file has the required columns that are already 
sorted, an unnecessary rewrite will be performed again.
   > 
   > I am not sure if there is any way around this issue other than reading 
the required replacecommit files (if they are not archived) to check if a sort 
operation has been performed.
   
   In case (3), as you say, if we decide to check clustered files we would 
also need to check whether sort.columns has changed (e.g. column_a,column_b -> 
column_a,column_c). We would need to add a new flag to each clustered file when 
a clustering instant finishes, and check this flag on every file when 
generating a new clustering plan. That would make the check too expensive.
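
To make the three cases concrete, here is a rough sketch of the decision such a flag would enable. This is not Hudi code: `FileSlice`, `sorted_by`, and `needs_rewrite` are hypothetical names, and the sketch assumes the flag stores the exact sort columns used by the last clustering. The comparison itself is cheap; the cost I am worried about is reading the flag from every file while generating the plan.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FileSlice:
    """Hypothetical stand-in for a file slice in a clustering input group."""
    file_id: str
    # Hypothetical per-file flag: the sort columns recorded when this file
    # was last clustered, or None if it has never been clustered.
    sorted_by: Optional[Tuple[str, ...]] = None


def needs_rewrite(group: Sequence[FileSlice], sort_columns: Sequence[str]) -> bool:
    """Decide whether an input group should be included in a clustering plan."""
    if not group:
        return False  # nothing to rewrite
    if len(group) > 1:
        return True   # several slices: clustering merges them
    if not sort_columns:
        return False  # case (1): single slice, no sorting required
    # Cases (2) and (3): rewrite unless the recorded columns match exactly.
    # A changed sort.columns (column_a,column_b -> column_a,column_c)
    # also triggers a rewrite.
    return group[0].sorted_by != tuple(sort_columns)
```

For example, a single slice recorded as sorted by `("column_a", "column_b")` would be skipped while sort.columns stays `column_a,column_b`, and rewritten once it changes to `column_a,column_c`.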

