zhuanshenbsj1 commented on PR #7159: URL: https://github.com/apache/hudi/pull/7159#issuecomment-1385053453
> # Issue
> Issue at hand: Clustering will be performed for inputGroups with only 1 fileSlice, which may cause unnecessary file re-writes and write amplification should there be no column sorting required.
>
> # Edge cases
> CMIIW, the changes here do not fully fix the issue of clustering inputGroups with only 1 fileSlice.
>
> I am not sure if I have missed any scenarios; off the top of my head, I can only think of these 3 scenarios.
>
> 1. No sorting required
> 2. Sorting required; column has not been sorted (replacecommit/clustering not performed yet)
> 3. Sorting required; column has already been sorted (replacecommit/clustering has been performed)
>
> While this fix is able to fix the issue for case (1), it is not able to differentiate between cases (2) and (3).
>
> As such, if a parquet file has the required columns already sorted, an unnecessary rewrite will be performed again.
>
> I am not sure if there is any way around this issue other than reading the required replacecommit files (if they are not archived) to check if a sort operation has been performed.

In case (3), as you say, if we want to check already-clustered files we would also need to check whether sort.columns has changed (e.g. column_a,column_b -> column_a,column_c). That means writing a new flag into every clustered file when a clustering instant finishes, and checking that flag on every file when generating a new clustering plan. This would make the check too expensive.
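
To make the trade-off concrete, here is a minimal sketch of the per-file flag check described above, written in Python for brevity. `FileSlice`, `sorted_by`, and `needs_clustering` are hypothetical names for illustration only; they are not Hudi classes or APIs, and the sketch assumes the flag has already been loaded for each file.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice; not Hudi's FileSlice class."""
    file_id: str
    # Hypothetical per-file flag: the sort columns used by the clustering
    # instant that wrote this file, or None if it was never clustered.
    sorted_by: Optional[Tuple[str, ...]] = None


def needs_clustering(group: Sequence[FileSlice], sort_columns: Sequence[str]) -> bool:
    """Decide whether an input group belongs in a new clustering plan."""
    if not group:
        # Nothing to rewrite.
        return False
    if len(group) > 1:
        # Several file slices can still be merged, so keep the group.
        return True
    if not sort_columns:
        # Case (1): single file slice and no sorting required.
        return False
    # Cases (2) and (3): rewrite unless the file is already sorted by
    # exactly the currently configured columns, in the same order.
    return group[0].sorted_by != tuple(sort_columns)


never_clustered = [FileSlice("f1")]
already_sorted = [FileSlice("f2", sorted_by=("column_a", "column_b"))]

print(needs_clustering(never_clustered, []))                          # case (1)
print(needs_clustering(never_clustered, ["column_a", "column_b"]))    # case (2)
print(needs_clustering(already_sorted, ["column_a", "column_b"]))     # case (3), unchanged columns
print(needs_clustering(already_sorted, ["column_a", "column_c"]))     # case (3), changed columns
```

The comparison itself is trivial; the expensive part is what the sketch takes for granted, namely persisting `sorted_by` for every clustered file and reading it back for every candidate file each time a plan is generated.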