voonhous commented on PR #7159: URL: https://github.com/apache/hudi/pull/7159#issuecomment-1370582777
# Issue

Clustering is performed for inputGroups that contain only 1 fileSlice. When no column sorting is required, this causes unnecessary file rewrites and write amplification.

# Edge cases

CMIIW, but the changes here do not fully fix the clustering of inputGroups with only 1 fileSlice. I may have missed some scenarios; off the top of my head, I can only think of these 3:

1. No sorting required
2. Sorting required; column has not been sorted (replacecommit/clustering not performed yet)
3. Sorting required; column has already been sorted (replacecommit/clustering has been performed)

While this fix handles case (1), it cannot differentiate between cases (2) and (3). As such, if a parquet file's required columns are already sorted, an unnecessary rewrite will be performed again.

I am not sure if there is any way around this other than reading the relevant replacecommit files (if they have not been archived) to check whether a sort operation has been performed.
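
To make the three cases concrete, here is a rough sketch of the decision for an inputGroup with a single fileSlice. Everything in it is hypothetical: `SingleSliceClusteringSketch`, `shouldRewrite`, and the `knownSortOrder` argument are illustration-only names, not Hudi APIs, and it assumes the previous sort order could somehow be recovered (for example from an unarchived replacecommit). An empty `knownSortOrder` means the order is unknown.

```java
import java.util.List;

// Hypothetical sketch only; none of these names exist in Hudi.
final class SingleSliceClusteringSketch {

    /**
     * Decides whether an inputGroup with exactly one fileSlice should be rewritten.
     *
     * @param requiredSortColumns columns the clustering plan wants to sort by (empty = no sorting)
     * @param knownSortOrder      columns the file is known to be sorted by already
     *                            (empty = unknown, e.g. the replacecommit was archived)
     */
    static boolean shouldRewrite(List<String> requiredSortColumns, List<String> knownSortOrder) {
        // Case 1: no sorting required, so rewriting a single file gains nothing.
        if (requiredSortColumns.isEmpty()) {
            return false;
        }
        // Case 3: sorting required, and the file is known to be sorted on the same columns.
        if (requiredSortColumns.equals(knownSortOrder)) {
            return false;
        }
        // Case 2, or unknown history: sorting is required and cannot be proven done.
        return true;
    }
}
```

Without a reliable source for `knownSortOrder`, cases (2) and (3) both fall through to the last branch, which is the limitation described above.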