[GitHub] [hudi] SteNicholas commented on a diff in pull request #7159: [HUDI-5173]Skip if there is only one file in clusteringGroup

GitBox Thu, 17 Nov 2022 08:49:29 -0800


SteNicholas commented on code in PR #7159:
URL: https://github.com/apache/hudi/pull/7159#discussion_r1025443367



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/FlinkSizeBasedClusteringPlanStrategy.java:
##########
@@ -70,9 +70,11 @@ protected Stream<HoodieClusteringGroup> buildClusteringGroupsForPartition(String
       // check if max size is reached and create new group, if needed.
       // in now, every clustering group out put is 1 file group.
       if (totalSizeSoFar >= writeConfig.getClusteringTargetFileMaxBytes() && !currentGroup.isEmpty()) {
-        LOG.info("Adding one clustering group " + totalSizeSoFar + " max bytes: "
-            + writeConfig.getClusteringMaxBytesInGroup() + " num input slices: " + currentGroup.size());
-        fileSliceGroups.add(Pair.of(currentGroup, 1));
+        if (currentGroup.size() > 1 || (!StringUtils.isNullOrEmpty(writeConfig.getClusteringSortColumns()) && currentGroup.size() == 1)) {

Review Comment:
   If the single file has already been clustered, there is no need to add it to
`fileSliceGroups`. IMO, the check for whether a group in `fileSliceGroups`
contains only one already-clustered file could move out of
`buildClusteringGroupsForPartition`, so that this change doesn't modify the
plan strategy of the Flink and Spark engines.
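
   To illustrate the suggestion, here is a minimal sketch of such a check living outside the strategy, as a post-filter over the groups the strategy returns. `SingleFileGroupFilter`, `shouldKeep`, and `filter` are hypothetical names and not part of Hudi; plain lists stand in for file slice groups, and the condition from this diff (more than one file, or sort columns configured) is used as an assumed stand-in for "not already clustered", which this sketch cannot detect on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, NOT part of Hudi: filters clustering groups after the
// plan strategy has built them, so the strategy itself stays unchanged.
final class SingleFileGroupFilter {

  private SingleFileGroupFilter() {
  }

  // Assumption: a group is worth clustering if it merges several files, or if
  // it holds one file and sort columns are configured (the condition in the diff).
  static <T> boolean shouldKeep(List<T> group, String sortColumns) {
    if (group == null || group.isEmpty()) {
      return false;
    }
    if (group.size() > 1) {
      return true;
    }
    return sortColumns != null && !sortColumns.isEmpty();
  }

  // Returns a new list containing only the groups that pass shouldKeep.
  static <T> List<List<T>> filter(List<List<T>> groups, String sortColumns) {
    return groups.stream()
        .filter(group -> shouldKeep(group, sortColumns))
        .collect(Collectors.toCollection(ArrayList::new));
  }

  public static void main(String[] args) {
    List<List<String>> groups = Arrays.asList(
        Arrays.asList("file-1"),
        Arrays.asList("file-2", "file-3"));
    System.out.println("kept without sort columns: " + filter(groups, "").size());
    System.out.println("kept with sort columns: " + filter(groups, "ts").size());
  }
}
```

   With a filter of this shape applied where the plan is assembled, the per-partition grouping logic would not need the extra `if`, and both engines would share the same rule.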



