[GitHub] [hudi] satishkotha commented on a change in pull request #2275: [HUDI-1354] Block updates and replace on file groups in clustering

GitBox Mon, 30 Nov 2020 22:52:48 -0800


satishkotha commented on a change in pull request #2275:
URL: https://github.com/apache/hudi/pull/2275#discussion_r533106354




##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clustering/update/UpdateStrategy.java
##########
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.clustering.update;
+
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.table.WorkloadProfile;
+
+import java.util.List;
+

Review comment:
       Good idea to add this interface to keep it generic. Please add Javadoc.
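As a rough illustration of the Javadoc being requested, here is a minimal sketch. The interface name `UpdateStrategySketch` and the type parameters `G` and `P` are hypothetical stand-ins (for the file-group/instant pair and the workload profile respectively), so this is not the PR's actual `UpdateStrategy` and the wording is only a suggestion.

```java
import java.util.List;

/**
 * Decides how a write should treat file groups that have a pending clustering plan.
 *
 * <p>Sketch only. {@code G} stands in for the (file group id, instant) pair and
 * {@code P} for the workload profile; neither is the real Hudi type.
 *
 * @param <G> description of a file group in pending clustering
 * @param <P> summary of the incoming inserts and updates
 */
interface UpdateStrategySketch<G, P> {

  /**
   * Checks the incoming workload against the file groups in pending clustering.
   *
   * @param fileGroupsInPendingClustering file groups covered by a pending clustering plan
   * @param workloadProfile               profile of the incoming write
   * @throws RuntimeException if the implementation decides to reject the write
   */
  void apply(List<G> fileGroupsInPendingClustering, P workloadProfile);
}
```

Documenting the failure mode on `apply` matters most here, since a rejecting implementation and a permissive one share the same signature.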

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
##########
@@ -103,6 +104,9 @@ public BaseSparkCommitActionExecutor(HoodieEngineContext context,
     if (isWorkloadProfileNeeded()) {

Review comment:
       What happens if the workload profile is not needed? Is there a better
place to do this validation? Could we do it after tagLocation is done? If any
record's location points to a file group in pending clustering, we can throw an
error.
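To make the suggestion concrete, here is a minimal sketch of a check that runs on already-tagged records rather than on the workload profile. `TaggedRecord`, `key`, and `validate` are hypothetical stand-ins invented for this illustration; they are not Hudi APIs, and the real check would operate on the tagged record RDD.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PendingClusteringCheckSketch {

  /** Hypothetical stand-in for a record after tagLocation has run. */
  static final class TaggedRecord {
    final String recordKey;
    final String partitionPath;
    final String currentFileId; // null means the record is an insert with no existing location

    TaggedRecord(String recordKey, String partitionPath, String currentFileId) {
      this.recordKey = recordKey;
      this.partitionPath = partitionPath;
      this.currentFileId = currentFileId;
    }
  }

  /** Builds a (partitionPath, fileId) key with value-based equals and hashCode. */
  static Map.Entry<String, String> key(String partitionPath, String fileId) {
    return new AbstractMap.SimpleImmutableEntry<>(partitionPath, fileId);
  }

  /** Throws if any update targets a file group that is in pending clustering. */
  static void validate(List<TaggedRecord> taggedRecords,
                       Set<Map.Entry<String, String>> pendingClustering) {
    for (TaggedRecord record : taggedRecords) {
      if (record.currentFileId == null) {
        continue; // inserts have no existing location to conflict with
      }
      if (pendingClustering.contains(key(record.partitionPath, record.currentFileId))) {
        throw new IllegalStateException("Record " + record.recordKey
            + " updates file group " + record.currentFileId + " in partition "
            + record.partitionPath + ", which is in pending clustering");
      }
    }
  }
}
```

Because this depends only on tagged locations, it would also cover the code path where no workload profile is built.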

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
##########
@@ -103,6 +104,9 @@ public BaseSparkCommitActionExecutor(HoodieEngineContext context,
     if (isWorkloadProfileNeeded()) {
       profile = new WorkloadProfile(buildProfile(inputRecordsRDD));
       LOG.info("Workload profile :" + profile);
+      // apply clustering update strategy.

Review comment:
       I think we need to think more about the algorithm here. For example:
   1. f1, f2, f3 are file groups in a partition.
   2. Assume there is a pending clustering on all of f1, f2, f3.
   3. f3 is a small file, so buildProfile would assign inserts to f3.
   4. Applying the update strategy will then throw an error because f3 is
included in the pending clustering.
   
   Instead, we may want to change buildProfile to exclude file groups that are
in pending clustering. A new file f4 would then be created in step 3, and
ingestion can continue. This way inserts can continue without errors.
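The scenario above can be sketched as follows. This is only an illustration under stated assumptions: `SmallFileAssignmentSketch` and `chooseInsertTarget` are hypothetical names, file sizes are passed in as a plain map, and the real small-file handling in buildProfile is more involved.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class SmallFileAssignmentSketch {

  /**
   * Picks an existing small file to receive inserts, skipping file groups in pending clustering.
   * An empty result means no existing file qualifies, so a new file group (f4 in the example)
   * would have to be created for the inserts.
   */
  static Optional<String> chooseInsertTarget(Map<String, Long> fileSizesBytes,
                                             Set<String> pendingClusteringFileIds,
                                             long smallFileLimitBytes) {
    for (Map.Entry<String, Long> file : fileSizesBytes.entrySet()) {
      boolean isSmall = file.getValue() < smallFileLimitBytes;
      boolean isPending = pendingClusteringFileIds.contains(file.getKey());
      if (isSmall && !isPending) {
        return Optional.of(file.getKey());
      }
    }
    return Optional.empty();
  }
}
```

With f1, f2, f3 all in pending clustering, the method returns an empty result even though f3 is small, so inserts go to a new file group and the update strategy has nothing to reject.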

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clustering/update/RejectUpdateStrategy.java
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.clustering.update;
+
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieClusteringUpdateException;
+import org.apache.hudi.table.WorkloadProfile;
+import org.apache.hudi.table.WorkloadStat;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+public class RejectUpdateStrategy implements UpdateStrategy {
+  private static final Logger LOG = LogManager.getLogger(RejectUpdateStrategy.class);
+
+  @Override
+  public void apply(List<Pair<HoodieFileGroupId, HoodieInstant>> fileGroupsInPendingClustering, WorkloadProfile workloadProfile) {
+    List<Pair<String, String>> partitionPathAndFileIds = fileGroupsInPendingClustering.stream()
+        .map(entry -> Pair.of(entry.getLeft().getPartitionPath(), entry.getLeft().getFileId())).collect(Collectors.toList());
+    if (partitionPathAndFileIds.size() == 0) {
+      return;
+    }
+
+    Set<Map.Entry<String, WorkloadStat>> partitionStatEntries = workloadProfile.getPartitionPathStatMap().entrySet();
+    for (Map.Entry<String, WorkloadStat> partitionStat : partitionStatEntries) {
+      for (Map.Entry<String, Pair<String, Long>> updateLocEntry :
+              partitionStat.getValue().getUpdateLocationToCount().entrySet()) {
+        String partitionPath = partitionStat.getKey();
+        String fileId = updateLocEntry.getKey();
+        if (partitionPathAndFileIds.contains(Pair.of(partitionPath, fileId))) {

Review comment:
       Since you are calling contains inside the loop, would it be better to change its type to Set?
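A minimal sketch of the change being suggested, collecting into a set so each `contains` call is a hash lookup instead of a linear scan. `FileGroupId` and `key` are hypothetical stand-ins for `HoodieFileGroupId` and `Pair.of`; the sketch assumes the pair type has value-based `equals` and `hashCode`, which `java.util.AbstractMap.SimpleImmutableEntry` does.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class PendingFileGroupLookupSketch {

  /** Hypothetical stand-in for HoodieFileGroupId. */
  static final class FileGroupId {
    final String partitionPath;
    final String fileId;

    FileGroupId(String partitionPath, String fileId) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
    }
  }

  /** Builds a (partitionPath, fileId) key with value-based equals and hashCode. */
  static Map.Entry<String, String> key(String partitionPath, String fileId) {
    return new AbstractMap.SimpleImmutableEntry<>(partitionPath, fileId);
  }

  /** Collects the pending file groups into a set for constant-time membership checks. */
  static Set<Map.Entry<String, String>> partitionPathAndFileIds(List<FileGroupId> pending) {
    return pending.stream()
        .map(group -> key(group.partitionPath, group.fileId))
        .collect(Collectors.toSet());
  }
}
```

The loop body stays the same, for example `lookup.contains(key(partitionPath, fileId))`, and duplicates in the input collapse automatically.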




