nsivabalan commented on code in PR #12376:
URL: https://github.com/apache/hudi/pull/12376#discussion_r1863880521


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -809,13 +797,9 @@ protected Map<String, String> getSecondaryKeysForRecordKeys(List<String> recordK
     }
 
     // Parallel lookup keys from each file slice
-    Map<String, String> reverseSecondaryKeyMap = new HashMap<>();
-    partitionFileSlices.parallelStream().forEach(fileSlice -> {
-      Map<String, String> partialResult = reverseLookupSecondaryKeys(partitionName, recordKeys, fileSlice);
-      synchronized (reverseSecondaryKeyMap) {
-        reverseSecondaryKeyMap.putAll(partialResult);
-      }
-    });
+    Map<String, String> reverseSecondaryKeyMap = new HashMap<>(recordKeys.size());
+    getEngineContext().setJobStatus(this.getClass().getSimpleName(), "Lookup secondary keys from metadata table partition " + partitionName);
+    getEngineContext().map(partitionFileSlices, fileSlice -> reverseLookupSecondaryKeys(partitionName, recordKeys, fileSlice), partitionFileSlices.size()).forEach(reverseSecondaryKeyMap::putAll);

Review Comment:
   We can't update the HashMap on the driver concurrently from N Spark tasks. Let's collect the per-task partial maps at the end and merge them sequentially on the driver instead.
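
   The safe pattern the review points at can be sketched as follows (a minimal, self-contained illustration; `MergeDemo`, `merge`, and the sample data are hypothetical stand-ins for the per-file-slice results of `reverseLookupSecondaryKeys`, not Hudi APIs):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeDemo {
  // Merge per-task partial maps into one map on the "driver" side,
  // after all tasks have finished. Because the merge happens once,
  // sequentially, over already-collected results, no synchronized
  // block or concurrent map is needed.
  static Map<String, String> merge(List<Map<String, String>> partials) {
    Map<String, String> out = new HashMap<>();
    partials.forEach(out::putAll);
    return out;
  }

  public static void main(String[] args) {
    // Stand-ins for the partial reverse-lookup results each task returns.
    List<Map<String, String>> partials = Arrays.asList(
        Collections.singletonMap("rk1", "sk1"),
        Collections.singletonMap("rk2", "sk2"));
    System.out.println(merge(partials).size());
  }
}
```

   The original code instead mutated a single shared `HashMap` from inside `parallelStream().forEach(...)`, which only works with explicit synchronization on a single JVM and does not translate to distributed Spark tasks; collecting first and merging after avoids both problems.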



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
