Re: [PR] perf(spark): Parse bucket index hash-field config once instead of per… [hudi]

via GitHub Fri, 12 Jun 2026 20:31:22 -0700


wombatu-kun commented on code in PR #18979:
URL: https://github.com/apache/hudi/pull/18979#discussion_r3407321135



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java:
##########
@@ -129,7 +132,7 @@ public int getPartition(Object key) {
     Option<HoodieRecordLocation> location = keyLocation._2;
     int bucketId = location.isPresent()
         ? BucketIdentifier.bucketIdFromFileId(location.get().getFileId())
-        : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(), indexKeyField, numBuckets);
+        : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(), indexKeyFieldList, numBuckets);

Review Comment:
   `SparkPartitionBucketIndexPartitioner.getPartition` 
(`SparkPartitionBucketIndexPartitioner.java:163`) keeps the identical 
per-record `getBucketId(recordKey, String indexKeyField, ...)` call this PR 
replaces here, and it is the default partitioner for partition-level simple 
bucket index (`HoodieLayoutConfig.java:101`), so the per-record re-parse is not 
removed on that path. The same precompute-to-List fix applies directly.
   
   Two other Spark write paths still re-parse per record via the String 
overloads: `ConsistentBucketIndexBulkInsertPartitionerWithRows.java:182` and 
`SingleSparkJobConsistentHashingExecutionStrategy.java:208`. If these are 
intentionally out of scope, a note would help; otherwise they are natural 
follow-ups.
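
   For illustration, here is a minimal, self-contained sketch of the precompute-to-List pattern. It does not use Hudi's classes: `BucketIdSketch`, `parseFields`, the `field:value` record-key format, and the hashing are all hypothetical stand-ins, chosen only to show where the parse moves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BucketIdSketch {

  // Hypothetical stand-in for parsing the comma-separated hash-field config.
  static List<String> parseFields(String indexKeyField) {
    return Arrays.stream(indexKeyField.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  // List overload: no config parsing on the per-record path.
  // Assumes (for this sketch only) record keys shaped like "id:42,region:eu".
  static int bucketId(String recordKey, List<String> fields, int numBuckets) {
    List<String> values = new ArrayList<>();
    for (String part : recordKey.split(",")) {
      String[] kv = part.split(":", 2);
      if (kv.length == 2 && fields.contains(kv[0])) {
        values.add(kv[1]);
      }
    }
    return Math.floorMod(values.hashCode(), numBuckets);
  }

  // String overload: re-parses the config on every call.
  static int bucketId(String recordKey, String indexKeyField, int numBuckets) {
    return bucketId(recordKey, parseFields(indexKeyField), numBuckets);
  }

  // Partitioner that parses the config once, at construction time.
  static final class Partitioner {
    private final List<String> indexKeyFieldList;
    private final int numBuckets;

    Partitioner(String indexKeyField, int numBuckets) {
      this.indexKeyFieldList = parseFields(indexKeyField);
      this.numBuckets = numBuckets;
    }

    int getPartition(String recordKey) {
      return bucketId(recordKey, indexKeyFieldList, numBuckets);
    }
  }
}
```

   Both overloads return the same bucket for the same inputs; the only difference is whether the config string is split once per partitioner or once per record.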



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

