Re: [PR] perf(spark): Parse bucket index hash-field config once instead of per… [hudi]

via GitHub Fri, 12 Jun 2026 20:35:59 -0700


wombatu-kun commented on code in PR #18979:
URL: https://github.com/apache/hudi/pull/18979#discussion_r3407330354



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java:
##########
@@ -80,7 +83,7 @@ public SparkBucketIndexPartitioner(WorkloadProfile profile,
               + table.getIndex().getClass().getSimpleName());
     }
     this.numBuckets = ((HoodieBucketIndex) table.getIndex()).getNumBuckets();
-    this.indexKeyField = config.getBucketIndexHashField();
+    this.indexKeyFieldList = KeyGenUtils.getIndexKeyFields(config.getBucketIndexHashField());

Review Comment:
   The same per-record re-parse exists on the Flink bucket-index write path and 
can be fixed the same way: precompute `KeyGenUtils.getIndexKeyFields` once and 
call the List overload. Affected sites:

   - `BucketIndexPartitioner.java:56` and `BucketIndexRemotePartitioner.java:62`: 
both hold a `String indexKeyFields` field that is re-parsed in `partition()`.
   - `BucketStreamWriteFunction.java:147`: the field is set in setup and 
re-parsed on every `processElement` call.
   - `BucketBulkInsertWriterHelper.java:103`: the `indexKeys` String is threaded 
per record from the `Pipelines.java:151` map stage, so the fix needs the 
static method's signature to take a List.
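
   A minimal sketch of the pattern, not the actual Hudi or Flink code: 
`parseIndexKeyFields` is a hypothetical stand-in for 
`KeyGenUtils.getIndexKeyFields`, the two `bucketId` overloads and the 
`Partitioner` class are illustrative names, and the hash is a placeholder 
that is not claimed to match Hudi's bucket assignment. It only shows the 
config being parsed once in the constructor and the List overload being 
called per record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketKeyParseSketch {

  // Hypothetical stand-in for KeyGenUtils.getIndexKeyFields: splits a
  // comma-separated hash-field config into trimmed, non-empty field names.
  public static List<String> parseIndexKeyFields(String hashFieldConfig) {
    return Arrays.stream(hashFieldConfig.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  // List overload: no config parsing on the per-record path.
  public static int bucketId(Map<String, String> record, List<String> keyFields, int numBuckets) {
    List<String> values = new ArrayList<>(keyFields.size());
    for (String field : keyFields) {
      values.add(record.get(field));
    }
    // Placeholder hash for illustration only.
    return (values.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // String overload: re-parses the config on every call (the pattern to avoid).
  public static int bucketId(Map<String, String> record, String hashFieldConfig, int numBuckets) {
    return bucketId(record, parseIndexKeyFields(hashFieldConfig), numBuckets);
  }

  // Illustrative partitioner that holds the parsed list instead of the raw String.
  public static final class Partitioner {
    private final List<String> indexKeyFieldList;
    private final int numBuckets;

    public Partitioner(String hashFieldConfig, int numBuckets) {
      this.indexKeyFieldList = parseIndexKeyFields(hashFieldConfig); // parsed once
      this.numBuckets = numBuckets;
    }

    public int partition(Map<String, String> record) {
      return bucketId(record, indexKeyFieldList, numBuckets); // List overload
    }
  }

  public static void main(String[] args) {
    Map<String, String> record = new HashMap<>();
    record.put("id", "42");
    record.put("region", "eu");

    Partitioner partitioner = new Partitioner("id, region", 8);
    System.out.println("parsed once:      " + partitioner.partition(record));
    System.out.println("parsed per call:  " + bucketId(record, "id, region", 8));
  }
}
```

   Both paths return the same bucket; the difference is only where the split 
happens. For a static helper fed from a map stage, the same idea applies by 
parsing before the stage and passing the List through.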
   
   Since the change is mechanical and this PR is small, consider folding the 
Flink mirror into it, or filing a Flink follow-up under the same issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

