voonhous commented on code in PR #18979:
URL: https://github.com/apache/hudi/pull/18979#discussion_r3407479318
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java:
##########
@@ -129,7 +132,7 @@ public int getPartition(Object key) {
Option<HoodieRecordLocation> location = keyLocation._2;
int bucketId = location.isPresent()
? BucketIdentifier.bucketIdFromFileId(location.get().getFileId())
- : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(), indexKeyField, numBuckets);
+ : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(), indexKeyFieldList, numBuckets);
Review Comment:
Good catch -- folded all three Spark sites into this PR:
`SparkPartitionBucketIndexPartitioner` (the default partition-level
simple-bucket partitioner) and the two consistent-hashing paths,
`ConsistentBucketIndexBulkInsertPartitionerWithRows` and
`SingleSparkJobConsistentHashingExecutionStrategy`. Each now computes
`KeyGenUtils.getIndexKeyFields(...)` once up front and passes the result to
the existing `List` overload.
I also swept the remaining bucket-index call sites to confirm nothing else
re-parses the key field string. Everything else
(`BucketIndexBulkInsertPartitioner`, `HoodieBucketIndex` /
`HoodieSimpleBucketIndex` / `HoodieConsistentBucketIndex`,
`SparkConsistentBucketDuplicateUpdateStrategy`, and the read-side
`BucketIndexSupport`) already takes a `List`, so these three were the only
remaining Spark re-parse sites.
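
For anyone skimming the thread, here is a minimal sketch of the pattern, not the actual Hudi code. `IndexKeyFieldSketch`, `parseIndexKeyFields`, and both `bucketId` overloads are hypothetical stand-ins, and the `field:value,field:value` record-key format and the hashing are simplifying assumptions. It only shows why parsing the key field config once and passing a `List` keeps the per-record path free of repeated string parsing while giving the same bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IndexKeyFieldSketch {

    // Hypothetical stand-in: split a comma-separated config value into field names.
    public static List<String> parseIndexKeyFields(String indexKeyField) {
        return Arrays.stream(indexKeyField.split(","))
                .map(String::trim)
                .filter(f -> !f.isEmpty())
                .collect(Collectors.toList());
    }

    // List overload: the config is already parsed, so the per-record work is
    // limited to picking the index key values out of the record key and hashing them.
    public static int bucketId(String recordKey, List<String> indexKeyFields, int numBuckets) {
        List<String> hashKeys = new ArrayList<>();
        for (String pair : recordKey.split(",")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2 && indexKeyFields.contains(kv[0])) {
                hashKeys.add(kv[1]);
            }
        }
        // floorMod keeps the result in [0, numBuckets) even for negative hashes.
        return Math.floorMod(hashKeys.hashCode(), numBuckets);
    }

    // String overload: re-parses the config on every call (the pattern being removed).
    public static int bucketId(String recordKey, String indexKeyField, int numBuckets) {
        return bucketId(recordKey, parseIndexKeyFields(indexKeyField), numBuckets);
    }

    public static void main(String[] args) {
        String config = "id, region";
        // Parse once, outside the per-record loop.
        List<String> fields = parseIndexKeyFields(config);
        String[] recordKeys = {"id:1,region:eu,ts:10", "id:2,region:us,ts:11"};
        for (String key : recordKeys) {
            int viaList = bucketId(key, fields, 8);
            int viaString = bucketId(key, config, 8);
            System.out.println(key + " -> bucket " + viaList
                    + ", matches String overload: " + (viaList == viaString));
        }
    }
}
```

Both overloads return the same bucket for a given key; the only difference is where the parsing cost is paid.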