wombatu-kun commented on code in PR #18979:
URL: https://github.com/apache/hudi/pull/18979#discussion_r3407321135
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java:
##########
@@ -129,7 +132,7 @@ public int getPartition(Object key) {
Option<HoodieRecordLocation> location = keyLocation._2;
int bucketId = location.isPresent()
? BucketIdentifier.bucketIdFromFileId(location.get().getFileId())
- : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(),
indexKeyField, numBuckets);
+ : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(),
indexKeyFieldList, numBuckets);
Review Comment:
`SparkPartitionBucketIndexPartitioner.getPartition`
(`SparkPartitionBucketIndexPartitioner.java:163`) still makes the same
per-record `getBucketId(recordKey, String indexKeyField, ...)` call that this PR
replaces here. It is the default partitioner for the partition-level simple
bucket index (`HoodieLayoutConfig.java:101`), so the per-record re-parse
remains on that path. The same precompute-to-List fix applies directly.
Two other Spark write paths still re-parse per record via the String
overloads: `ConsistentBucketIndexBulkInsertPartitionerWithRows.java:182` and
`SingleSparkJobConsistentHashingExecutionStrategy.java:208`. If these are
intentionally out of scope, a note would help; otherwise they are natural
follow-ups.
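For reference, a minimal standalone sketch of the precompute-to-List pattern I mean. It deliberately does not call Hudi: `BucketKeySketch`, `parseFields`, `bucketId`, and the nested `Partitioner` are all hypothetical stand-ins, and the hashing is a placeholder rather than what `BucketIdentifier` actually does. The only point is that the field string is parsed once in the constructor instead of once per record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; none of these names exist in Hudi.
class BucketKeySketch {

    // Assumed parsing of a comma-separated index key field config value.
    static List<String> parseFields(String indexKeyField) {
        List<String> fields = new ArrayList<>();
        for (String raw : indexKeyField.split(",")) {
            String trimmed = raw.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    // Placeholder hashing, NOT Hudi's real bucket hashing.
    static int bucketId(String recordKey, List<String> fields, int numBuckets) {
        int hash = (recordKey + "|" + String.join(",", fields)).hashCode();
        return Math.floorMod(hash, numBuckets);
    }

    // The shape being replaced: the String is re-parsed on every call.
    static int bucketIdReparsing(String recordKey, String indexKeyField, int numBuckets) {
        return bucketId(recordKey, parseFields(indexKeyField), numBuckets);
    }

    // The suggested shape: parse once at construction, reuse per record.
    static final class Partitioner {
        private final List<String> indexKeyFieldList;
        private final int numBuckets;

        Partitioner(String indexKeyField, int numBuckets) {
            this.indexKeyFieldList = parseFields(indexKeyField);
            this.numBuckets = numBuckets;
        }

        int getPartition(String recordKey) {
            return bucketId(recordKey, indexKeyFieldList, numBuckets);
        }
    }

    public static void main(String[] args) {
        Partitioner partitioner = new Partitioner("id, region", 8);
        for (String key : new String[] {"a", "b", "id:42"}) {
            // Both shapes should agree; only the per-record parsing cost differs.
            boolean same = partitioner.getPartition(key)
                    == bucketIdReparsing(key, "id, region", 8);
            System.out.println(key + " same bucket: " + same);
        }
    }
}
```

In the real partitioners, the equivalent change would presumably be a field initialized in the constructor and passed to the List overload used in this diff.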