voonhous opened a new issue, #18978: URL: https://github.com/apache/hudi/issues/18978
### Describe the problem

`SparkBucketIndexPartitioner#getPartition` calls the `BucketIdentifier.getBucketId` overload that takes the raw comma-separated hash-field config String. That overload re-parses the immutable config value on every call, allocating per record in the shuffle of every upsert or insert into simple-bucket-index tables.

The same pattern repeats on the row paths: the `BucketPartitionUtils.createDataFrame` keyBy closure and `BucketBulkInsertDataInternalWriterHelper#write` re-parse the same config string per row in bucket-index row-writer bulk inserts.

### Proposed fix

Precompute the parsed field list once per partitioner or writer with `KeyGenUtils.getIndexKeyFields` (the exact parser the String overload uses today, including trim, empty-token filtering and null handling) and call the existing List-taking `getBucketId` overload. Bucket ids are bit-identical since the downstream chain is unchanged; `HoodieBucketIndex` already follows this pattern for the tagging path.

Will raise a PR for this.
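A minimal, self-contained sketch of the proposed pattern, for illustration only. `BucketIdSketch`, `parseHashFields`, `bucketId` and the nested `Partitioner` are hypothetical stand-ins, not Hudi's actual classes; the hashing and the map-based record are simplified assumptions. It only shows that parsing the config once in the constructor and calling the List-taking overload yields the same bucket id as re-parsing the String on every call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; these are NOT Hudi's real implementations.
final class BucketIdSketch {

  // Stand-in parser: split on commas, trim each token, drop empty tokens.
  static List<String> parseHashFields(String hashFieldConfig) {
    List<String> fields = new ArrayList<>();
    if (hashFieldConfig == null) {
      return fields;
    }
    for (String token : hashFieldConfig.split(",")) {
      String trimmed = token.trim();
      if (!trimmed.isEmpty()) {
        fields.add(trimmed);
      }
    }
    return fields;
  }

  // List-taking overload: no parsing, just hashing of the selected values.
  static int bucketId(Map<String, String> record, List<String> hashFields, int numBuckets) {
    List<String> values = new ArrayList<>();
    for (String field : hashFields) {
      values.add(record.get(field));
    }
    return (values.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // String-taking overload: re-parses the config on every call (the costly path).
  static int bucketId(Map<String, String> record, String hashFieldConfig, int numBuckets) {
    return bucketId(record, parseHashFields(hashFieldConfig), numBuckets);
  }

  // Partitioner that parses the immutable config once, at construction time.
  static final class Partitioner {
    private final List<String> hashFields;
    private final int numBuckets;

    Partitioner(String hashFieldConfig, int numBuckets) {
      this.hashFields = Collections.unmodifiableList(parseHashFields(hashFieldConfig));
      this.numBuckets = numBuckets;
    }

    int getPartition(Map<String, String> record) {
      return bucketId(record, hashFields, numBuckets);
    }
  }

  public static void main(String[] args) {
    Map<String, String> record = new HashMap<>();
    record.put("id", "42");
    record.put("region", "eu");
    Partitioner partitioner = new Partitioner(" id, ,region ", 8);
    // Both paths go through the same downstream hashing, so the ids match.
    System.out.println(
        partitioner.getPartition(record) == bucketId(record, " id, ,region ", 8));
  }
}
```

Because both overloads share the same downstream chain in this sketch, the only behavioral difference is where the parsing cost is paid: once per partitioner instead of once per record.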
