voonhous opened a new issue, #18978:
URL: https://github.com/apache/hudi/issues/18978

   ### Describe the problem
   
   `SparkBucketIndexPartitioner#getPartition` calls the 
`BucketIdentifier.getBucketId` overload that takes the raw comma-separated 
hash-field config String. That overload re-parses the immutable config value on 
every call, so every record in the shuffle of an upsert or insert into a 
simple-bucket-index table pays for a fresh parse and its allocations.
   
   The same pattern repeats on the row paths: the 
`BucketPartitionUtils.createDataFrame` keyBy closure and 
`BucketBulkInsertDataInternalWriterHelper#write` re-parse the same config 
string per row in bucket-index row-writer bulk inserts.
   
   ### Proposed fix
   
   Precompute the parsed field list once per partitioner or writer with 
`KeyGenUtils.getIndexKeyFields` (the exact parser the String overload uses 
today, including trim, empty-token filtering and null handling) and call the 
existing List-taking `getBucketId` overload. Bucket ids are bit-identical since 
the downstream chain is unchanged; `HoodieBucketIndex` already follows this 
pattern for the tagging path.
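
For illustration only, here is a minimal standalone sketch of the parse-once pattern. Every name in it (`BucketIdSketch`, `parseIndexKeyFields`, `bucketId`, `SketchPartitioner`) is a hypothetical stand-in rather than Hudi code, and the hash is a placeholder, not Hudi's actual bucket function. It only shows that hoisting the parse into the constructor leaves the computed id unchanged, assuming the String overload simply delegates to the List overload.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BucketIdSketch {

  // Hypothetical parser: split on commas, trim, drop empty tokens.
  static List<String> parseIndexKeyFields(String indexKeyFields) {
    return Arrays.stream(indexKeyFields.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  // String overload: re-parses the config on every call (the per-record cost).
  static int bucketId(Map<String, String> record, String indexKeyFields, int numBuckets) {
    return bucketId(record, parseIndexKeyFields(indexKeyFields), numBuckets);
  }

  // List overload: takes the already-parsed fields. Placeholder hash only.
  static int bucketId(Map<String, String> record, List<String> fields, int numBuckets) {
    List<String> values = fields.stream().map(record::get).collect(Collectors.toList());
    return (values.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Partitioner-like holder that parses the config once at construction.
  static final class SketchPartitioner {
    private final List<String> fields;
    private final int numBuckets;

    SketchPartitioner(String indexKeyFields, int numBuckets) {
      this.fields = parseIndexKeyFields(indexKeyFields); // parsed once, reused per record
      this.numBuckets = numBuckets;
    }

    int getPartition(Map<String, String> record) {
      return bucketId(record, fields, numBuckets); // no parsing on the hot path
    }
  }
}
```

Because both paths end in the same List-taking method, the id for a given record is the same whether the config string is parsed per call or once up front.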
   
   Will raise a PR for this.

