voonhous opened a new issue, #18964:
URL: https://github.com/apache/hudi/issues/18964

   ### Describe the problem
   
   `HoodieMetadataPayload#createRecordIndexUpdate` calls 
`TimelineUtils.parseDateFromInstantTime(instantTime).getTime()` for every 
record, even though the instant time is the same string for the entire commit. 
The parse runs a string-slicing compatibility fixup plus `LocalDateTime.parse` 
with a `DateTimeFormatter` per call. Per-record callers include RLI record 
generation for base files, revived keys, and record-index initialization.
   
   The read side mirrors it: 
`HoodieTableMetadataUtil#getLocationFromRecordIndexInfo` runs `new Date(...)` 
plus `HoodieInstantTimeGenerator.formatDate` per looked-up record during 
record-index lookups, although the set of distinct instant times is tiny (one 
per commit).
   
   For a 10M-record commit this amounts to roughly 10M redundant date parses on 
the write side, plus a comparable number of redundant date-format calls for 
each upsert-tagging lookup on the read side.
   
   ### Proposed fix
   
   - Add an overload `createRecordIndexUpdate(recordKey, partition, fileId, 
instantTimeMillis, fileIdEncoding)` and keep the String overload delegating to 
it after a single parse; batch callers parse once outside their per-record 
loops.
   - In `getLocationFromRecordIndexInfo`, memoize the millis-to-instant-string 
formatting with a small bounded cache keyed by the millis value.
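
A minimal sketch of the write-side change, not the actual Hudi code. The class `RecordIndexUpdateSketch`, the `IndexEntry` type, and the `createUpdate`/`createUpdates` methods are hypothetical stand-ins for `HoodieMetadataPayload#createRecordIndexUpdate` and its batch callers. The sketch assumes a 17-digit `yyyyMMddHHmmssSSS` instant time interpreted in UTC and leaves out the compatibility fixup; `parseCount` exists only to make the number of parses visible.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecordIndexUpdateSketch {
  // Assumed instant-time layout: yyyyMMddHHmmss followed by three millisecond digits.
  static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
      .appendPattern("yyyyMMddHHmmss")
      .appendValue(ChronoField.MILLI_OF_SECOND, 3)
      .toFormatter();

  // Instrumentation for the sketch only: counts how often the parse runs.
  static int parseCount = 0;

  // Hypothetical stand-in for the record-index payload.
  static final class IndexEntry {
    final String recordKey;
    final String partition;
    final String fileId;
    final long instantTimeMillis;

    IndexEntry(String recordKey, String partition, String fileId, long instantTimeMillis) {
      this.recordKey = recordKey;
      this.partition = partition;
      this.fileId = fileId;
      this.instantTimeMillis = instantTimeMillis;
    }
  }

  // Stand-in for the instant-time parse; UTC is an assumption of this sketch.
  static long parseInstantMillis(String instantTime) {
    parseCount++;
    return LocalDateTime.parse(instantTime, FORMATTER).toInstant(ZoneOffset.UTC).toEpochMilli();
  }

  // New overload: takes the already-parsed millis, so it does no date work.
  static IndexEntry createUpdate(String recordKey, String partition, String fileId,
                                 long instantTimeMillis) {
    return new IndexEntry(recordKey, partition, fileId, instantTimeMillis);
  }

  // Existing String overload: parses once, then delegates.
  static IndexEntry createUpdate(String recordKey, String partition, String fileId,
                                 String instantTime) {
    return createUpdate(recordKey, partition, fileId, parseInstantMillis(instantTime));
  }

  // Batch caller: the parse is hoisted out of the per-record loop.
  static List<IndexEntry> createUpdates(List<String> recordKeys, String partition,
                                        String fileId, String instantTime) {
    long instantTimeMillis = parseInstantMillis(instantTime);
    List<IndexEntry> updates = new ArrayList<>(recordKeys.size());
    for (String recordKey : recordKeys) {
      updates.add(createUpdate(recordKey, partition, fileId, instantTimeMillis));
    }
    return updates;
  }

  public static void main(String[] args) {
    List<IndexEntry> updates =
        createUpdates(Arrays.asList("k1", "k2", "k3"), "2024/01/01", "file-1", "20240101000000000");
    System.out.println(updates.size() + " updates, " + parseCount + " parse(s)");
  }
}
```

Keeping the String overload as a thin delegate means single-record callers need no changes, while batch callers can opt in to the millis overload.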
   
   Output records and decoded locations are unchanged.
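
For the read side, the memoization could look roughly like the sketch below. `InstantTimeFormatCache` is a hypothetical helper, not an existing Hudi class. The LRU eviction policy, the capacity, the UTC zone, and the 17-digit output format are all assumptions made for illustration, and `formatCalls` is there only to show how often the real formatting runs.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.LinkedHashMap;
import java.util.Map;

class InstantTimeFormatCache {
  // Assumed instant-time layout: yyyyMMddHHmmss followed by three millisecond digits.
  private static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
      .appendPattern("yyyyMMddHHmmss")
      .appendValue(ChronoField.MILLI_OF_SECOND, 3)
      .toFormatter();

  private final Map<Long, String> cache;

  // Instrumentation for the sketch only: counts uncached format calls.
  int formatCalls = 0;

  InstantTimeFormatCache(final int maxEntries) {
    // An access-ordered LinkedHashMap evicts the least recently used entry once the cap is exceeded.
    this.cache = new LinkedHashMap<Long, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
        return size() > maxEntries;
      }
    };
  }

  // Returns the instant-time string for the given millis, formatting only on a cache miss.
  synchronized String format(long millis) {
    String cached = cache.get(millis);
    if (cached == null) {
      cached = formatUncached(millis);
      cache.put(millis, cached);
    }
    return cached;
  }

  synchronized int size() {
    return cache.size();
  }

  // Stand-in for the per-record date formatting; UTC is an assumption of this sketch.
  private String formatUncached(long millis) {
    formatCalls++;
    return FORMATTER.format(LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC));
  }

  public static void main(String[] args) {
    InstantTimeFormatCache cache = new InstantTimeFormatCache(64);
    for (int i = 0; i < 1000; i++) {
      cache.format(1704067200000L);
    }
    System.out.println(cache.formatCalls + " format call(s) for 1000 lookups");
  }
}
```

Because the number of distinct instant times is small, even a modest cap should give a near-perfect hit rate, and the bound keeps memory use fixed on long-lived readers.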
   
   Will raise a PR for this.

