hudi-bot opened a new issue, #16913:
URL: https://github.com/apache/hudi/issues/16913
In the metadata table (MDT), the file group that a metadata record is written
to is determined by hashing the record key
("HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex"). This allows
hash-based joins and key lookups to read only the relevant file group(s) in
an MDT partition, as is done for the record-level index.
However, the secondary index uses
<secondary_column_value$record_key_value> as the metadata payload key. When
looking up by "secondary_column_value" only, we cannot use the hash-based
join and lookup, because the full key is needed to determine the file group
to read in MDT. We therefore have to scan all file groups in MDT for such
lookups, which is inefficient when the secondary index is large and becomes a
performance bottleneck when looking up many keys.
To solve the problem, we should use secondary_column_value only for
determining the file group, while still keeping
<secondary_column_value$record_key_value> as the metadata payload record key.
During lookup, the secondary_column_value alone is then enough to identify
the file group(s) to read, which enables the hash-based join and lookup on
the secondary index.
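The proposal can be illustrated with a minimal sketch. This is not Hudi's
implementation: the class name, method names, and hash function below are
hypothetical stand-ins, and the sketch assumes the secondary column value
contains no unescaped "$". It only shows that hashing the secondary-value
prefix of the payload key makes the write-side and lookup-side file group
agree.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; names and hash function are NOT Hudi's actual API.
final class SecondaryIndexSharding {
    // Separator between secondary column value and record key in the payload key.
    static final char SEPARATOR = '$';

    // Full metadata payload key: <secondary_column_value$record_key_value>.
    static String payloadKey(String secondaryValue, String recordKey) {
        return secondaryValue + SEPARATOR + recordKey;
    }

    // Stand-in hash: any deterministic hash works for the illustration.
    static int hashToFileGroup(String value, int numFileGroups) {
        int h = Arrays.hashCode(value.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(h, numFileGroups);
    }

    // Write path: hash only the secondary-value prefix of the payload key.
    // Assumes the secondary value itself contains no unescaped '$'.
    static int fileGroupForWrite(String payloadKey, int numFileGroups) {
        int idx = payloadKey.indexOf(SEPARATOR);
        String secondaryValue = idx >= 0 ? payloadKey.substring(0, idx) : payloadKey;
        return hashToFileGroup(secondaryValue, numFileGroups);
    }

    // Lookup path: only the secondary column value is known.
    static int fileGroupForLookup(String secondaryValue, int numFileGroups) {
        return hashToFileGroup(secondaryValue, numFileGroups);
    }
}
```

With this scheme, all payload keys that share a secondary column value land
in the same file group, so a lookup by secondary value reads one file group
instead of scanning all of them.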
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-9164
- Type: Improvement
- Fix version(s):
- 1.1.0
---
## Comments
13/Mar/25 17:50;yihua;Improved the description for clarity.;;;