hudi-bot opened a new issue, #16913:
URL: https://github.com/apache/hudi/issues/16913

   In the metadata table (MDT), we determine the file group that a metadata 
record is written to by hashing its record key 
("HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex").  This allows 
hash-based joins and lookups on keys to read only the specific file group(s) in 
the MDT partition, as is done for the record-level index.
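
   A minimal sketch of this idea, for illustration only: the class name, the 
method body and the use of String#hashCode are assumptions made here, not the 
actual logic of "HoodieTableMetadataUtil#mapRecordKeyToFileGroupIndex".

```java
import java.util.List;

public class FileGroupLookupSketch {
    // Hypothetical stand-in for the key-to-file-group mapping.
    // The hash function used by Hudi itself may differ.
    static int fileGroupIndex(String recordKey, int numFileGroups) {
        // floorMod keeps the result in [0, numFileGroups) even for negative hashes.
        return Math.floorMod(recordKey.hashCode(), numFileGroups);
    }

    public static void main(String[] args) {
        int numFileGroups = 8;
        // The same key always maps to the same file group, so a lookup
        // only needs to read that one file group.
        for (String key : List.of("uuid-001", "uuid-002", "uuid-003")) {
            System.out.println(key + " -> file group " + fileGroupIndex(key, numFileGroups));
        }
    }
}
```

   Because the mapping is a pure function of the full key, a reader that knows 
the key can compute the file group without consulting any other state.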
   
   However, the secondary index uses 
<secondary_column_value$record_key_value> as the metadata payload key.  When 
looking up by "secondary_column_value" only, we cannot use the hash-based join 
and lookup, because the full key is needed to determine the file group to read 
in MDT.  Thus we have to scan all file groups of the secondary index in MDT for 
lookups, which is inefficient if the secondary index is huge.  This can become 
a performance bottleneck when looking up a large number of keys on a large 
secondary index.
   
   To solve the problem, we should use secondary_column_value only for 
determining the file group, while still keeping 
<secondary_column_value$record_key_value> as the metadata payload record key.  
With this change, a lookup can determine the file group(s) to read from the 
secondary_column_value alone, which enables the hash-based join and lookup on 
the secondary index.
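
   The proposal can be sketched roughly as follows.  This is an illustration 
under stated assumptions, not the implementation: the class, the in-memory 
"file groups", the helper names and the String#hashCode-based hash are all 
hypothetical, and the sketch assumes that secondary values do not contain the 
"$" separator.

```java
import java.util.ArrayList;
import java.util.List;

public class SecondaryIndexRoutingSketch {
    static final String SEPARATOR = "$";

    // Proposed routing: hash only the secondary column value (hypothetical hash).
    static int fileGroupFor(String secondaryValue, int numFileGroups) {
        return Math.floorMod(secondaryValue.hashCode(), numFileGroups);
    }

    // The payload key itself is unchanged: secondary value, separator, record key.
    static String payloadKey(String secondaryValue, String recordKey) {
        return secondaryValue + SEPARATOR + recordKey;
    }

    // In-memory stand-in for the file groups of the secondary index partition.
    static List<List<String>> newFileGroups(int numFileGroups) {
        List<List<String>> fileGroups = new ArrayList<>();
        for (int i = 0; i < numFileGroups; i++) {
            fileGroups.add(new ArrayList<>());
        }
        return fileGroups;
    }

    static void write(List<List<String>> fileGroups, String secondaryValue, String recordKey) {
        int idx = fileGroupFor(secondaryValue, fileGroups.size());
        fileGroups.get(idx).add(payloadKey(secondaryValue, recordKey));
    }

    // Lookup by secondary value only: read one file group, then filter by key prefix.
    static List<String> lookup(List<List<String>> fileGroups, String secondaryValue) {
        int idx = fileGroupFor(secondaryValue, fileGroups.size());
        String prefix = secondaryValue + SEPARATOR;
        List<String> recordKeys = new ArrayList<>();
        for (String key : fileGroups.get(idx)) {
            if (key.startsWith(prefix)) {
                recordKeys.add(key.substring(prefix.length()));
            }
        }
        return recordKeys;
    }

    public static void main(String[] args) {
        List<List<String>> fileGroups = newFileGroups(4);
        write(fileGroups, "sf", "k1");
        write(fileGroups, "nyc", "k2");
        write(fileGroups, "sf", "k3");
        System.out.println(lookup(fileGroups, "sf")); // prints [k1, k3]
    }
}
```

   All entries sharing a secondary value land in the same file group, so the 
lookup touches one file group instead of scanning all of them.  One trade-off 
worth noting is that a heavily skewed secondary value concentrates its entries 
in a single file group.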
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-9164
   - Type: Improvement
   - Fix version(s):
     - 1.1.0
   
   
   ---
   
   
   ## Comments
   
   13/Mar/25 17:50;yihua;Improved the description for clarity.;;;


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

