[I] Implement a Hash(Bucket)-based Index [hudi]

via GitHub Sat, 29 Nov 2025 19:39:06 -0800


hudi-bot opened a new issue, #14722:
URL: https://github.com/apache/hudi/issues/14722


   This ticket introduces a new hash-based index, which can improve the
performance of write operations and speed up queries at the same time
(by removing the shuffle for Spark/Hive).
   
   The new hash-based index works with a customized hash-based partitioner,
which partitions records based on the hash value of the index keys and a fixed
bucket number. There is therefore no need to read the existing files to
determine which file group each record belongs to.
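
   The bucket assignment described above can be sketched as follows. This is a minimal illustration, not Hudi's implementation: `bucket_id` and `partition_records` are hypothetical names, and `zlib.crc32` is a stand-in for the actual hash function, chosen only because it is deterministic across runs.

```python
import zlib
from collections import defaultdict


def bucket_id(index_key: str, num_buckets: int) -> int:
    """Map an index key to a bucket using only the key and the bucket count."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # Assumption: crc32 stands in for the real hash function used by the index.
    return zlib.crc32(index_key.encode("utf-8")) % num_buckets


def partition_records(records, key_field, num_buckets):
    """Group records by bucket without consulting any existing files."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_id(str(record[key_field]), num_buckets)].append(record)
    return dict(buckets)
```

   Because the bucket depends only on the key and the fixed bucket number, the same key always maps to the same bucket on every write, which is what makes a lookup against existing files unnecessary.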
   
   Meanwhile, the file group ID, hash mode, and bucket number can be used by
query engines to eliminate the shuffle introduced by aggregations and joins.
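
   To illustrate why matching bucketing removes the shuffle, here is a hedged sketch of a bucket-wise join. It assumes both tables were bucketed on the join key with the same hash function and the same bucket number. The function names are hypothetical, and `zlib.crc32` is again only a stand-in hash, not what Spark, Hive, or Hudi actually use.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    # Assumption: crc32 stands in for the hash shared by both tables.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key_field, num_buckets):
    """Lay out rows by bucket, as a bucketed table would already be stored."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key_field], num_buckets)].append(row)
    return buckets


def bucket_join(left, right, key_field, num_buckets):
    """Join two already-bucketed tables one bucket pair at a time."""
    joined = []
    for b in range(num_buckets):
        # Equal keys always share a bucket id, so bucket b of the left table
        # only ever needs to meet bucket b of the right table.
        lookup = defaultdict(list)
        for row in left.get(b, []):
            lookup[row[key_field]].append(row)
        for row in right.get(b, []):
            for match in lookup.get(row[key_field], []):
                joined.append({**match, **row})
    return joined
```

   Each bucket pair can be processed independently, so no rows need to be redistributed across workers before the join. The same reasoning applies to an aggregation grouped by the bucketing key.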
   
   We implemented a HoodieIndex based on the Hive hash function, which is used
in ByteDance's production environment for many very large datasets, and we
hope to contribute this feature to the community soon.
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1503
   - Type: Wish
   - Epic: https://issues.apache.org/jira/browse/HUDI-3039
   - Fix version(s):
     - 1.1.0
   
   
   ---
   
   
   ## Comments
   
   22/Jan/21 05:44, shahmihir:
   
   #### [Shimin Yang](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=dangdangdang)
   
   Hello Mr. Yang,
   
   I would be interested in working on this issue. Is there any documentation
about the index or the project's design that would help me understand the
problem better?
   
   Thank you!


