[ https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269828#comment-17269828 ]
Mihir Shah commented on HUDI-1503:
----------------------------------

h4. [Shimin Yang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=dangdangdang]

Hello Mr. Yang,

I would be interested in working on this issue. Is there any documentation about the index or the project's design that would help me understand the problem better?

Thank you!

> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
>                 Key: HUDI-1503
>                 URL: https://issues.apache.org/jira/browse/HUDI-1503
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: Index, Performance
>            Reporter: Shimin Yang
>            Priority: Major
>
> This ticket is to introduce a new hash-based index, which can improve the
> performance of write operations and speed up queries at the same time
> (by removing the shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner,
> which partitions records based on the hash value of the index keys and a
> fixed bucket number. So there's no need to visit the existing files to
> determine which file group each record belongs to.
> Meanwhile, the file group id, hash mode and bucket num can be used by the
> query engines to eliminate the shuffle introduced by aggregations and joins.
> We implemented a HoodieIndex based on the Hive hash function, which is used
> in the production environment of ByteDance for many very large datasets, and
> we hope this feature can be contributed to the community soon.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
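
For readers new to the idea, the bucket assignment described in the ticket can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reporter's implementation and not Hudi's API: the function names (`polynomial_hash`, `bucket_id`, `partition_records`) are hypothetical, and a simple 31-based polynomial string hash stands in for the Hive hash function mentioned in the ticket.

```python
from collections import defaultdict


def polynomial_hash(key: str) -> int:
    """Stand-in hash (an assumption): 31-based polynomial over UTF-8 bytes,
    wrapped to a signed 32-bit integer."""
    h = 0
    for b in key.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed.
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_id(key: str, num_buckets: int) -> int:
    """Map an index key to a bucket using only the key and a fixed bucket count."""
    # Mask off the sign bit so the modulo result is always non-negative.
    return (polynomial_hash(key) & 0x7FFFFFFF) % num_buckets


def partition_records(records, key_field, num_buckets):
    """Group records by bucket without consulting any existing files."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_id(str(record[key_field]), num_buckets)].append(record)
    return dict(buckets)


if __name__ == "__main__":
    rows = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "a", "v": 3}]
    for bucket, members in sorted(partition_records(rows, "id", 4).items()):
        print(bucket, members)
```

Because the bucket depends only on the key and the fixed bucket count, two datasets bucketed with the same function and the same count place equal keys in the same bucket number, which is what would let a query engine join or aggregate bucket by bucket without a shuffle.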