[jira] [Commented] (HUDI-1503) Implement a Hash(Bucket)-based Index

Mihir Shah (Jira) Thu, 21 Jan 2021 21:45:10 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269828#comment-17269828
 ]


Mihir Shah commented on HUDI-1503:
----------------------------------

h4. [Shimin Yang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=dangdangdang]

Hello Mr. Yang,

I would be interested in working on this issue. Is there any documentation 
about the index or the project's design that would help me understand the 
problem better?

Thank you!

> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
>                 Key: HUDI-1503
>                 URL: https://issues.apache.org/jira/browse/HUDI-1503
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: Index, Performance
>            Reporter: Shimin Yang
>            Priority: Major
>
> This ticket introduces a new hash-based index, which can improve the 
> performance of write operations and speed up queries at the same time 
> (by removing the shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner, 
> which partitions records based on the hash value of the index keys and a 
> fixed bucket number. There is therefore no need to visit the existing files 
> to determine which file group each record belongs to.
> Meanwhile, the file group id, hash mode, and bucket number can be used by 
> the query engines to eliminate the shuffle introduced by aggregations and 
> joins.
> We implemented a HoodieIndex based on the Hive hash function, which is used 
> in ByteDance's production environment for many very large datasets, and we 
> hope this feature can be contributed to the community soon.
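
As a rough illustration of the bucketing idea in the description above, here 
is a minimal Python sketch. It is not Hudi's implementation or API: the 
function names, the bucket count of 4, and the use of zlib.crc32 as the hash 
are all assumptions made for the example (the ticket mentions a Hive hash 
function instead). It shows that a record's bucket follows from its key alone, 
and that two datasets bucketed the same way can be joined bucket by bucket 
without a global shuffle.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical fixed bucket number, chosen for illustration


def bucket_id(index_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an index key to a bucket with a deterministic hash.

    zlib.crc32 is a stand-in here; the ticket refers to a Hive hash function.
    """
    return zlib.crc32(index_key.encode("utf-8")) % num_buckets


def partition_records(records, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) records by bucket without looking at existing files."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets: int = NUM_BUCKETS):
    """Join two datasets one bucket at a time.

    Equal keys always hash to the same bucket, so each bucket can be joined
    independently of the others.
    """
    left_buckets = partition_records(left, num_buckets)
    right_buckets = partition_records(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build a per-bucket lookup table for the right side.
        right_lookup = defaultdict(list)
        for key, value in right_buckets.get(b, []):
            right_lookup[key].append(value)
        # Probe it with the left-side records of the same bucket.
        for key, value in left_buckets.get(b, []):
            for right_value in right_lookup.get(key, []):
                joined.append((key, value, right_value))
    return joined
```

For example, bucketed_join([("k1", 1), ("k2", 2)], [("k1", "a"), ("k3", "c")]) 
matches only the records with key "k1", since "k2" and "k3" have no partner 
in the other dataset.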



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
