nsivabalan commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r543739496

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs index to find and update the location of incoming records during write operations. To be specific, index assist in differentiating
+inserts vs updates. This blog talks about different indices and when to each of them.
+
+Hudi dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations, one for partitioned dataset
+and another for non-partitioned called as global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory

Review comment:
Even though the blog talks about only 3 of these, just to be comprehensive, have included InMemory also here.

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs index to find and update the location of incoming records during write operations. To be specific, index assist in differentiating
+inserts vs updates. This blog talks about different indices and when to each of them.
+
+Hudi dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations, one for partitioned dataset
+and another for non-partitioned called as global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory
+- Bloom
+- Simple
+- Hbase
+
+You could use “hoodie.index.type” to choose any of these indices.
+
+## Different workloads
+Since data comes in at different volumes, velocity and has different access patterns, different indices could be used for different workloads.
+Let’s walk through some of the typical workloads and see how to leverage Hudi index for such use-cases.
+
+### Fact table
+These are typical primary table in a dimensional model. It contains measures or quantitative figures and is used for analysis and decision making.
+For eg, trip tables in case of ride-sharing, user buying and selling of shares, or any other similar use-case can be categorized as fact tables.
+These tables are usually ever growing with random updates on most recent data with long tail of older data. In other words, most updates go into
+the latest partitions with few updates going to older ones.
+
+![Fact table](/assets/images/blog/hudi-indexes/Hudi_Index_Blog_Fact_table.png)
+Figure showing the spread of updates for Fact table.
+
+Hudi "BLOOM" index is the way to go for these kinds of tables, since index look-up will prune a lot of data files. So, effectively actual look up will
+happen only in a very few data files where the records are most likely present. This bloom index will also benefit a lot for use-cases where record
+keys have some kind of ordering (timestamp) among them. File pruning will cut down a lot of data files to be looked up resulting in very fast look-up times.
+On a high level, bloom index does pruning based on ranges of data files, followed by bloom filter look up. Depending on the workload, this could
+result in a lot of shuffling depending on the amount of data touched. Hudi is planning to support [record level indexing](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets?src=contextnavpagetreemode)

Review comment:
for now, have added links to RFCs. if you prefer to link jiras, can you assist me w/ right links(for all).
I was trying to look for secondary index and couldn't find a jira and hence resorted to use RFC links.