nsivabalan commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r543739496



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs an index to find and update the location of incoming records during write operations. Specifically, the index helps differentiate
+inserts from updates. This blog covers the different index types and when to use each of them.
+
+Hudi datasets are generally of two types, partitioned and non-partitioned. So, most index types have two implementations, one for partitioned datasets
+and another for non-partitioned datasets, called a global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory

Review comment:
       Even though the blog covers only three of these, I have included InMemory here as well to be comprehensive.
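
For reference, the four index types the post lists are selected through the `hoodie.index.type` write option it mentions. Below is a minimal sketch of building that option in Python. The helper name `hudi_index_options` is hypothetical, and the value spellings (including the `GLOBAL_*` variants for non-partitioned datasets) are my assumption of what Hudi accepts, so check them against the configuration reference for the Hudi version in use.

```python
# Assumed spellings of the values accepted by `hoodie.index.type`.
# GLOBAL_* are taken to be the non-partitioned ("global") variants.
INDEX_TYPES = {
    "INMEMORY",
    "BLOOM",
    "GLOBAL_BLOOM",
    "SIMPLE",
    "GLOBAL_SIMPLE",
    "HBASE",
}


def hudi_index_options(index_type: str) -> dict:
    """Return a write-option dict that selects the given index type.

    Hypothetical helper for illustration only; it is not part of Hudi.
    """
    normalized = index_type.strip().upper()
    if normalized not in INDEX_TYPES:
        raise ValueError(
            f"unknown index type {index_type!r}; expected one of {sorted(INDEX_TYPES)}"
        )
    return {"hoodie.index.type": normalized}


if __name__ == "__main__":
    # The resulting dict would be merged into the other Hudi write options
    # (table name, record key field, and so on) before writing.
    print(hudi_index_options("bloom"))
```

Validating the name up front is only a convenience here: a misspelled index type fails at option-building time rather than inside the write job.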

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs an index to find and update the location of incoming records during write operations. Specifically, the index helps differentiate
+inserts from updates. This blog covers the different index types and when to use each of them.
+
+Hudi datasets are generally of two types, partitioned and non-partitioned. So, most index types have two implementations, one for partitioned datasets
+and another for non-partitioned datasets, called a global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory
+- Bloom
+- Simple
+- HBase
+
+You can use `hoodie.index.type` to choose any of these indices.
+
+## Different workloads
+Since data comes in at different volumes and velocities and has different access patterns, different indices suit different workloads.
+Let’s walk through some typical workloads and see how to leverage Hudi indices for such use-cases.
+
+### Fact table
+These are typically the primary tables in a dimensional model. They contain measures or quantitative figures and are used for analysis and decision making.
+For example, trip tables in ride-sharing, or records of users buying and selling shares, can be categorized as fact tables.
+These tables are usually ever-growing, with random updates on the most recent data and a long tail of updates to older data. In other words, most updates go into
+the latest partitions, with few updates going to older ones.
+
+![Fact table](/assets/images/blog/hudi-indexes/Hudi_Index_Blog_Fact_table.png)
+Figure showing the spread of updates for Fact table.
+
+The Hudi "BLOOM" index is the way to go for these kinds of tables, since the index look-up prunes a lot of data files. Effectively, the actual look-up
+happens only in the few data files where the records are most likely present. The bloom index also benefits use-cases where record
+keys have some kind of ordering (such as a timestamp) among them: file pruning cuts down the number of data files to look up, resulting in very fast look-up times.
+At a high level, the bloom index first prunes data files based on their key ranges, then performs a bloom filter look-up. Depending on the workload and the amount of data touched, this can
+result in a lot of shuffling. Hudi is planning to support [record level indexing](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets?src=contextnavpagetreemode)
 

Review comment:
       For now, I have added links to RFCs. If you prefer to link JIRAs, can you help me with the right links (for all of them)? I looked for a JIRA for secondary index and couldn't find one, hence the RFC links.
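
To make the look-up described in the quoted paragraph concrete (range pruning first, then a bloom filter check, then the actual key look-up), here is a minimal, self-contained sketch. It is a toy model under stated assumptions, not Hudi's implementation: `TinyBloom`, `DataFile`, `candidate_files` and `tag_record` are all hypothetical names, the files are in-memory sets of keys, and the bloom filter uses SHA-256 purely for simplicity.

```python
import hashlib
from dataclasses import dataclass, field


class TinyBloom:
    """A toy bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


@dataclass
class DataFile:
    """Stand-in for a data file that stores its key range and a bloom filter."""

    name: str
    keys: set
    min_key: str = field(init=False)
    max_key: str = field(init=False)
    bloom: TinyBloom = field(init=False)

    def __post_init__(self):
        self.min_key = min(self.keys)
        self.max_key = max(self.keys)
        self.bloom = TinyBloom()
        for key in self.keys:
            self.bloom.add(key)


def candidate_files(files, key):
    # Step 1: range pruning, keep only files whose [min, max] covers the key.
    in_range = [f for f in files if f.min_key <= key <= f.max_key]
    # Step 2: bloom filter look-up, drop files that definitely lack the key.
    return [f for f in in_range if f.bloom.might_contain(key)]


def tag_record(files, key):
    # Step 3: the actual look-up, done only in the surviving candidates.
    for f in candidate_files(files, key):
        if key in f.keys:  # stands in for reading keys from the file
            return ("update", f.name)
    return ("insert", None)


if __name__ == "__main__":
    files = [
        DataFile("f1", {"2020-11-01#a", "2020-11-02#b"}),
        DataFile("f2", {"2020-11-10#c", "2020-11-11#d"}),
    ]
    print(tag_record(files, "2020-11-10#c"))
    print(tag_record(files, "2020-11-20#z"))
```

With timestamp-prefixed keys, the range check alone rules out most files, which is the ordering benefit the paragraph describes. With randomly distributed keys, nearly every file's range would cover the incoming key and the bloom filter would carry most of the pruning.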



