[GitHub] [hudi] vinothchandar commented on a change in pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

GitBox Sun, 15 Nov 2020 18:19:06 -0800


vinothchandar commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r523850991




##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each 
of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records 
during write operations. Hoodie index is a very critical piece in Hoodie as it 
gives record level lookup support to Hudi for efficient write operations. This 
blog talks about different indices and when to use which one. 

Review comment:
       Please use "Apache Hudi" everywhere :)

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each 
of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records 
during write operations. Hoodie index is a very critical piece in Hoodie as it 
gives record level lookup support to Hudi for efficient write operations. This 
blog talks about different indices and when to use which one. 

Review comment:
       Add more motivation on why this is important from a use-case 
perspective. For example, an upstream database may be updated in random ways, 
and the downstream Hudi table needs to absorb those updates well. 

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each 
of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records 
during write operations. Hoodie index is a very critical piece in Hoodie as it 
gives record level lookup support to Hudi for efficient write operations. This 
blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and 
non-partitioned. So, most index has two implementations one for partitioned 
dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 

Review comment:
       It's also pluggable. We should mention that.

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each 
of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records 
during write operations. Hoodie index is a very critical piece in Hoodie as it 
gives record level lookup support to Hudi for efficient write operations. This 
blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and 
non-partitioned. So, most index has two implementations one for partitioned 
dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory

Review comment:
       This is not worth mentioning. It's just a test impl.

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each 
of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records 
during write operations. Hoodie index is a very critical piece in Hoodie as it 
gives record level lookup support to Hudi for efficient write operations. This 
blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and 
non-partitioned. So, most index has two implementations one for partitioned 
dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 
+
+You could use “hoodie.index.type” to choose any of these indices. 
+
+### 1.1 Motivation
+Different workloads have different access patterns. Hudi supports different 
indexing schemes to cater to the needs of different workloads. So depending on 
one’s use-case, indexing schema can be chosen. 
+
+For eg: ……. 
+To Be filled
+
+Let's take a brief look at each of these indices.
+
+## 2. InMemory
+Stores an in memory hashmap of records to location mapping. Intended to be 
used for local testing. 
+
+## 3. Bloom
+Leverages bloom index stored with data files to find the location for the 
incoming records. This is the most commonly used Index in Hudi and is the 
default one. On a high level, this does a range pruning followed by bloom look 
up. So, if the record keys are laid out such that it follows some type of 
ordering like timestamps, then this will essentially cut down a lot of files to 
be looked up as bloom would have filtered out most of the files. But Range 
pruning is optional depending on your use-case. If your write batch is such 
that the records have no ordering in them (e.g uuid), but the pattern is such 
that mostly the recent partitions are updated with a long tail of 
updates/deletes to the older partitions, then still bloom index will be faster. 
But better to turn off range pruning as it just incurs the cost of checking w/o 
much benefit. 
+
+For instance, consider a list of file slices in a partition
+
+F1 : key_t0 to key_t10000
+F2 : key_t10001 to key_t20000
+F3 : key_t20001 to key_t30000
+F4 : key_t30001 to key_t40000
+F5 : key_t40001 to key_t50000
+
+So, when looking up records ranging from key_t25000 to key_t28000, bloom will 
filter every file slice except F3 with range pruning. 
+
+Here is a high level pseudocode used for this bloom:

Review comment:
       This reads more like a list of steps than pseudocode. 
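
To make the range-pruning step from the quoted draft concrete, here is a 
minimal Python sketch of the F1..F5 example. It is only an illustration under 
stated assumptions: the names `FILE_RANGES`, `candidate_files` and 
`files_to_check` are hypothetical and are not Hudi APIs, and keys are modeled 
as plain integers (the numeric suffix of `key_t...`) rather than the string 
record keys a real table would hold. The bloom filter lookup that follows 
range pruning is not modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical (min_key, max_key) per file slice, mirroring F1..F5 in the draft.
# Keys are the integer suffixes of key_t<N>; this is an assumption for clarity.
FILE_RANGES: Dict[str, Tuple[int, int]] = {
    "F1": (0, 10000),
    "F2": (10001, 20000),
    "F3": (20001, 30000),
    "F4": (30001, 40000),
    "F5": (40001, 50000),
}


def candidate_files(
    keys: List[int], ranges: Dict[str, Tuple[int, int]]
) -> Dict[int, List[str]]:
    """Map each incoming key to the files whose [min, max] range contains it."""
    return {
        key: [name for name, (lo, hi) in ranges.items() if lo <= key <= hi]
        for key in keys
    }


def files_to_check(
    keys: List[int], ranges: Dict[str, Tuple[int, int]]
) -> List[str]:
    """Return the sorted set of files that survive range pruning for a batch."""
    survivors = set()
    for names in candidate_files(keys, ranges).values():
        survivors.update(names)
    return sorted(survivors)


if __name__ == "__main__":
    # Looking up keys between key_t25000 and key_t28000 only touches F3.
    print(files_to_check([25000, 26500, 28000], FILE_RANGES))
```

With ordered keys such as timestamps, a batch usually maps to very few files, 
so only those files would need a bloom filter check. With random keys such as 
UUIDs, most file ranges overlap most keys, which is why the draft suggests 
turning range pruning off in that case.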




