[GitHub] [hudi] prashantwason commented on a change in pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-13 Thread GitBox


prashantwason commented on a change in pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#discussion_r748700103



##
File path: rfc/rfc-37/rfc-37.md
##
@@ -0,0 +1,168 @@
+
+# RFC-37: Metadata based Bloom Index
+
+
+## Proposers
+
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains indices to locate/map incoming records to file groups during 
writes. Most commonly 
+used record index is the HoodieBloomIndex. For larger installations and for 
global index types, performance might be an issue
+due to loading of bloom from large number of data files and due to throttling 
issues with some of the cloud stores. We are proposing to 
+build a new Metadata index (metadata table based bloom index) to boost the 
performance of existing bloom index. 

Review comment:
   Why not use a record-level-index for this functionality? Is it because 
of storage requirements? Wouldn't a record level index be faster than using a 
bloom based index?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-13 Thread GitBox


prashantwason commented on a change in pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#discussion_r748699656



##
File path: rfc/rfc-37/rfc-37.md
##
@@ -0,0 +1,168 @@
+
+# RFC-37: Metadata based Bloom Index
+
+
+## Proposers
+
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains indices to locate/map incoming records to file groups during 
writes. Most commonly 
+used record index is the HoodieBloomIndex. For larger installations and for 
global index types, performance might be an issue
+due to loading of bloom from large number of data files and due to throttling 
issues with some of the cloud stores. We are proposing to 
+build a new Metadata index (metadata table based bloom index) to boost the 
performance of existing bloom index. 
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every 
write. This will assist Hudi in deterministically 
+routing records to a given file group and to distinguish inserts vs updates. 
This bloom index relies on (min, max) values 
+of records keys and bloom indexes in base file footers to find the actual 
record location. In this RFC, we plan to 
+build a new index on top of metadata table which to assist in bloom index 
based tagging. 
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of 
incoming records
+1. Load all interested partitions and fetch data files. 
+2. Find and filter files to keys mapping based on min max in data file footers.
+3. Filter files to keys mapping based on bloom index in data file footers. 
+4. Look up actual data files to find the right location of every incoming 
record.
+
+As we could see from step 1 and 2, we are in need of min and max values for 
"_hoodie_record_key" and bloom filter for 
+all data files to perform the tagging. In this design, we will add these to 
metadata table and the index lookup 
+will look into these metadata table partitions to deduce the file to keys 
mapping. 
+
+To realize this, we are adding two new partitions namely, `column_stats` and 
`bloom_filter` to metadata table.  
+
+Why metadata table: 
+Metadata table uses HFile to store and retrieve data. HFile is an indexed file 
format and supports random lookups based on 
+keys. Since, we will be storing stats/bloom for every file and the index will 
do lookups based on files, we should be able to 
+benefit from the faster lookups in HFile. 
+
+
+
+Following sections will talk about different partitions, key formats and then 
dive into the data and control flows.
+
+### Column_Stats partition:
+"Column_stats" will be discussed in depth in RFC-27, but in the interest of 
this RFC, Column_stats partition stores 
+statistics(min and max value) for `__hoodie_record_key` column for all files 
in the Hudi data table. 
+
+High level requirement for this column_stats partition are:
+Given a list of record keys, partition paths and file names, find the possibly 
matching file names based on
+`__hoodie_record_key` column stats. 
+
+To cater to this requirement, we need to ensure our keys in Hfile are such 
that we can do pointed lookups for a given data file.
+Below picture gives a pictorial representation of Column stats partition in 
metadata table. 
+
+
+
+We have to encode column names, filenames etc to IDs to save storage and to 
exploit compression. We will update the RFC 
+once we have more data around what kind of ID we can go with. On a high level, 
we are looking at incremental IDs vs 
+hash Ids. 
+
+For now, lets assume that every entity will be given an ID (column name, 
partition path name, file name) 
+
+```
+Key in column_stats partition =
+[colId][PartitionId][FileId]
+```
+```
+Value: stats  {
+  min_value: bytes
+  max_value: bytes
+  ...
+  ...
+}
+```
+
+### Bloom Filter Partition:
+This will assist in storing bloom filters for all base files in the data 
table. This will be leveraged by metadata 
+index being designed with this RFC.
+
+
+
+Requirements:
+Given a list of FileIDs, return their bloom filters
+```
+Key format: [PartitionId][FileId]

Review comment:
   Since fileId is UUID based, can we assume that fileIDs are unique within 
HUDI? If so, the partitionId is not required here.
   
   But prefixing with partitionID may lead to better perf as all the fileIDs 
for a partition will be together in same block. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-13 Thread GitBox


prashantwason commented on a change in pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#discussion_r748699656



##
File path: rfc/rfc-37/rfc-37.md
##
@@ -0,0 +1,168 @@
+
+# RFC-37: Metadata based Bloom Index
+
+
+## Proposers
+
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains indices to locate/map incoming records to file groups during 
writes. Most commonly 
+used record index is the HoodieBloomIndex. For larger installations and for 
global index types, performance might be an issue
+due to loading of bloom from large number of data files and due to throttling 
issues with some of the cloud stores. We are proposing to 
+build a new Metadata index (metadata table based bloom index) to boost the 
performance of existing bloom index. 
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every 
write. This will assist Hudi in deterministically 
+routing records to a given file group and to distinguish inserts vs updates. 
This bloom index relies on (min, max) values 
+of records keys and bloom indexes in base file footers to find the actual 
record location. In this RFC, we plan to 
+build a new index on top of metadata table which to assist in bloom index 
based tagging. 
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of 
incoming records
+1. Load all interested partitions and fetch data files. 
+2. Find and filter files to keys mapping based on min max in data file footers.
+3. Filter files to keys mapping based on bloom index in data file footers. 
+4. Look up actual data files to find the right location of every incoming 
record.
+
+As we could see from step 1 and 2, we are in need of min and max values for 
"_hoodie_record_key" and bloom filter for 
+all data files to perform the tagging. In this design, we will add these to 
metadata table and the index lookup 
+will look into these metadata table partitions to deduce the file to keys 
mapping. 
+
+To realize this, we are adding two new partitions namely, `column_stats` and 
`bloom_filter` to metadata table.  
+
+Why metadata table: 
+Metadata table uses HFile to store and retrieve data. HFile is an indexed file 
format and supports random lookups based on 
+keys. Since, we will be storing stats/bloom for every file and the index will 
do lookups based on files, we should be able to 
+benefit from the faster lookups in HFile. 
+
+
+
+Following sections will talk about different partitions, key formats and then 
dive into the data and control flows.
+
+### Column_Stats partition:
+"Column_stats" will be discussed in depth in RFC-27, but in the interest of 
this RFC, Column_stats partition stores 
+statistics(min and max value) for `__hoodie_record_key` column for all files 
in the Hudi data table. 
+
+High level requirement for this column_stats partition are:
+Given a list of record keys, partition paths and file names, find the possibly 
matching file names based on
+`__hoodie_record_key` column stats. 
+
+To cater to this requirement, we need to ensure our keys in Hfile are such 
that we can do pointed lookups for a given data file.
+Below picture gives a pictorial representation of Column stats partition in 
metadata table. 
+
+
+
+We have to encode column names, filenames etc to IDs to save storage and to 
exploit compression. We will update the RFC 
+once we have more data around what kind of ID we can go with. On a high level, 
we are looking at incremental IDs vs 
+hash Ids. 
+
+For now, lets assume that every entity will be given an ID (column name, 
partition path name, file name) 
+
+```
+Key in column_stats partition =
+[colId][PartitionId][FileId]
+```
+```
+Value: stats  {
+  min_value: bytes
+  max_value: bytes
+  ...
+  ...
+}
+```
+
+### Bloom Filter Partition:
+This will assist in storing bloom filters for all base files in the data 
table. This will be leveraged by metadata 
+index being designed with this RFC.
+
+
+
+Requirements:
+Given a list of FileIDs, return their bloom filters
+```
+Key format: [PartitionId][FileId]

Review comment:
   Since fileId is UUID based, can we assume that fileIDs are unique within 
HUDI? If so, the partitionId is not required here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-13 Thread GitBox


prashantwason commented on a change in pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#discussion_r748699577



##
File path: rfc/rfc-37/rfc-37.md
##
@@ -0,0 +1,168 @@
+
+# RFC-37: Metadata based Bloom Index
+
+
+## Proposers
+
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains indices to locate/map incoming records to file groups during 
writes. Most commonly 
+used record index is the HoodieBloomIndex. For larger installations and for 
global index types, performance might be an issue
+due to loading of bloom from large number of data files and due to throttling 
issues with some of the cloud stores. We are proposing to 
+build a new Metadata index (metadata table based bloom index) to boost the 
performance of existing bloom index. 
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every 
write. This will assist Hudi in deterministically 
+routing records to a given file group and to distinguish inserts vs updates. 
This bloom index relies on (min, max) values 
+of records keys and bloom indexes in base file footers to find the actual 
record location. In this RFC, we plan to 
+build a new index on top of metadata table which to assist in bloom index 
based tagging. 
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of 
incoming records
+1. Load all interested partitions and fetch data files. 
+2. Find and filter files to keys mapping based on min max in data file footers.
+3. Filter files to keys mapping based on bloom index in data file footers. 
+4. Look up actual data files to find the right location of every incoming 
record.
+
+As we could see from step 1 and 2, we are in need of min and max values for 
"_hoodie_record_key" and bloom filter for 
+all data files to perform the tagging. In this design, we will add these to 
metadata table and the index lookup 
+will look into these metadata table partitions to deduce the file to keys 
mapping. 
+
+To realize this, we are adding two new partitions namely, `column_stats` and 
`bloom_filter` to metadata table.  
+
+Why metadata table: 
+Metadata table uses HFile to store and retrieve data. HFile is an indexed file 
format and supports random lookups based on 
+keys. Since, we will be storing stats/bloom for every file and the index will 
do lookups based on files, we should be able to 
+benefit from the faster lookups in HFile. 
+
+
+
+Following sections will talk about different partitions, key formats and then 
dive into the data and control flows.
+
+### Column_Stats partition:
+"Column_stats" will be discussed in depth in RFC-27, but in the interest of 
this RFC, Column_stats partition stores 
+statistics(min and max value) for `__hoodie_record_key` column for all files 
in the Hudi data table. 
+
+High level requirement for this column_stats partition are:
+Given a list of record keys, partition paths and file names, find the possibly 
matching file names based on
+`__hoodie_record_key` column stats. 
+
+To cater to this requirement, we need to ensure our keys in Hfile are such 
that we can do pointed lookups for a given data file.
+Below picture gives a pictorial representation of Column stats partition in 
metadata table. 
+
+
+
+We have to encode column names, filenames etc to IDs to save storage and to 
exploit compression. We will update the RFC 

Review comment:
   encoding column names, partition names may not be required as HFile 
compresses blocks of key-value data. So repeated string of column names, etc 
will compress well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org