yihua commented on code in PR #9775:
URL: https://github.com/apache/hudi/pull/9775#discussion_r1337519296


##########
rfc/rfc-8/rfc-8.md:
##########
@@ -0,0 +1,209 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-8: Metadata based Record Index
+
+## Proposers
+- @prashantwason
+
+## Approvers
+
+
+## Status
+JIRA: https://issues.apache.org/jira/browse/HUDI-53
+
+
+## Abstract
+HUDI requires an [Index](https://hudi.apache.org/docs/indexing) during updates to locate the existing records by their 
+unique record keys. The HUDI Index saves a mapping of the record key to the record's file path. Hudi supports several indexes 
+like:
+ 1. Bloom Index: Employs bloom filters built out of the record keys, 
optionally also pruning candidate files using record key ranges.
+ 2. Simple Index (default): Performs a lean join of the incoming update/delete 
records against keys extracted from the table on storage.
+ 3. HBase Index: Manages the index mapping in an external Apache HBase table.
+
+We are proposing a new Index called Record Index which will save the record-key to file-path mapping within the 
+[HUDI Metadata Table](https://hudi.apache.org/docs/metadata). Since the HUDI Metadata Table is internal to a HUDI Dataset, 
+the Record Index is updated and queried using the resources already available to the HUDI dataset.
+
+
+## Justification
+
+Bloom and Simple Index are slow for large datasets because they incur high costs gathering the index data from various
+data files at lookup time. Furthermore, these indexes do not save a one-to-one record-key to record file path mapping but
+deduce the mapping via an optimized search at lookup time. The per-file overhead of these indexes means that they do not 
+work well for datasets with a large number of files or records.
+
+The HBase Index saves a one-to-one mapping for each record key, so it is very fast and scales with the dataset size. But the HBase 
+Index requires a separate HBase cluster to be maintained. HBase is operationally difficult to maintain and scale for throughput, 
+and requires dedicated resources and expertise.
+
+The Record Index will provide the speed and scalability of the HBase Index without its limitations and overhead. Since 
+the HUDI Metadata Table is a HUDI Table, all future performance improvements in writes and queries will automatically 
+benefit the Record Index.
+
+## Design
+Record Index will save the record-key to file path mapping in a new partition within the HUDI Metadata Table. The Metadata Table
+uses HBase HFile, a tree-map-like file format, to store and retrieve data. HFile is an indexed file format
+and supports fast, map-like lookups by key. Since we will be storing a mapping for every single record key, Record Index
+lookups for a large number of keys transform into direct key lookups in the HUDI Metadata Table and should 
+benefit greatly from the fast lookups in HFile.
+
+<img src="metadata_index_1.png" alt="High Level Metadata Index Design" 
width="480"/>
+
+
+### Metadata Table partitioning and schema:
+
+A new partition `record_index` will be added under the metadata table. The 
existing metadata table payload schema will
+be extended and shared for this partition also. The type field will be used to 
detect the record_index payload record.
+Here is the schema for the record_index payload record.
+```
+    {
+        "name": "recordIndexMetadata",
+        "doc": "Metadata Index that contains information about record keys and 
their location in the dataset",
+        "type": [
+            "null",
+             {
+               "type": "record",
+               "name": "HoodieRecordIndexInfo",
+                "fields": [
+                    {
+                        "name": "partition",
+                        "type": "string",
+                        "doc": "Partition which contains the record",
+                        "avro.java.string": "String"
+                    },
+                    {
+                        "name": "fileIdHighBits",
+                        "type": "long",
+                        "doc": "fileId which contains the record (high 64 
bits)"
+                    },
+                    {
+                        "name": "fileIdLowBits",
+                        "type": "long",
+                        "doc": "fileId which contains the record (low 64 bits)"
+                    },
+                    {
+                        "name": "fileIndex",
+                        "type": "int",
+                        "doc": "index of the file"
+                    },
+                    {
+                        "name": "instantTime",
+                        "type": "long",
+                        "doc": "Epoch time in millisecond at which record was 
added"
+                    }
+                ]
+            }
+        ],
+        "default" : null
+    }
+```
+
+The key for the record index record would be the actual key from the record. The partition name is also saved as a string.
+HUDI base file names have a format which includes a UUID fileID, an integer file index, a write token and a timestamp. 
+The record index payload only saves the fileID and file index information. The fileID is split into the UUID and the integer file index: the UUID is encoded into two longs and the file index is saved
+as an integer. The timestamp is encoded into epoch time in milliseconds.
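
As an illustration of the encoding described above, the following is a minimal Java sketch, not the implementation from the PR. It assumes a fileID of the shape `<uuid>-<fileIndex>`; the class name `FileIdCodec` and its members are hypothetical and only mirror the `fileIdHighBits`, `fileIdLowBits` and `fileIndex` fields of the schema.

```java
import java.util.UUID;

// Hypothetical helper for illustration only; not Hudi's actual code.
// Assumes the fileID looks like "<uuid>-<fileIndex>".
final class FileIdCodec {
    final long highBits;  // corresponds to fileIdHighBits in the schema
    final long lowBits;   // corresponds to fileIdLowBits in the schema
    final int fileIndex;  // corresponds to fileIndex in the schema

    private FileIdCodec(long highBits, long lowBits, int fileIndex) {
        this.highBits = highBits;
        this.lowBits = lowBits;
        this.fileIndex = fileIndex;
    }

    // Split the fileID at the last '-' into the UUID part and the integer index,
    // then store the 128-bit UUID as two 64-bit longs.
    static FileIdCodec encode(String fileId) {
        int sep = fileId.lastIndexOf('-');
        UUID uuid = UUID.fromString(fileId.substring(0, sep));
        int index = Integer.parseInt(fileId.substring(sep + 1));
        return new FileIdCodec(uuid.getMostSignificantBits(),
                               uuid.getLeastSignificantBits(), index);
    }

    // Rebuild the textual fileID from the two longs and the index.
    String decode() {
        return new UUID(highBits, lowBits) + "-" + fileIndex;
    }

    public static void main(String[] args) {
        String fileId = "a1b2c3d4-e5f6-7890-abcd-ef1234567890-0"; // made-up example value
        FileIdCodec encoded = encode(fileId);
        System.out.println(encoded.decode().equals(fileId));
    }
}
```

Two longs and an int take 20 bytes before any serialization overhead, compared with roughly 38 characters for the textual form used in this example.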
+
+This schema format is chosen to minimize the data size of each mapping to 
ensure the smallest possible size of the 
+record index even for datasets with billions of records. 
+
+Experiments have shown that with random UUID record keys and datestr partitions (YYYY/MM/DD), we can achieve an average
+size of 50 to 55 bytes per mapping saved in the record index. The size might be even lower for keys that compress better.
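
To put the per-mapping figure in perspective, here is a rough Java sketch of the arithmetic. It assumes the total size grows linearly with the record count, and the one-billion-record dataset is an assumed example, not a measurement from the RFC. The class name `RecordIndexSizeEstimate` is hypothetical.

```java
// Hypothetical back-of-the-envelope estimator; not part of Hudi.
final class RecordIndexSizeEstimate {

    // Total bytes = number of records * average bytes per mapping.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long estimateBytes(long recordCount, int bytesPerMapping) {
        return Math.multiplyExact(recordCount, (long) bytesPerMapping);
    }

    public static void main(String[] args) {
        long records = 1_000_000_000L;          // assumed dataset size
        long low = estimateBytes(records, 50);  // lower bound quoted above
        long high = estimateBytes(records, 55); // upper bound quoted above
        System.out.printf("%d records: %.1f to %.1f GB%n",
                records, low / 1e9, high / 1e9);
    }
}
```

Under these assumptions, a billion records would need on the order of 50 to 55 GB of record index data.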
+
+<img src="metadata_index_bloom_partition.png" alt="Bloom filter partition" 
width="500"/>
+
+The picture below shows the record index partition in the metadata table.
+<img src="metadata_index_col_stats.png" alt="Column Stats Partition" 
width="480"/>

Review Comment:
   The image here shows the column stats partition, not the record index.


