danny0405 commented on code in PR #13095:
URL: https://github.com/apache/hudi/pull/13095#discussion_r2094712223


##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,223 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-92: Support Bitmap Index
+
+## Proposers
+
+- @CTTY
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: [HUDI-9048](https://issues.apache.org/jira/browse/HUDI-9048)
+
+## Abstract
+
+Apache Hudi is actively expanding its support for different indexing 
strategies to optimize query performance and data retrieval efficiency. 
+However, bitmap indexes—widely recognized for their effectiveness in filtering 
low-cardinality columns—are not yet supported. 
+This project proposes the integration of bitmap indexing into Hudi’s indexing 
framework. 
+Bitmap indexes offer compact storage and fast bitwise operations, making them 
particularly well-suited for analytical workloads where predicates involve 
columns with a limited number of distinct values. 
+By introducing bitmap index support, we aim to enhance Hudi’s performance for 
a broader set of query patterns, especially in use cases with filtering on 
categorical or boolean fields.
+
+## Background
+
+A bitmap index is a specialized indexing technique that enhances query 
performance, particularly for columns with low cardinality (few distinct 
values). 
+Instead of storing row pointers like traditional indexes, it represents data 
as bitmaps, where each distinct value in a column has a corresponding bit 
vector indicating the presence of that value in different rows.
+The main advantage of a bitmap index is its ability to perform fast bitwise 
operations, which allow for quick filtering and combination of multiple 
conditions.
+<p align="center">
+<img src="./bitmap_example.png" width="610" height="550" />
+</p>
+
+In Hudi, bitmap indexes can provide significant performance benefits by 
helping to skip unnecessary files during query execution. 
+Since Hudi organizes data into files and partitions, a bitmap index can track 
which files contain relevant values, allowing the query engine to efficiently 
prune irrelevant files before scanning. 
+Additionally, bitmap indexes enable bitmap joins, where bitwise operations 
quickly determine matching records across datasets without performing costly 
row-by-row comparisons.
+
+## Design
+
+### Bitmap Metadata Structure
+Bitmap indexes for all columns will be stored in the `bitmap_index` partition 
of the table's Hudi metadata table. 
(`<table_path>/.hoodie/metadata/bitmap_index`)
+
+To support bitmap indexing, we plan to introduce a new type of metadata record 
along with a new metadata field, `BitmapIndexMetadata`.
+This field will store the serialized and encoded bitmap string directly.
+Each record's key will follow the format: 
`<column_name>$<column_value>$<file_group_id>`,
+where file_group_id is defined as `<partition_path>$<file_id>`.
+```
+store_type$grocery$20250401$2aa10971-11b2-41b3-9baf-ab33c0000144-0
+```
+
+For non-partitioned tables, we use `.` as a placeholder for the partition 
path. In this case, the key would be:
+```
+store_type$grocery$.$2aa10971-11b2-41b3-9baf-ab33c0000144-0
+```
+
+The index key can be constructed on-the-fly by concatenating strings during 
both read and write operations.
+Since the reader must know at least the column name and column value, 
searching bitmaps by key prefix is also supported.
+
+Below is another example of how the index record looks like in the storage.
+![bitmap_payload](./bitmap_payload.png)
+
+### Maintaining the Index
+Just like other existing Hudi indexes, the logic for writing the bitmap index 
will be integrated with Hudi's metadata writer during commit operations.
+Bitmap index maintenance during writes typically falls into the following 
categories: **Initialize**, **Insert**, **Update**, and **Delete**.
+Before diving into each category, let’s define two bitmap meta operations that 
will be referenced frequently:
+
+**addBitmapPosition**: 
+1. For a given record, extract its indexed column name-value pairs and file 
group ID to construct the bitmap key.
+2. Retrieve the existing bitmap using the key, or initialize an empty one if 
it doesn't exist.
+3. Add the record’s position to the bitmap.
+4. Write the bitmap record `<bitmap key, updated bitmap>` to the 
`bitmap_index` partition in the metadata table
+
+**removeBitmapPosition**:
+1. For a given record, extract its indexed column name-value pairs and file 
group ID to construct the bitmap key.
+2. Retrieve the existing bitmap using the key; if it doesn’t exist, the 
operation becomes a no-op.
+3. Remove the record’s position from the bitmap.
+4. Write the bitmap record `<bitmap key, updated bitmap>` to the 
`bitmap_index` partition in the metadata table
+
+> Note: We don’t need an operation to explicitly delete outdated bitmap 
records.
+Since the bitmap key remains unchanged, the new record will overwrite the 
previous one.
+#### Initialize
+When initializing the bitmap index, the metadata writer scans all file groups 
to collect `existing_records`.
+The bitmap records will be written into the `bitmap_index` partition of the 
table's Hudi metadata table.
+Pseudocode:
+```
+for record in existing_records:
+  addBitmapPosition(record)
+```
+#### Insert
+The commit metadata for inserts includes files affected by new records.
+The metadata writer extracts `inserted_records` from those files and updates 
the bitmap index accordingly.
+Pseudocode:
+```
+for record in inserted_records:
+  addBitmapPosition(record)
+```
+#### Update
+Update commit metadata includes files impacted by updated records.

Review Comment:
   It lookes like the update strategy is very similiar with RLI, may consider 
some code/workload resusing from there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to