danny0405 commented on code in PR #13095:
URL: https://github.com/apache/hudi/pull/13095#discussion_r2082747617


##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,223 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-92: Support Bitmap Index
+
+## Proposers
+
+- @CTTY
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: [HUDI-9048](https://issues.apache.org/jira/browse/HUDI-9048)
+
+## Abstract
+
+Apache Hudi is actively expanding its support for different indexing 
strategies to optimize query performance and data retrieval efficiency. 
+However, bitmap indexes—widely recognized for their effectiveness in filtering 
low-cardinality columns—are not yet supported. 
+This project proposes the integration of bitmap indexing into Hudi’s indexing 
framework. 
+Bitmap indexes offer compact storage and fast bitwise operations, making them 
particularly well-suited for analytical workloads where predicates involve 
columns with a limited number of distinct values. 
+By introducing bitmap index support, we aim to enhance Hudi’s performance for 
a broader set of query patterns, especially in use cases with filtering on 
categorical or boolean fields.
+
+## Background
+
+A bitmap index is a specialized indexing technique that enhances query 
performance, particularly for columns with low cardinality (few distinct 
values). 
+Instead of storing row pointers like traditional indexes, it represents data 
as bitmaps, where each distinct value in a column has a corresponding bit 
vector indicating the presence of that value in different rows.
+The main advantage of a bitmap index is its ability to perform fast bitwise 
operations, which allow for quick filtering and combination of multiple 
conditions.
+<p align="center">
+<img src="./bitmap_example.png" width="610" height="550" />
+</p>
+
+In Hudi, bitmap indexes can provide significant performance benefits by 
helping to skip unnecessary files during query execution. 
+Since Hudi organizes data into files and partitions, a bitmap index can track 
which files contain relevant values, allowing the query engine to efficiently 
prune irrelevant files before scanning. 
+Additionally, bitmap indexes enable bitmap joins, where bitwise operations 
quickly determine matching records across datasets without performing costly 
row-by-row comparisons.
+
+## Design
+
+### Bitmap Metadata Structure
+Bitmap indexes for all columns will be stored in the `bitmap_index` partition 
of the table's Hudi metadata table. 
(`<table_path>/.hoodie/metadata/bitmap_index`)
+
+To support bitmap indexing, we plan to introduce a new type of metadata record 
along with a new metadata field, `BitmapIndexMetadata`.
+This field will store the serialized and encoded bitmap string directly.
+Each record's key will follow the format: 
`<column_name>$<column_value>$<file_group_id>`,

Review Comment:
   This format is modeled on the secondary index, right? If not, could you 
elaborate a little more on why this key format was chosen?
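
   To make the question concrete, here is a minimal sketch of how a key in 
this format could be composed, and how all entries for one column value could 
be found by prefix. It is purely illustrative: the helper names `build_key` 
and `lookup_prefix` and the sample file group IDs are hypothetical and not 
Hudi APIs, and it assumes keys are stored in lexicographic order.

```python
SEPARATOR = "$"  # separator taken from the key format quoted above


def build_key(column_name, column_value, file_group_id):
    """Join the three parts in the order the RFC draft gives (hypothetical helper)."""
    return SEPARATOR.join([column_name, str(column_value), file_group_id])


def lookup_prefix(column_name, column_value):
    """Prefix shared by every key of one (column, value) pair (hypothetical helper)."""
    return SEPARATOR.join([column_name, str(column_value)]) + SEPARATOR


# Made-up sample entries: one key per (column value, file group) pair.
keys = sorted(
    build_key("status", value, file_group)
    for value, file_group in [("active", "fg-2"), ("active", "fg-1"), ("closed", "fg-1")]
)

# A predicate such as status = 'active' then becomes a prefix match over sorted keys.
matching = [k for k in keys if k.startswith(lookup_prefix("status", "active"))]
print(matching)
```

   Note that the sketch does not handle a column value that itself contains 
`$`, which would make the key ambiguous to split.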



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.