danny0405 commented on code in PR #13095: URL: https://github.com/apache/hudi/pull/13095#discussion_r2094707825
########## rfc/rfc-92/rfc-92.md: ########## @@ -0,0 +1,223 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-92: Support Bitmap Index + +## Proposers + +- @CTTY + +## Approvers + +- @yihua + +## Status + +JIRA: [HUDI-9048](https://issues.apache.org/jira/browse/HUDI-9048) + +## Abstract + +Apache Hudi is actively expanding its support for different indexing strategies to optimize query performance and data retrieval efficiency. +However, bitmap indexes—widely recognized for their effectiveness in filtering low-cardinality columns—are not yet supported. +This project proposes the integration of bitmap indexing into Hudi’s indexing framework. +Bitmap indexes offer compact storage and fast bitwise operations, making them particularly well-suited for analytical workloads where predicates involve columns with a limited number of distinct values. +By introducing bitmap index support, we aim to enhance Hudi’s performance for a broader set of query patterns, especially in use cases with filtering on categorical or boolean fields. + +## Background + +A bitmap index is a specialized indexing technique that enhances query performance, particularly for columns with low cardinality (few distinct values). +Instead of storing row pointers like traditional indexes, it represents data as bitmaps, where each distinct value in a column has a corresponding bit vector indicating the presence of that value in different rows. +The main advantage of a bitmap index is its ability to perform fast bitwise operations, which allow for quick filtering and combination of multiple conditions. +<p align="center"> +<img src="./bitmap_example.png" width="610" height="550" /> +</p> + +In Hudi, bitmap indexes can provide significant performance benefits by helping to skip unnecessary files during query execution. +Since Hudi organizes data into files and partitions, a bitmap index can track which files contain relevant values, allowing the query engine to efficiently prune irrelevant files before scanning. +Additionally, bitmap indexes enable bitmap joins, where bitwise operations quickly determine matching records across datasets without performing costly row-by-row comparisons. + +## Design + +### Bitmap Metadata Structure +Bitmap indexes for all columns will be stored in the `bitmap_index` partition of the table's Hudi metadata table. (`<table_path>/.hoodie/metadata/bitmap_index`) + +To support bitmap indexing, we plan to introduce a new type of metadata record along with a new metadata field, `BitmapIndexMetadata`. +This field will store the serialized and encoded bitmap string directly. +Each record's key will follow the format: `<column_name>$<column_value>$<file_group_id>`, +where file_group_id is defined as `<partition_path>$<file_id>`. +``` +store_type$grocery$20250401$2aa10971-11b2-41b3-9baf-ab33c0000144-0 Review Comment: do we need to store the partition path? the file id is globally unique already. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
