prasannarajaperumal commented on code in PR #5370:
URL: https://github.com/apache/hudi/pull/5370#discussion_r972656384


##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in a HUDI table. However, most of them may not
+match the query predicate, even after using column statistics in the metadata table, row-group-level or
+page-level statistics in parquet files, etc.
+
+The total data size of the touched blocks determines the query speed, and saving IO has become
+the key to improving query performance.
+
+## <a id='background'>Background</a>
+Much work has been done to optimize reading the parquet files of HUDI tables.
+
+Since Spark 3.2.0, with the power of the parquet column index, page-level statistics can be used
+to filter data, and the process of reading data can be described as follows (<a id='process-a'>Process A</a>):
+- Step 1: Compare the middle position of each row group against the task split info
+   to decide which row groups should be handled by the current task. If a row group's middle
+   position is contained in the task split, that row group is handled by this task
+- Step 2: Use pushed-down predicates and row-group-level column statistics to pick out matched
+   row groups
+- Step 3: Filter pages by page-level statistics for each column predicate, then get the matched row id set
+for every column independently
+- Step 4: Get the final matched row id ranges by combining all columns' matched rows, then get the final matched
+pages for every column
+- Step 5: Load and decompress the matched pages for every requested column
+- Step 6: Read data by the matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page-level statistics can greatly reduce IO cost, some irrelevant data is still read out.
+
+We need a way to fetch exactly the row data required, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of the secondary index contains the following five layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/... secondary indexes
+2. Optimizer layer: picks the best physical/logical plan for a query using RBO/CBO/HBO, etc.
+3. Standard API interface layer: provides standard interfaces for the upper layers to invoke, such as ``createIndex``,
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary index implementations for users to choose from,
+   such as HBase-based, Lucene-based, B+ tree-based, etc.
+5. Index Implementation layer: provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing the secondary index, let's take a look at the Record Level Index. Although both indexes
+can filter out useless data blocks, there are still many differences between them.
+
+At present, the record level index in hudi
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in the write path.
+The secondary index structure will be used for query acceleration in the read path, but not in the write path.
+
+If the Record Level Index is applied in the read path for a query with a RecordKey predicate, it can only filter at the file group level,
+while the secondary index can provide the exact matched set of rows.
+
+For more details about the current implementation of the record level index, please refer to

Review Comment:
   I am not sure why the index is dependent on the read or write path.
   The record level index is a primary index, and what you propose here is a secondary index.
   We should use the primary index in the read path when the filter is a point query on a uuid:
   `select c1,c2 from t where _row_key='uuid1'`
   We should use the secondary index in the write path when the filter is on the indexed column:
   `update t set c1=5 where c2 = 20`
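
   To make this concrete, here is a minimal illustrative sketch (not Hudi's actual API; the class, the `SECONDARY_INDEXED` registry, and `pickIndex` are all hypothetical names) of the idea that index choice follows the predicate column, not whether the statement is a read or a write:

   ```java
   import java.util.Set;

   // Hypothetical sketch: route a predicate column to an index kind.
   // The point: index choice depends on the predicate, not on whether
   // the statement is a read or a write.
   public class IndexSelectionSketch {
       enum IndexKind { PRIMARY, SECONDARY, NONE }

       // Hypothetical registry of columns covered by a secondary index.
       static final Set<String> SECONDARY_INDEXED = Set.of("c2");

       static IndexKind pickIndex(String predicateColumn) {
           if ("_row_key".equals(predicateColumn)) {
               return IndexKind.PRIMARY;   // e.g. select ... where _row_key='uuid1'
           }
           if (SECONDARY_INDEXED.contains(predicateColumn)) {
               return IndexKind.SECONDARY; // e.g. update ... where c2 = 20
           }
           return IndexKind.NONE;
       }

       public static void main(String[] args) {
           System.out.println(pickIndex("_row_key")); // PRIMARY
           System.out.println(pickIndex("c2"));       // SECONDARY
           System.out.println(pickIndex("c9"));       // NONE
       }
   }
   ```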



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in a HUDI table. However, most of them may not
+match the query predicate, even after using column statistics in the metadata table, row-group-level or
+page-level statistics in parquet files, etc.
+
+The total data size of the touched blocks determines the query speed, and saving IO has become
+the key to improving query performance.
+
+## <a id='background'>Background</a>
+Much work has been done to optimize reading the parquet files of HUDI tables.
+
+Since Spark 3.2.0, with the power of the parquet column index, page-level statistics can be used
+to filter data, and the process of reading data can be described as follows (<a id='process-a'>Process A</a>):
+- Step 1: Compare the middle position of each row group against the task split info
+   to decide which row groups should be handled by the current task. If a row group's middle
+   position is contained in the task split, that row group is handled by this task
+- Step 2: Use pushed-down predicates and row-group-level column statistics to pick out matched
+   row groups
+- Step 3: Filter pages by page-level statistics for each column predicate, then get the matched row id set
+for every column independently
+- Step 4: Get the final matched row id ranges by combining all columns' matched rows, then get the final matched
+pages for every column
+- Step 5: Load and decompress the matched pages for every requested column
+- Step 6: Read data by the matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page-level statistics can greatly reduce IO cost, some irrelevant data is still read out.
+
+We need a way to fetch exactly the row data required, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of the secondary index contains the following five layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/... secondary indexes
+2. Optimizer layer: picks the best physical/logical plan for a query using RBO/CBO/HBO, etc.
+3. Standard API interface layer: provides standard interfaces for the upper layers to invoke, such as ``createIndex``,
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary index implementations for users to choose from,
+   such as HBase-based, Lucene-based, B+ tree-based, etc.
+5. Index Implementation layer: provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing the secondary index, let's take a look at the Record Level Index. Although both indexes
+can filter out useless data blocks, there are still many differences between them.
+
+At present, the record level index in hudi
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in the write path.
+The secondary index structure will be used for query acceleration in the read path, but not in the write path.
+
+If the Record Level Index is applied in the read path for a query with a RecordKey predicate, it can only filter at the file group level,
+while the secondary index can provide the exact matched set of rows.
+
+For more details about the current implementation of the record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index-related SQL (Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For ease of implementation, we can implement the first phase based on RBO (rule-based optimizer),

Review Comment:
   We can introduce a separate section in the RFC to explain that the implementation is going to proceed in phases. But here I would want us to think generically about how this would plug into the optimization layer of Spark/Flink.
   
   My thoughts around this:
   RBO for index selection can be very misleading for query performance - especially when combined with column-level stats IO skipping. We can do a simple hint-based approach to always use a specific index when available, or else implement it using CBO.
   
   The way I am thinking about a [cascades](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.9460)-based optimizer is to generate a memo of equivalent query fragments (with direct scans, and using all the indexes possible), each with a cost, and run it by the cost-based optimizer to pick the right plan.
   
   What do you think?
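
   The memo idea could be sketched roughly as follows; `Alternative`, `cheapest`, the access-path names, and the cost numbers are all made up for illustration, not part of any real optimizer:

   ```java
   import java.util.Comparator;
   import java.util.List;

   // Illustrative sketch of the memo idea: enumerate equivalent access
   // paths for one query fragment (direct scan plus every applicable
   // index), attach an estimated cost to each, and let the cost-based
   // optimizer pick the cheapest.
   public class MemoSketch {
       record Alternative(String accessPath, double estimatedCost) {}

       static Alternative cheapest(List<Alternative> memo) {
           return memo.stream()
                   .min(Comparator.comparingDouble(Alternative::estimatedCost))
                   .orElseThrow();
       }

       public static void main(String[] args) {
           // One memo group: three equivalent ways to produce the same rows.
           List<Alternative> memo = List.of(
                   new Alternative("full-scan", 1000.0),
                   new Alternative("secondary-index(c2)", 40.0),
                   new Alternative("column-stats-skipping", 300.0));
           System.out.println(cheapest(memo).accessPath()); // secondary-index(c2)
       }
   }
   ```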



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in a HUDI table. However, most of them may not
+match the query predicate, even after using column statistics in the metadata table, row-group-level or
+page-level statistics in parquet files, etc.
+
+The total data size of the touched blocks determines the query speed, and saving IO has become
+the key to improving query performance.
+
+## <a id='background'>Background</a>
+Much work has been done to optimize reading the parquet files of HUDI tables.
+
+Since Spark 3.2.0, with the power of the parquet column index, page-level statistics can be used
+to filter data, and the process of reading data can be described as follows (<a id='process-a'>Process A</a>):
+- Step 1: Compare the middle position of each row group against the task split info
+   to decide which row groups should be handled by the current task. If a row group's middle
+   position is contained in the task split, that row group is handled by this task
+- Step 2: Use pushed-down predicates and row-group-level column statistics to pick out matched
+   row groups
+- Step 3: Filter pages by page-level statistics for each column predicate, then get the matched row id set
+for every column independently
+- Step 4: Get the final matched row id ranges by combining all columns' matched rows, then get the final matched
+pages for every column
+- Step 5: Load and decompress the matched pages for every requested column
+- Step 6: Read data by the matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page-level statistics can greatly reduce IO cost, some irrelevant data is still read out.
+
+We need a way to fetch exactly the row data required, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of the secondary index contains the following five layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/... secondary indexes

Review Comment:
   Best to call this out in the scope of the RFC



##########
rfc/rfc-52/rfc-52.md:
##########
@@ -0,0 +1,284 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-52: Introduce Secondary Index to Improve HUDI Query Performance
+
+## Proposers
+
+- @huberylee
+- @hujincalrin
+- @XuQianJin-Stars
+- @YuweiXiao
+- @stream2000
+
+## Approvers
+ - @vinothchandar
+ - @xushiyan
+ - @leesf
+
+## Status
+
+JIRA: [HUDI-3907](https://issues.apache.org/jira/browse/HUDI-3907)
+
+Documentation Navigation
+- [Abstract](#abstract)
+- [Background](#background)
+- [Insufficiency](#insufficiency)
+- [Architecture](#architecture)
+- [Differences between Secondary Index and HUDI Record Level Index](#difference)
+- [Implementation](#implementation)
+  - [SQL Layer](#impl-sql-layer)
+  - [Optimizer Layer](#impl-optimizer-layer)
+  - [Standard API Layer](#impl-api-layer)
+  - [Index Implementation Layer](#imple-index-layer)
+    - [KV Mapping](#impl-index-layer-kv-mapping)
+    - [Build Index](#impl-index-layer-build-index)
+    - [Read Index](#impl-index-layer-read-index)
+    - [Index Management](#index-management)
+- [Lucene Secondary Index Implementation](#lucene-secondary-index-impl)
+  - [Inverted Index](#lucene-inverted-index)
+  - [Index Generation](#lucene-index-generation)
+  - [Query by Lucene Index](#query-by-lucene-index)
+
+
+## <a id='abstract'>Abstract</a>
+In query processing, we need to scan many data blocks in a HUDI table. However, most of them may not
+match the query predicate, even after using column statistics in the metadata table, row-group-level or
+page-level statistics in parquet files, etc.
+
+The total data size of the touched blocks determines the query speed, and saving IO has become
+the key to improving query performance.
+
+## <a id='background'>Background</a>
+Much work has been done to optimize reading the parquet files of HUDI tables.
+
+Since Spark 3.2.0, with the power of the parquet column index, page-level statistics can be used
+to filter data, and the process of reading data can be described as follows (<a id='process-a'>Process A</a>):
+- Step 1: Compare the middle position of each row group against the task split info
+   to decide which row groups should be handled by the current task. If a row group's middle
+   position is contained in the task split, that row group is handled by this task
+- Step 2: Use pushed-down predicates and row-group-level column statistics to pick out matched
+   row groups
+- Step 3: Filter pages by page-level statistics for each column predicate, then get the matched row id set
+for every column independently
+- Step 4: Get the final matched row id ranges by combining all columns' matched rows, then get the final matched
+pages for every column
+- Step 5: Load and decompress the matched pages for every requested column
+- Step 6: Read data by the matched row id ranges
+
+![](filter-by-page-statistics.jpg)
+
+
+## <a id='insufficiency'>Insufficiency</a>
+Although page-level statistics can greatly reduce IO cost, some irrelevant data is still read out.
+
+We need a way to fetch exactly the row data required, minimizing the number of blocks read.
+Thus, we propose a **Secondary Index** structure that reads only the rows we care about to
+speed up query performance.
+
+## <a id='architecture'>Architecture</a>
+The main structure of the secondary index contains the following five layers:
+1. SQL Parser layer: SQL commands for users to create/drop/alter/show/... secondary indexes
+2. Optimizer layer: picks the best physical/logical plan for a query using RBO/CBO/HBO, etc.
+3. Standard API interface layer: provides standard interfaces for the upper layers to invoke, such as ``createIndex``,
+``getRowIdSet`` and so on
+4. IndexManager Factory layer: many kinds of secondary index implementations for users to choose from,
+   such as HBase-based, Lucene-based, B+ tree-based, etc.
+5. Index Implementation layer: provides the ability to read, write and manage the underlying index
+
+![](architecture.jpg)
+
+
+## <a id='difference'>Differences between Secondary Index and HUDI Record Level Index</a>
+Before discussing the secondary index, let's take a look at the Record Level Index. Although both indexes
+can filter out useless data blocks, there are still many differences between them.
+
+At present, the record level index in hudi
+([RFC-08](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets), ongoing)
+is mainly implemented for ``tagLocation`` in the write path.
+The secondary index structure will be used for query acceleration in the read path, but not in the write path.
+
+If the Record Level Index is applied in the read path for a query with a RecordKey predicate, it can only filter at the file group level,
+while the secondary index can provide the exact matched set of rows.
+
+For more details about the current implementation of the record level index, please refer to
+[pull-3508](https://github.com/apache/hudi/pull/3508).
+
+## <a id='implementation'>Implementation</a>
+
+### <a id='impl-sql-layer'>SQL Layer</a>
+Parsing all kinds of index-related SQL (Spark/Flink, etc.), including create/drop/alter index, optimize table, etc.
+
+### <a id='impl-optimizer-layer'>Optimizer Layer</a>
+For ease of implementation, we can implement the first phase based on RBO (rule-based optimizer),
+and then gradually expand and improve CBO and HBO based on the collected statistical information.
+
+We can define RBO rules in several ways. For example, SQL with more than 10 predicates does not push down
+to the secondary index, but instead uses the existing scanning logic; it may be too costly to probe indexes
+for that many predicates just to get the row id set.
+
+### <a id='impl-api-layer'>Standard API Layer</a>
+The standard APIs are as follows; subsequent index types (e.g., HBase/Lucene/B+ tree ...) need to implement these APIs.
+
+```
+// Get row id set for the specified table with predicates
+Set<RowId> getRowIdSet(HoodieTable table, Map<column, List<PredicateList>> columnToPredicates ..)

Review Comment:
   All usages of an index amount to scan pruning. We should define a standard API for scan pruning that is generic enough for all forms of pruning (min/max stats, indexes) to work. I am thinking more like
   `HoodieTableScan pruneScan(HoodieTable table, HoodieTableScan scan, List<Predicate> columnPredicates)`
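
   A rough sketch of how such a generic pruning shape might look; every type here (`Predicate`, `HoodieTableScan`, `ScanPruner`, `indexPruner`) is a placeholder paraphrasing the suggested signature, not a real Hudi class:

   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.Set;
   import java.util.stream.Collectors;

   // Placeholder sketch: every pruning strategy (min/max stats,
   // secondary index, ...) maps a candidate scan to a possibly
   // smaller scan, given the column predicates.
   public class ScanPruningSketch {
       record Predicate(String column, Object value) {}
       record HoodieTableScan(List<String> fileGroups) {}

       interface ScanPruner {
           HoodieTableScan pruneScan(HoodieTableScan scan, List<Predicate> predicates);
       }

       // Toy index-backed pruner: keep only the file groups that the
       // index maps some predicate value to.
       static ScanPruner indexPruner(Map<Object, Set<String>> valueToFileGroups) {
           return (scan, predicates) -> {
               Set<String> hits = predicates.stream()
                       .flatMap(p -> valueToFileGroups
                               .getOrDefault(p.value(), Set.of()).stream())
                       .collect(Collectors.toSet());
               return new HoodieTableScan(scan.fileGroups().stream()
                       .filter(hits::contains)
                       .collect(Collectors.toList()));
           };
       }

       public static void main(String[] args) {
           // Hypothetical index: value 20 of column c2 lives only in fg-1.
           ScanPruner pruner = indexPruner(Map.of(20, Set.of("fg-1")));
           HoodieTableScan scan = new HoodieTableScan(List.of("fg-1", "fg-2", "fg-3"));
           HoodieTableScan pruned = pruner.pruneScan(scan, List.of(new Predicate("c2", 20)));
           System.out.println(pruned.fileGroups()); // [fg-1]
       }
   }
   ```

   A min/max-stats pruner would implement the same interface, so the planner could chain pruners without knowing which mechanism backs each one.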



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
