Re: [PR] [HUDI-4552][RFC-58] Integrate column stats index with all query engines [hudi]

2024-01-02 Thread via GitHub


pratyakshsharma commented on code in PR #6345:
URL: https://github.com/apache/hudi/pull/6345#discussion_r1439625464


##
rfc/rfc-58/rfc-58.md:
##
@@ -0,0 +1,69 @@
+
+# RFC-58: Integrate column stats index with all query engines
+
+
+
+## Proposers
+
+- @pratyakshsharma
+
+## Approvers

Review Comment:
   @prasannarajaperumal I have elaborated the ColumnHandle/ColumnDomain 
approach in a bit more detail. Sorry for the long hold on this one. Please have 
a look.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-4552][RFC-58] Integrate column stats index with all query engines [hudi]

2024-01-02 Thread via GitHub


pratyakshsharma commented on code in PR #6345:
URL: https://github.com/apache/hudi/pull/6345#discussion_r1439623738


##
rfc/rfc-58/rfc-58.md:
##
@@ -0,0 +1,69 @@
+
+# RFC-58: Integrate column stats index with all query engines
+
+
+
+## Proposers
+
+- @pratyakshsharma
+
+## Approvers
+- @bhavanisudha
+- @danny0405
+- @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4552
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Query engines like hive or presto typically scan a large amount of data for 
query planning and execution. Proper indexing can help reduce this scan to a 
great extent. Parquet files are the most commonly used file format for storing 
columnar data with various lakehouse techniques mainly because of their strong 
support with spark and
+the kind of indexing that they employ at different levels. Parquet files 
maintain indexes at file level, row group level and page level. Till some time 
back, Hudi used to make use of these indexes for fast querying via the parquet 
reader libraries. The problem with this approach was every file object had to 
be opened once to read the index stored in parquet footer to be able to do file 
pruning. This could potentially become a bottleneck in case of a large number of
+files. With the introduction of [multi-modal 
index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
 in Hudi, this problem has been solved to a great extent. Currently the data 
skipping support using this multi-modal index is available for spark and 
[flink](https://issues.apache.org/jira/browse/HUDI-4353) engines. We intend to 
extend this support for other query engines like presto, trino and hive in this 
RFC. 
+
+## Background
+[RFC-27](https://github.com/apache/hudi/blob/master/rfc/rfc-27/rfc-27.md) 
added a new partition corresponding to column_stats index in metadata table of 
Hudi. We plan to use the information stored in this partition for pruning the 
files. 
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture.
+Provide a detailed description of how you intend to implement this 
feature.This may be fairly extensive and have large subsections of its own.
+Or it may be a few sentences. Use judgement based on the scope of the change.
+
+We propose two different approaches for integrating column stats index with 
different query engines and discuss the pros and cons for the same below.
+1. **Using domains** - Presto and Trino have the concept of column domains. 
Domain is actually the set of possible values that need to be returned for a 
particular column. Domains get created at the time of creating splits for 
processing. Domains basically contain a map of column to possible values where 
the possible values are populated after doing the necessary pre work of 
combining all the different filter predicates supplied as part of the query. 
[This draft PR](https://github.com/apache/hudi/pull/6087) shows the use of 
these domains for integrating data skipping index with presto engine. 
+This basically involves exposing a new api in HoodieTableMetadata.java as 
below - 
+
+```java
+FileStatus[] getFilesToQueryUsingCSI(List columns, 
ColumnDomain columnDomain) throws IOException;

Review Comment:
   @alexeykudinkin I have added the details.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org