Re: [DISCUSS] hudi index improve

Shiyan Xu Mon, 18 Apr 2022 02:05:49 -0700

+1 great initiative.

Please also support Trino. Todd Gao is working on Trino/Presto native
connectors. We should align the plan going from there. Looking forward to
the RFC.


On Mon, Apr 18, 2022 at 11:41 AM 孟涛 <mengtao0...@qq.com.invalid> wrote:

> +1, it will be a great feature for Hudi.
> Indexes are very important for speeding up queries, and we are also trying
> to add index support for Trino on Hudi; maybe we can work together. Looking
> forward to the design documents.
> Some minor questions:
> 1. Do we need to consider concurrent operations?
> 2. Do we want to use metaTable to store index information?
>
> ------------------ Original Message ------------------
> From: "dev" <forwardxu...@gmail.com>;
> Sent: Monday, April 18, 2022, 11:18 AM
> To: "dev" <dev@hudi.apache.org>;
> Subject: [DISCUSS] hudi index improve
>
> Hi All,
>
> I want to improve Hudi's index support. There are four main steps to
> achieve this:
>
> 1. Implement index syntax
>     a. Implement index syntax for Spark SQL [1]. I have submitted the
>        first PR.
>     b. Implement index syntax for PrestoDB SQL
>     c. Implement index syntax for Trino SQL
>
> 2. Read/write index decoupling
> Index reads and writes are decoupled from the compute engine, so that the
> SQL index syntax from the first step can be executed independently and
> invoked through an API.
>
> 3. Build index service
>
> Advance the implementation of the Hudi service framework, including the
> index service, metastore service [2], compaction/clustering service [3],
> etc.
>
> 4. Index Management
> There are two kinds of management semantics for an index:
>
>    - Automatic Refresh
>    - Manual Refresh
>
>    1. Automatic Refresh
>
> When a user creates an index on the main table without using WITH DEFERRED
> REFRESH syntax, the index will be managed by the system automatically. For
> every data load to the main table, the system will immediately trigger a
> load to the index automatically. These two data loads (to the main table
> and to the index) are executed in a transactional manner, meaning that
> either both succeed or neither does.
>
> The data loading to index is incremental, avoiding an expensive total
> refresh.
>
> If a user runs any of the following commands on the main table, the system
> will reject the operation and return a failure:
>
>    - Data management command: UPDATE/DELETE.
>    - Schema management command: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
>    DATATYPE, ALTER TABLE RENAME. Note that adding a new column is supported,
>    and for the drop column and change datatype commands, Hudi will check
>    whether the change impacts the index table; if not, the operation is
>    allowed, otherwise it is rejected by throwing an exception.
>    - Partition management command: ALTER TABLE ADD/DROP PARTITION.
>
> If a user does want to perform the above operations on the main table, the
> user can first drop the index, perform the operation, and then re-create
> the index.
>
> If a user drops the main table, the index will be dropped immediately too.
>
> We recommend using this management mode for indexing.
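
To make the automatic-refresh rules above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `AutoRefreshTable`, `OperationRejected`, `check`, and `load` are hypothetical names, not Hudi APIs, and the "index" is just a projection of the indexed columns. It shows the two behaviours described: a load updates the main table and the index together or not at all, and the listed commands are rejected while the index exists.

```python
from dataclasses import dataclass, field

# Command names are taken from the list above; everything else is hypothetical.
ALWAYS_REJECTED = {
    "UPDATE", "DELETE", "ALTER TABLE RENAME",
    "ALTER TABLE ADD PARTITION", "ALTER TABLE DROP PARTITION",
}
# Allowed only when the affected column is not covered by the index.
CHECKED_AGAINST_INDEX = {"ALTER TABLE DROP COLUMN", "ALTER TABLE CHANGE DATATYPE"}


class OperationRejected(Exception):
    """Raised when a command would leave the index inconsistent."""


@dataclass
class AutoRefreshTable:
    indexed_columns: set
    rows: list = field(default_factory=list)
    index_rows: list = field(default_factory=list)

    def check(self, operation, column=None):
        """Raise OperationRejected if the command is not allowed."""
        if operation in ALWAYS_REJECTED:
            raise OperationRejected(operation)
        if operation in CHECKED_AGAINST_INDEX and column in self.indexed_columns:
            raise OperationRejected(f"{operation} on indexed column {column}")

    def load(self, batch):
        """Incrementally load a batch into the table and the index, all or nothing."""
        # Build both new states first; a failure here leaves self untouched.
        new_rows = self.rows + list(batch)
        new_index = self.index_rows + [
            {col: row[col] for col in sorted(self.indexed_columns)} for row in batch
        ]
        # Commit both together.
        self.rows, self.index_rows = new_rows, new_index
```

In this sketch a batch that cannot be indexed (for example, a row missing an indexed column) fails before either list is replaced, which is the "both or neither" behaviour; only the new batch is projected into the index, so there is no full rebuild.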
>
>    2. Manual Refresh
>
> When a user creates an index on the main table using WITH DEFERRED REFRESH
> syntax, the index will be created with status disabled, and queries will
> NOT use this index until the user issues a REFRESH INDEX command to build
> the index. For every REFRESH INDEX command, the system will trigger a full
> refresh of the index. Once the refresh operation is finished, the system
> will change the index status to enabled, so that it can be used in query
> rewrite.
>
> After every new data load, update, or delete, the related index will be
> disabled, which means that subsequent queries will not benefit from the
> index until it becomes enabled again.
>
> If the main table is dropped by the user, the related index will be dropped
> immediately.
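
The manual-refresh lifecycle described above amounts to a small state machine. The following Python sketch is an assumption-laden illustration (`DeferredIndex` and its methods are hypothetical names, not Hudi APIs): the index starts disabled, a refresh rebuilds it in full and enables it, and any change to the main table disables it again.

```python
class DeferredIndex:
    """Hypothetical model of an index created WITH DEFERRED REFRESH."""

    def __init__(self):
        self.enabled = False  # created with status disabled
        self.entries = []

    def refresh(self, table_rows):
        """Model of REFRESH INDEX: full rebuild, then enable."""
        self.entries = list(table_rows)
        self.enabled = True

    def on_table_change(self):
        """Any data load, update, or delete disables the index."""
        self.enabled = False

    def usable_for_query(self):
        """Query rewrite may only use the index while it is enabled."""
        return self.enabled
```

For example, after `idx = DeferredIndex()` the index is not usable; `idx.refresh(rows)` makes it usable; a following `idx.on_table_change()` makes it unusable again until the next refresh.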
>
>
>
> Any feedback is welcome!
>
> Thank you.
>
> Regards,
> Forward Xu
>
> Related Links:
> [1] Implement index syntax for spark sql
> <https://issues.apache.org/jira/browse/HUDI-3881>
> [2] Metastore service <https://github.com/apache/hudi/pull/5064>
> [3] compaction/clustering job in Service
> <https://github.com/apache/hudi/pull/4872>

-- 
Best,
Shiyan
