Re:[DISCUSS] hudi index improve

wangxianghu Mon, 18 Apr 2022 07:39:36 -0700

+1 on the index improvement.
Index optimization is very valuable for Hudi.
Looking forward to the design doc.

At 2022-04-18 11:18:35, "Forward Xu" <forwardxu...@gmail.com> wrote:
>Hi All,
>
>I want to improve Hudi's index. There are four main steps to achieve this:
>
>1. Implement index syntax
>    a. Implement index syntax for Spark SQL [1]; I have submitted the
>first PR.
>    b. Implement index syntax for PrestoDB SQL
>    c. Implement index syntax for Trino SQL
>
>2. Read/write index decoupling
>Index reads and writes are decoupled from the compute engine, so the SQL
>index syntax from step 1 can be executed independently and invoked through
>an API.
>
>3. Build index service
>
>Advance the implementation of the Hudi service framework, including the
>index service, metastore service [2], compaction/clustering service [3], etc.
>
>4. Index Management
>There are two kinds of management semantics for an index.
>
>   - Automatic Refresh
>   - Manual Refresh
>
>
>   1. Automatic Refresh
>
>When a user creates an index on the main table without the WITH DEFERRED
>REFRESH syntax, the system manages the index automatically. For every data
>load to the main table, the system immediately triggers a load to the
>index. The two loads (to the main table and to the index) are executed
>transactionally: either both succeed or neither does.
>
>The data loading to index is incremental, avoiding an expensive total
>refresh.
>
>If a user runs any of the following commands on the main table, the system
>rejects the operation and returns a failure:
>
>
>   - Data management commands: UPDATE/DELETE.
>   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
>   DATATYPE, ALTER TABLE RENAME. Note that adding a new column is supported.
>   For drop-column and change-datatype commands, Hudi checks whether the
>   change affects the index table; if it does not, the operation is allowed,
>   otherwise it is rejected with an exception.
>   - Partition management commands: ALTER TABLE ADD/DROP PARTITION.
>
>If a user does want to perform the above operations on the main table, the
>user can first drop the index, perform the operation, and then re-create
>the index.
>
>If a user drops the main table, the index will be dropped immediately too.
>
>We recommend this management mode for indexing.
>
>   2. Manual Refresh
>
>When a user creates an index on the main table using the WITH DEFERRED
>REFRESH syntax, the index is created in the disabled state, and queries will
>NOT use it until the user issues a REFRESH INDEX command to build it. Every
>REFRESH INDEX command triggers a full refresh of the index. Once the refresh
>finishes, the system changes the index status to enabled so that it can be
>used in query rewrite.
>
>After every new data load, update, or delete, the related index is disabled,
>which means that subsequent queries will not benefit from the index until it
>is enabled again.
>
>If the main table is dropped by the user, the related index will be dropped
>immediately.
>
>
>
>Any feedback is welcome!
>
>Thank you.
>
>Regards,
>Forward Xu
>
>Related Links:
>[1] Implement index syntax for spark sql
><https://issues.apache.org/jira/browse/HUDI-3881>
>[2] Metastore service <https://github.com/apache/hudi/pull/5064>
>
>[3] Compaction/clustering job in Service
><https://github.com/apache/hudi/pull/4872>
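
To make the two refresh modes in the quoted proposal easier to discuss, here is a minimal Python sketch of the index status transitions as I read them. It is only a toy model, not Hudi code: the names MainTable, Index, RefreshMode, OperationRejected, create_index, refresh_index and usable_indexes are all hypothetical. It also assumes an automatic index is built and enabled as soon as it is created, which the proposal does not state explicitly, and it leaves out the schema-change checks and the actual transaction handling.

```python
from enum import Enum


class RefreshMode(Enum):
    AUTOMATIC = "automatic"  # index created without WITH DEFERRED REFRESH
    MANUAL = "manual"        # index created WITH DEFERRED REFRESH


class OperationRejected(Exception):
    """Raised when the main table refuses an operation because of an index."""


# Simplification: schema commands are omitted, since the proposal allows
# them when they do not affect the index table.
REJECTED_WITH_AUTOMATIC_INDEX = {"UPDATE", "DELETE", "ADD PARTITION", "DROP PARTITION"}
DATA_CHANGES = {"UPDATE", "DELETE"}


class Index:
    def __init__(self, name, mode, table_loads):
        self.name = name
        self.mode = mode
        # Assumption: an automatic index is built and enabled at creation;
        # a deferred (manual) index starts disabled, as the proposal says.
        self.enabled = mode is RefreshMode.AUTOMATIC
        self.loads_applied = table_loads if self.enabled else 0


class MainTable:
    def __init__(self):
        self.loads = 0      # number of data loads committed to the table
        self.indexes = {}   # index name -> Index

    def create_index(self, name, deferred_refresh=False):
        mode = RefreshMode.MANUAL if deferred_refresh else RefreshMode.AUTOMATIC
        self.indexes[name] = Index(name, mode, self.loads)

    def load(self):
        """New data load: automatic indexes follow, manual ones are disabled."""
        self.loads += 1
        for idx in self.indexes.values():
            if idx.mode is RefreshMode.AUTOMATIC:
                # Incremental load; a real system would commit table and
                # index in one transaction, which this toy does not model.
                idx.loads_applied += 1
            else:
                idx.enabled = False

    def execute(self, operation):
        """Run a data or partition command such as UPDATE or DELETE."""
        has_automatic = any(
            i.mode is RefreshMode.AUTOMATIC for i in self.indexes.values()
        )
        if has_automatic and operation in REJECTED_WITH_AUTOMATIC_INDEX:
            raise OperationRejected(f"{operation} rejected: drop the index first")
        if operation in DATA_CHANGES:
            for idx in self.indexes.values():
                if idx.mode is RefreshMode.MANUAL:
                    idx.enabled = False

    def refresh_index(self, name):
        """REFRESH INDEX: full rebuild, then the index becomes usable."""
        idx = self.indexes[name]
        idx.loads_applied = self.loads
        idx.enabled = True

    def drop_index(self, name):
        del self.indexes[name]

    def usable_indexes(self):
        """Names of indexes a query rewrite could use right now."""
        return sorted(n for n, i in self.indexes.items() if i.enabled)
```

In this model, a table holding one automatic and one deferred index only exposes the deferred one to queries between a refresh_index call and the next load, while an UPDATE on the table is rejected until the automatic index is dropped.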
