+1, great initiative. Please also support Trino. Todd Gao is working on the Trino/Presto native connectors, so we should align the plan with that work. Looking forward to the RFC.

On Mon, Apr 18, 2022 at 11:41 AM 孟涛 <mengtao0...@qq.com.invalid> wrote:

> +1, this will be a great feature for Hudi.
> Indexes are very important for speeding up queries, and we are also trying
> to add index support for Trino on Hudi; maybe we can work together.
> Looking forward to the design documents.
>
> Some minor questions:
> 1. Do we need to consider concurrent operations?
> 2. Do we want to use the metaTable to store index information?
>
> ------------------ Original Message ------------------
> From: "dev" <forwardxu...@gmail.com>
> Sent: Monday, April 18, 2022, 11:18 AM
> To: "dev" <dev@hudi.apache.org>
> Subject: [DISCUSS] hudi index improve
>
> Hi All,
>
> I want to improve Hudi's index. There are four main steps to achieve this:
>
> 1. Implement index syntax
>    a. Implement index syntax for Spark SQL [1]. I have submitted the
>       first PR.
>    b. Implement index syntax for PrestoDB SQL
>    c. Implement index syntax for Trino SQL
>
> 2. Read/write index decoupling
>    The read/write index is decoupled from the computing engine side, and
>    the SQL index syntax from the first step can be executed and called
>    independently through the API.
>
> 3. Build index service
>    Promote the implementation of the Hudi service framework, including the
>    index service, metastore service [2], compact/cluster service [3], etc.
>
> 4. Index management
>    There are two kinds of management semantics for an index:
>
>    - Automatic Refresh
>    - Manual Refresh
>
> 1. Automatic Refresh
>
> When a user creates an index on the main table without using the WITH
> DEFERRED REFRESH syntax, the index is managed by the system automatically.
> For every data load to the main table, the system immediately triggers a
> load to the index. These two data loads (to the main table and to the
> index) are executed transactionally, meaning that either both succeed or
> neither does.
>
> The data load to the index is incremental, avoiding an expensive full
> refresh.
>
> If a user runs any of the following commands on the main table, the system
> rejects the operation and returns a failure:
>
> - Data management commands: UPDATE/DELETE.
> - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
>   DATATYPE, ALTER TABLE RENAME. Note that adding a new column is
>   supported. For the drop-column and change-datatype commands, Hudi checks
>   whether the change would affect the index table; if not, the operation
>   is allowed, otherwise it is rejected by throwing an exception.
> - Partition management commands: ALTER TABLE ADD/DROP PARTITION.
>
> If a user does want to perform the above operations on the main table, the
> user can first drop the index, perform the operation, and then re-create
> the index.
>
> If a user drops the main table, the index is dropped immediately too.
>
> We recommend using this management mode for indexing.
>
> 2. Manual Refresh
>
> When a user creates an index on the main table using the WITH DEFERRED
> REFRESH syntax, the index is created with status disabled, and queries will
> NOT use this index until the user issues a REFRESH INDEX command to build
> the index. Every REFRESH INDEX command triggers a full refresh of the
> index. Once the refresh has finished, the system changes the index status
> to enabled so that it can be used in query rewrite.
>
> Every new data load, update, or delete disables the related index, which
> means that subsequent queries will not benefit from the index until it is
> enabled again.
>
> If the main table is dropped by the user, the related index is dropped
> immediately.
>
> Any feedback is welcome!
>
> Thank you.
>
> Regards,
> Forward Xu
>
> Related Links:
> [1] Implement index syntax for spark sql
>     <https://issues.apache.org/jira/browse/HUDI-3881>
> [2] Metastore service <https://github.com/apache/hudi/pull/5064>
> [3] compaction/clustering job in Service
>     <https://github.com/apache/hudi/pull/4872>

--
Best,
Shiyan
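
For readers following the thread, here is a minimal Python sketch of the two refresh semantics described in the quoted proposal. It is only an illustrative model, not Hudi code: the names `Table`, `Index`, `IndexStatus`, `RejectedOperation`, `index_entry`, and the `"key"` field are all hypothetical, and the rejected-command set lists only the commands the proposal rejects unconditionally.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class IndexStatus(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class RejectedOperation(Exception):
    """Raised when a command is refused because an auto-refresh index exists."""


# Assumption: only the unconditionally rejected commands from the proposal.
REJECTED_WITH_AUTO_INDEX = {
    "UPDATE", "DELETE", "ALTER TABLE RENAME",
    "ALTER TABLE ADD PARTITION", "ALTER TABLE DROP PARTITION",
}


def index_entry(row):
    # Hypothetical indexing function: index each row by its "key" field.
    return row["key"]


@dataclass
class Index:
    deferred_refresh: bool
    rows: list = field(default_factory=list)
    status: IndexStatus = IndexStatus.ENABLED

    def __post_init__(self):
        # WITH DEFERRED REFRESH: the index starts disabled until refreshed.
        if self.deferred_refresh:
            self.status = IndexStatus.DISABLED


@dataclass
class Table:
    rows: list = field(default_factory=list)
    index: Optional[Index] = None

    def load(self, new_rows):
        idx = self.index
        if idx is None:
            self.rows.extend(new_rows)
            return
        if idx.deferred_refresh:
            # Manual refresh: new data disables the index until REFRESH INDEX.
            self.rows.extend(new_rows)
            idx.status = IndexStatus.DISABLED
            return
        # Automatic refresh: table and index load together, all or nothing.
        table_snapshot, index_snapshot = list(self.rows), list(idx.rows)
        try:
            self.rows.extend(new_rows)
            # Incremental: only the new rows are indexed.
            idx.rows.extend(index_entry(r) for r in new_rows)
        except Exception:
            self.rows, idx.rows = table_snapshot, index_snapshot
            raise

    def refresh_index(self):
        # REFRESH INDEX: full rebuild, then the index becomes usable.
        self.index.rows = [index_entry(r) for r in self.rows]
        self.index.status = IndexStatus.ENABLED

    def execute(self, command):
        idx = self.index
        if idx and not idx.deferred_refresh and command in REJECTED_WITH_AUTO_INDEX:
            raise RejectedOperation(command)
```

In this model, a table with a deferred index only serves index lookups between a `refresh_index()` call and the next `load()`, while a table with an automatic index rolls back both the table and the index if indexing any new row fails.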