In general, it seems that the INDEX commands mainly serve batch scenarios. There are some cases that need clarifying here:

1. When a user first creates an index with manual refresh and then inserts a
batch of data (named d1) into the table, does the index take effect on d1?
2. If a user executes a DROP INDEX command on the table while another
streaming job is writing to the table and is using and building the index,
what happens?
3. For multi-engine index support, do you mean executing the CREATE INDEX
syntax on all kinds of engines? Does that mean we should support building
indexes for all of these engines? And if the writer is a different engine
that also writes/reads the index, how do we handle the transactions?
4. We may need to distinguish between different kinds of indexes in the
syntax, because Hudi's current indexes (column stats index, bloom filter
index, and pk index) are all a little different from a database pk index
and secondary index. Should we give them specific keywords?

Best,
Danny

Y Ethan Guo <[email protected]> wrote on Tue, Apr 19, 2022 at 01:49:

> +1 it would be great to make Hudi's index support all query engines. Given
> that we already have multi-modal index (column stats index, bloom filter
> index) in metadata table and there is a proposal to have a metastore
> server, is the ultimate goal to serve the index from metastore leveraging
> metadata table for all engines?
>
> On Mon, Apr 18, 2022 at 7:39 AM wangxianghu <[email protected]> wrote:
>
> > +1 on index improvement
> > index optimization is a very valuable thing for hudi
> > Looking forward to the design doc
> >
> > At 2022-04-18 11:18:35, "Forward Xu" <[email protected]> wrote:
> > >Hi All,
> > >
> > >I want to improve hudi's index. There are four main steps to achieve this
> > >
> > >1. Implement index syntax
> > >    a. Implement index syntax for spark sql [1] , I have submitted the
> > >first pr.
> > >    b. Implement index syntax for prestodb sql
> > >    c. Implement index syntax for trino sql
> > >
> > >2. read/write index decoupling
> > >The read/write index is decoupled from the computing engine side, and the
> > >sql index syntax of the first step can be independently executed and
> > >called through the API.
> > >
> > >3. build index service
> > >
> > >Promote the implementation of the hudi service framework, including index
> > >service, metastore service[2], compact/cluster service[3], etc.
> > >
> > >4. Index Management
> > >There are two kinds of management semantics for Index.
> > >
> > >   - Automatic Refresh
> > >   - Manual Refresh
> > >
> > >
> > >   1. Automatic Refresh
> > >
> > >When a user creates an index on the main table without using WITH DEFERRED
> > >REFRESH syntax, the index will be managed by the system automatically. For
> > >every data load to the main table, the system will immediately trigger a
> > >load to the index automatically. These two data loads (to the main table
> > >and to the index) are executed in a transactional manner, meaning that
> > >either both succeed or neither does.
> > >
> > >The data loading to index is incremental, avoiding an expensive total
> > >refresh.
> > >
> > >If a user performs any of the following commands on the main table, the
> > >system will reject the operation and return a failure.
> > >
> > >
> > >   - Data management command: UPDATE/DELETE.
> > >   - Schema management command: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
> > >   DATATYPE, ALTER TABLE RENAME. Note that adding a new column is
> > >   supported. For the drop column and change datatype commands, hudi will
> > >   check whether the operation impacts the index table. If not, the
> > >   operation is allowed; otherwise it is rejected by throwing an exception.
> > >   - Partition management command: ALTER TABLE ADD/DROP PARTITION.
> > >
> > >If a user does want to perform the above operations on the main table, the
> > >user can first drop the index, perform the operation, and re-create the
> > >index again.
> > >
> > >If a user drops the main table, the index will be dropped immediately too.
> > >
> > >We recommend using this management mode for indexing.
> > >
> > >   2. Manual Refresh
> > >
> > >When a user creates an index on the main table using WITH DEFERRED REFRESH
> > >syntax, the index will be created with status disabled and queries will NOT
> > >use this index until the user issues a REFRESH INDEX command to build the
> > >index. For every REFRESH INDEX command, the system will trigger a full
> > >refresh of the index. Once the refresh operation is finished, the system
> > >will change the index status to enabled, so that it can be used in query
> > >rewrite.
> > >
> > >For every new data load, data update, or delete, the related index will be
> > >made disabled, which means that subsequent queries will not benefit from
> > >the index until it becomes enabled again.
> > >
> > >If the main table is dropped by the user, the related index will be
> > >dropped immediately.
> > >
> > >
> > >Any feedback is welcome!
> > >
> > >Thank you.
> > >
> > >Regards,
> > >Forward Xu
> > >
> > >Related Links:
> > >[1] Implement index syntax for spark sql
> > ><https://issues.apache.org/jira/browse/HUDI-3881>
> > >[2] Metastore service <https://github.com/apache/hudi/pull/5064>
> > >[3] compaction/clustering job in Service
> > ><https://github.com/apache/hudi/pull/4872>
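
To make the manual refresh semantics in the proposal easier to discuss, here is a minimal sketch of the status transitions it describes: created disabled, enabled after a full REFRESH INDEX, disabled again by any write to the main table. This is only an illustration under assumptions. The names `IndexStatus`, `DeferredIndex`, `refresh`, `on_table_write` and `usable_for_query` are hypothetical and are not part of Hudi or of the submitted PR.

```python
from dataclasses import dataclass
from enum import Enum


class IndexStatus(Enum):
    DISABLED = "disabled"
    ENABLED = "enabled"


@dataclass
class DeferredIndex:
    """Hypothetical model of an index created WITH DEFERRED REFRESH."""

    name: str
    # Per the proposal, a deferred-refresh index starts out disabled.
    status: IndexStatus = IndexStatus.DISABLED
    full_refresh_count: int = 0

    def refresh(self) -> None:
        # REFRESH INDEX: every call is modeled as a full rebuild,
        # after which the index becomes enabled.
        self.full_refresh_count += 1
        self.status = IndexStatus.ENABLED

    def on_table_write(self) -> None:
        # Any new data load, update, or delete on the main table
        # disables the index until the next refresh.
        self.status = IndexStatus.DISABLED

    def usable_for_query(self) -> bool:
        # Queries may use the index only while it is enabled.
        return self.status is IndexStatus.ENABLED
```

Under this model, the batch d1 from question 1 would call `on_table_write()`, so the index stays unusable for queries on d1 until the next `refresh()`. Whether that is the intended behavior is exactly what the question asks the proposal to confirm.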
