+1 on the index improvement. Index optimization is very valuable for Hudi.
Looking forward to the design doc.
At 2022-04-18 11:18:35, "Forward Xu" <forwardxu...@gmail.com> wrote:
>Hi All,
>
>I want to improve Hudi's index. There are four main steps to achieve this:
>
>1. Implement index syntax
>    a. Implement index syntax for Spark SQL [1]. I have submitted the
>first PR.
>    b. Implement index syntax for PrestoDB SQL
>    c. Implement index syntax for Trino SQL
>
>2. Read/write index decoupling
>The read/write index is decoupled from the computing engine side, and the
>SQL index syntax of the first step can be independently executed and called
>through the API.
>
>3. Build index service
>
>Promote the implementation of the Hudi service framework, including index
>service, metastore service [2], compact/cluster service [3], etc.
>
>4. Index management
>There are two kinds of management semantics for an index:
>
>   - Automatic Refresh
>   - Manual Refresh
>
>
>   1. Automatic Refresh
>
>When a user creates an index on the main table without using the WITH
>DEFERRED REFRESH syntax, the index is managed by the system automatically.
>For every data load to the main table, the system immediately triggers a
>load to the index. These two data loads (to the main table and to the
>index) are executed transactionally, meaning that either both succeed or
>neither does.
>
>The data load to the index is incremental, avoiding an expensive full
>refresh.
>
>If a user runs any of the following commands on the main table, the system
>rejects the operation and returns a failure:
>
>   - Data management commands: UPDATE/DELETE.
>   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE CHANGE
>   DATATYPE, ALTER TABLE RENAME. Note that adding a new column is supported.
>   For drop-column and change-datatype commands, Hudi checks whether the
>   change affects the index table; if not, the operation is allowed,
>   otherwise it is rejected by throwing an exception.
>   - Partition management commands: ALTER TABLE ADD/DROP PARTITION.
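To check my understanding of the automatic-refresh rules quoted above, here is a minimal in-memory sketch in Python. It is only an illustration of the semantics, not Hudi code: `AutoRefreshTable`, `IndexedTableError`, `load`, and `check_command` are hypothetical names, and the command strings are simply the ones listed in the proposal.

```python
from dataclasses import dataclass, field

# Commands the proposal says are always rejected while an auto-refreshed index exists.
REJECTED_COMMANDS = {
    "UPDATE",
    "DELETE",
    "ALTER TABLE RENAME",
    "ALTER TABLE ADD PARTITION",
    "ALTER TABLE DROP PARTITION",
}
# Schema changes that are rejected only when they touch the indexed column.
COLUMN_CHECKED_COMMANDS = {"ALTER TABLE DROP COLUMN", "ALTER TABLE CHANGE DATATYPE"}


class IndexedTableError(Exception):
    """Raised when a command is rejected or a load cannot be committed."""


@dataclass
class AutoRefreshTable:
    """Hypothetical toy table with one automatically refreshed index."""
    indexed_column: str
    rows: list = field(default_factory=list)
    index: dict = field(default_factory=dict)  # column value -> row positions

    def load(self, new_rows):
        """Load rows into the table and the index: both succeed or neither does."""
        new_rows = list(new_rows)
        base = len(self.rows)
        delta = {}  # index entries for the new rows only (incremental)
        for offset, row in enumerate(new_rows):
            if self.indexed_column not in row:
                # The index load would fail, so nothing is committed.
                raise IndexedTableError(f"row {row!r} lacks {self.indexed_column!r}")
            delta.setdefault(row[self.indexed_column], []).append(base + offset)
        # Commit point: all validation passed, apply both changes together.
        self.rows.extend(new_rows)
        for value, positions in delta.items():
            self.index.setdefault(value, []).extend(positions)

    def check_command(self, command, column=None):
        """Return True if the command is allowed, else raise IndexedTableError."""
        if command in REJECTED_COMMANDS:
            raise IndexedTableError(f"{command} is rejected while the index exists")
        if command in COLUMN_CHECKED_COMMANDS and column == self.indexed_column:
            raise IndexedTableError(f"{command} on {column!r} would affect the index")
        return True
```

In this sketch a failed load leaves both the rows and the index untouched, and only the newly loaded rows are indexed, which is how I read "transactional" and "incremental" in the proposal.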
>
>If a user does want to perform the above operations on the main table, the
>user can first drop the index, perform the operation, and then re-create
>the index.
>
>If a user drops the main table, the index is dropped immediately too.
>
>We recommend using this management mode for indexing.
>
>   2. Manual Refresh
>
>When a user creates an index on the main table using the WITH DEFERRED
>REFRESH syntax, the index is created with status disabled, and queries will
>NOT use this index until the user issues a REFRESH INDEX command to build
>the index. For every REFRESH INDEX command, the system triggers a full
>refresh of the index. Once the refresh operation finishes, the system
>changes the index status to enabled, so that it can be used in query
>rewrite.
>
>For every new data load, update, or delete, the related index is disabled,
>which means that subsequent queries will not benefit from the index until
>it becomes enabled again.
>
>If the main table is dropped by the user, the related index is dropped
>immediately.
>
>
>
>Any feedback is welcome!
>
>Thank you.
>
>Regards,
>Forward Xu
>
>Related Links:
>[1] Implement index syntax for spark sql
><https://issues.apache.org/jira/browse/HUDI-3881>
>[2] Metastore service <https://github.com/apache/hudi/pull/5064>
>
>[3] Compaction/clustering job in Service
><https://github.com/apache/hudi/pull/4872>
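
The manual-refresh lifecycle described in the quoted mail reads to me like a small two-state machine, so here is a rough Python sketch of it. Again, this is an assumption-laden illustration rather than anything from the Hudi codebase: `DeferredRefreshIndex`, `IndexStatus`, `refresh`, `on_table_change`, and `lookup` are hypothetical names.

```python
from enum import Enum


class IndexStatus(Enum):
    DISABLED = "disabled"
    ENABLED = "enabled"


class DeferredRefreshIndex:
    """Hypothetical model of an index created WITH DEFERRED REFRESH."""

    def __init__(self, column):
        self.column = column
        self.status = IndexStatus.DISABLED  # created disabled, unused by queries
        self.entries = {}  # column value -> row positions

    def refresh(self, table_rows):
        """Model of REFRESH INDEX: full rebuild, then mark the index enabled."""
        rebuilt = {}
        for position, row in enumerate(table_rows):
            rebuilt.setdefault(row[self.column], []).append(position)
        self.entries = rebuilt
        self.status = IndexStatus.ENABLED

    def on_table_change(self):
        """Any data load, update, or delete on the main table disables the index."""
        self.status = IndexStatus.DISABLED

    def lookup(self, table_rows, value):
        """Use the index only when enabled; otherwise fall back to a full scan."""
        if self.status is IndexStatus.ENABLED:
            return [table_rows[p] for p in self.entries.get(value, [])]
        return [row for row in table_rows if row[self.column] == value]
```

Queries stay correct in both states here; the only difference is whether they benefit from the index, which matches the "will not benefit from the index before it becomes enabled again" wording.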