Re: [DISCUSS] hudi index improve

Danny Chan Mon, 18 Apr 2022 19:53:33 -0700

In general, it seems that the INDEX commands mainly serve batch
scenarios. There are some cases that need to be clarified here:


1. When a user first creates an index with manual refresh and then
inserts a batch of data (named d1) into the table, does the created
index take effect on d1?
2. If a user executes a DROP INDEX command on the table while another
streaming job is writing to the table and is using and building the
index, what happens?
3. For multi-engine index support, do you mean executing the CREATE
INDEX syntax on all kinds of engines? Does that mean we should
support building indexes in all of these engines? And if the writer is
a different engine that also writes/reads the index, how do we handle
the transactions?
4. We may need to distinguish between different kinds of indexes in
the syntax, because the current Hudi indexes (column stats index, bloom
filter index, and pk index) are all a little different from the
database pk index and secondary index. Should we give them specific
keywords?

Best,
Danny

Y Ethan Guo <[email protected]> wrote on Tue, Apr 19, 2022 at 01:49:
>
> +1, it would be great to make Hudi's index support all query engines. Given
> that we already have a multi-modal index (column stats index, bloom filter
> index) in the metadata table and there is a proposal for a metastore
> server, is the ultimate goal to serve the index from the metastore,
> leveraging the metadata table, for all engines?
>
> On Mon, Apr 18, 2022 at 7:39 AM wangxianghu <[email protected]> wrote:
>
> > +1 on index improvement.
> > Index optimization is very valuable for Hudi.
> > Looking forward to the design doc.
> >
> > At 2022-04-18 11:18:35, "Forward Xu" <[email protected]> wrote:
> > >Hi All,
> > >
> > >I want to improve Hudi's index. There are four main steps to achieve this:
> > >
> > >1. Implement index syntax
> > >    a. Implement index syntax for spark sql [1]. I have submitted the
> > >first PR.
> > >    b. Implement index syntax for prestodb sql
> > >    c. Implement index syntax for trino sql
> > >
> > >2. Read/write index decoupling
> > >The read/write index is decoupled from the computing engine side, and the
> > >SQL index syntax of the first step can be independently executed and
> > >called through the API.
> > >
> > >3. Build index service
> > >
> > >Promote the implementation of the hudi service framework, including index
> > >service, metastore service[2], compact/cluster service[3], etc.
> > >
> > >4. Index Management
> > >There are two kinds of management semantics for an index:
> > >
> > >   - Automatic Refresh
> > >   - Manual Refresh
> > >
> > >
> > >   1. Automatic Refresh
> > >
> > >When a user creates an index on the main table without using the WITH
> > >DEFERRED REFRESH syntax, the index is managed by the system automatically.
> > >For every data load to the main table, the system immediately triggers a
> > >load to the index. These two data loads (to the main table and to the
> > >index) are executed in a transactional manner, meaning that either both
> > >succeed or neither does.
> > >
> > >The data loading to index is incremental, avoiding an expensive total
> > >refresh.
> > >
> > >If a user performs any of the following commands on the main table, the
> > >system will reject the operation and return a failure:
> > >
> > >
> > >   - Data management commands: UPDATE/DELETE.
> > >   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE
> > >   CHANGE DATATYPE, ALTER TABLE RENAME. Note that adding a new column is
> > >   supported. For the drop column and change datatype commands, Hudi will
> > >   check whether the operation impacts the index table; if not, the
> > >   operation is allowed, otherwise it is rejected by throwing an exception.
> > >   - Partition management commands: ALTER TABLE ADD/DROP PARTITION.
> > >
> > >If a user does want to perform the above operations on the main table, the
> > >user can first drop the index, perform the operation, and then re-create
> > >the index.
> > >
> > >If a user drops the main table, the index will be dropped immediately too.
> > >
> > >We recommend using this management mode for indexing.
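
The automatic-refresh rules above can be sketched as a toy model. This is a
minimal Python illustration, not Hudi code: the class `IndexedTable`, the
set `ALWAYS_REJECTED`, and the methods `load` and `check_command` are all
hypothetical names, and the in-memory dict only stands in for a real index.
It shows the both-or-neither load and the rejection of certain commands.

```python
import copy

# Hypothetical: commands this toy model always rejects while an
# automatically refreshed index exists on the table.
ALWAYS_REJECTED = {
    "UPDATE", "DELETE", "ALTER TABLE RENAME",
    "ALTER TABLE ADD PARTITION", "ALTER TABLE DROP PARTITION",
}


class IndexedTable:
    """Toy main table with one automatically refreshed index."""

    def __init__(self, index_column):
        self.index_column = index_column
        self.rows = []    # main table data
        self.index = {}   # indexed value -> list of row positions

    def load(self, batch):
        """Load a batch into table and index; both succeed or neither does."""
        rows_backup = list(self.rows)
        index_backup = copy.deepcopy(self.index)
        try:
            for row in batch:
                key = row[self.index_column]  # KeyError if the column is missing
                self.rows.append(row)
                # Incremental update: only the newly loaded rows are indexed.
                self.index.setdefault(key, []).append(len(self.rows) - 1)
        except Exception:
            # Roll back both the table and the index, then re-raise.
            self.rows, self.index = rows_backup, index_backup
            raise

    def check_command(self, command, column=None):
        """Raise if the command must be rejected; otherwise return True."""
        if command in ALWAYS_REJECTED:
            raise RuntimeError(f"{command} rejected: drop the index first")
        if command in {"ALTER TABLE DROP COLUMN", "ALTER TABLE CHANGE DATATYPE"}:
            # Allowed only when the change does not touch the indexed column.
            if column == self.index_column:
                raise RuntimeError(f"{command} on '{column}' impacts the index")
        return True
```

In this sketch a failed batch leaves both `rows` and `index` exactly as they
were before the call, which is one simple way to read "transactional" here.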
> > >
> > >   2. Manual Refresh
> > >
> > >When a user creates an index on the main table using the WITH DEFERRED
> > >REFRESH syntax, the index is created with status disabled, and queries
> > >will NOT use this index until the user issues a REFRESH INDEX command to
> > >build it. For every REFRESH INDEX command, the system triggers a full
> > >refresh of the index. Once the refresh operation is finished, the system
> > >changes the index status to enabled, so that it can be used in query
> > >rewrite.
> > >
> > >After every new data load, update, or delete, the related index is
> > >disabled, which means that subsequent queries will not benefit from the
> > >index until it becomes enabled again.
> > >
> > >If the main table is dropped by the user, the related index is dropped
> > >immediately.
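
The manual-refresh lifecycle above amounts to a two-state machine. The
following is a minimal Python sketch under that reading, not Hudi's
implementation: `IndexStatus`, `DeferredIndex`, and all method names are
hypothetical, chosen only to mirror the disabled/enabled transitions
described in the text.

```python
from enum import Enum


class IndexStatus(Enum):
    DISABLED = "disabled"
    ENABLED = "enabled"


class DeferredIndex:
    """Toy index created WITH DEFERRED REFRESH (hypothetical model)."""

    def __init__(self, column):
        self.column = column
        self.status = IndexStatus.DISABLED  # created disabled
        self.entries = {}                   # indexed value -> row positions

    def refresh(self, rows):
        """REFRESH INDEX: rebuild from scratch, then enable the index."""
        self.entries = {}
        for pos, row in enumerate(rows):
            self.entries.setdefault(row[self.column], []).append(pos)
        self.status = IndexStatus.ENABLED

    def on_table_change(self):
        """Any load, update, or delete on the main table disables the index."""
        self.status = IndexStatus.DISABLED

    def usable_for_query(self):
        """Queries may use the index only while it is enabled."""
        return self.status is IndexStatus.ENABLED
```

Under this reading, a batch loaded after index creation is not served by
the index until the next REFRESH INDEX, since the load itself disables it.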
> > >
> > >
> > >
> > >Any feedback is welcome!
> > >
> > >Thank you.
> > >
> > >Regards,
> > >Forward Xu
> > >
> > >Related Links:
> > >[1] Implement index syntax for spark sql
> > ><https://issues.apache.org/jira/browse/HUDI-3881>
> > >[2] Metastore service <https://github.com/apache/hudi/pull/5064>
> > >
> > >[3] Compaction/clustering job in Service
> > ><https://github.com/apache/hudi/pull/4872>
> >
