Hi Yong Fang,

I am still working on the feature. So far, the file index framework has been added for columns, and testing shows that the Bloom filter index does speed up queries. The Bloom filter index is ready.

There are several issues I am working on:
1. Procedures to add and remove file indexes for existing tables.
2. Add more types of index (not sure what to add yet; maybe use Lucene to speed up queries such as "where column_a like %xxx%").
3. Push this feature to StarRocks and other compute engines. StarRocks may need a C++ index reader.

> On May 10, 2024, at 10:20 AM, Yong Fang <[email protected]> wrote:
>
> Hi yejunhao,
>
> I'd like to know what's the status of this feature, are you still
> working on it? Thanks
>
> Best,
> Fang Yong
>
> On Wed, Mar 20, 2024 at 12:14 PM Aitozi <[email protected]> wrote:
>
>> Thanks for your inputs, I have no other questions, +1 for this.
>> Looking forward to this feature.
>>
>> Best,
>> Aitozi.
>>
>> JUNHAO YE <[email protected]> wrote on Wed, Mar 20, 2024 at 10:19:
>>
>>> Hi, aitozi
>>>
>>> Thanks a lot for the comments! I have read your questions and reply
>>> here:
>>>
>>> (1) For now, the secondary index is mainly designed for append-only
>>> tables. More and more users migrate from Hive and Hudi to Paimon,
>>> and their main table format is append-only. In the future, once
>>> deletion files are done, I think the secondary index will also be
>>> useful for primary key tables with deletion files.
>>> See PIP-16 (https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode).
>>> But that's not the job of this period. I should add this to the PIP.
>>>
>>> (2) The answer is yes. I referred to the approaches of Hudi and
>>> Delta Lake. Hudi puts the index bytes in the user meta space of the
>>> ORC and Parquet files; Delta Lake uses an extra file to support the
>>> index. As a result, I want it to be more flexible. Indeed, it will
>>> double the number of files, but the index file itself will not be
>>> touched often. Maybe later we can consider combining these index
>>> files to reduce the pressure on the filesystem, but I think we can
>>> implement it this way for now.
>>>
>>> (3) Correct. If you want to drop one column index (this does not
>>> happen often), we just rewrite the index file: discard the
>>> corresponding bytes, then write it back to the file and rewrite
>>> DataFileMeta in ManifestEntry.
>>>
>>> Thanks again for the comments!
>>>
>>> Best,
>>> Junhao
>>>
>>>> On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
>>>>
>>>> Hi, junhao
>>>>
>>>> It's nice to see the secondary index feature in Paimon. After
>>>> reading the PIP, I have several questions here.
>>>>
>>>> (1) For the primary key table, we only push down the filter for
>>>> the primary key, because we cannot filter on the value if the
>>>> value should be merged with other levels' data. So will the
>>>> primary key table benefit from the secondary column index? Or is
>>>> the main improvement for the append table?
>>>>
>>>> (2) The storage of the index file, "one file for one datafile of
>>>> one index type": will this bring too many extra files? Will an
>>>> index type x2 the file number?
>>>>
>>>> (3) "While drop column index, for example, I have indexed column a
>>>> and b, I don't want to index a anymore. I just need to drop the
>>>> target index bytes from index file, and don't have to read the
>>>> data file again."
>>>>
>>>> Do you mean we will have to rewrite the index file when dropping
>>>> one column index in it?
>>>>
>>>> Best,
>>>> Aitozi
>>>>
>>>> JUNHAO YE <[email protected]> wrote on Tue, Mar 19, 2024 at 19:26:
>>>>
>>>>> Hi, Zhang YiLong
>>>>>
>>>>> You are right, as I mentioned in PIP-17. We should have a
>>>>> priority among different index types, and we should consider
>>>>> combining the results of different index types.
>>>>>
>>>>> Best, junhao.
>>>>>
>>>>>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
>>>>>>
>>>>>> This is a big improvement, but I don't think it suits
>>>>>> low-cardinality fields: the index is at the file level, and a
>>>>>> low-cardinality field (e.g. gender is only male and female) is,
>>>>>> in most cases (when the field is not sorted), present in all
>>>>>> files.
>>>>>>
>>>>>> For specific business needs, we want a JSON index, bitmap index,
>>>>>> reverse index, etc. to adapt to different query conditions. So
>>>>>> we also need a priority, using different indexes for different
>>>>>> query filters and finally combining the results (based on the
>>>>>> actual filter criteria and/or).
>>>>>>
>>>>>> ________________________________
>>>>>> From: yu zelin <[email protected]>
>>>>>> Sent: March 15, 2024 14:43
>>>>>> To: [email protected] <[email protected]>
>>>>>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
>>>>>>
>>>>>> An exciting feature, +1.
>>>>>>
>>>>>> Best Regards,
>>>>>> Zelin Yu
>>>>>>
>>>>>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
>>>>>>
>>>>>>> Hi, Paimon Devs, I'd like to start a discussion about PIP-17[1].
>>>>>>>
>>>>>>> Up to now, Paimon uses zorder, order and hilbert sort
>>>>>>> compaction to speed up queries. After sort compaction, files
>>>>>>> are sorted by the order of the specified columns. But in some
>>>>>>> situations, for example, we have tens of columns that should be
>>>>>>> added as filter columns; sometimes all of them come up
>>>>>>> together, sometimes just a few of them. Zorder or order
>>>>>>> compaction can't handle this situation, because too many
>>>>>>> columns reduce the effect of sorting. So if the cardinality of
>>>>>>> these columns is small, we can use a Bloom filter or other
>>>>>>> indexes to speed up queries. That's why this PIP comes up. I
>>>>>>> want to introduce an index framework to give Paimon a flexible
>>>>>>> index system.
>>>>>>>
>>>>>>> Looking forward to your questions and suggestions.
>>>>>>>
>>>>>>> Best, junhao
>>>>>>>
>>>>>>> [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
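
For readers following the thread, here is a rough sketch of the file-level Bloom filter idea discussed above: one small filter per data file for an indexed column, consulted before the file is opened. It is only an illustration under stated assumptions, not Paimon's implementation. The class, the function names, the hashing scheme, and the file names are all hypothetical.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: k hash positions over an m-bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k positions by salting a SHA-256 digest with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


def build_file_index(column_values):
    """Build one filter from the values of one column in one data file."""
    bf = BloomFilter()
    for value in column_values:
        bf.add(value)
    return bf


def files_to_read(file_indexes, wanted):
    """Keep only the files whose filter cannot rule out `wanted`."""
    return [name for name, bf in file_indexes.items() if bf.might_contain(wanted)]


# Hypothetical data files and the values of one indexed column in each.
data_files = {
    "data-0.orc": ["u1", "u2", "u3"],
    "data-1.orc": ["u4", "u5"],
    "data-2.orc": ["u6", "u7", "u8"],
}
file_indexes = {name: build_file_index(vals) for name, vals in data_files.items()}

# An equality filter such as `WHERE user_id = 'u5'` only needs these files.
candidates = files_to_read(file_indexes, "u5")
print(candidates)
```

The file that really holds the value is always kept, since a Bloom filter has no false negatives; other files may occasionally be kept as false positives. This also shows the low-cardinality concern raised in the thread: if every file contains both "male" and "female", every filter answers "possibly present" and no file is skipped.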
