Hi yejunhao,

I'd like to know the status of this feature. Are you still working on it? Thanks.

Best,
Fang Yong

On Wed, Mar 20, 2024 at 12:14 PM Aitozi <[email protected]> wrote:

> Thanks for your inputs, I have no other questions, +1 for this.
> Looking forward to this feature.
>
> Best,
> Aitozi.
>
> JUNHAO YE <[email protected]> wrote on Wed, Mar 20, 2024 at 10:19:
>
> > Hi, aitozi
> >
> > Thanks a lot for the comments! I have read your questions and reply here:
> >
> > (1) For now, the secondary index is mainly designed for append-only tables.
> > More and more users are migrating from Hive and Hudi to Paimon, and their
> > main table format is append-only. In the future, after deletion files are
> > done, I think the secondary index will also be useful for primary key
> > tables with deletion files.
> > See PIP-16 (
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode
> > )
> > But that is not the job of this period. I should add this to the PIP.
> >
> > (2) The answer is yes. I referred to the approaches of Hudi and Delta Lake.
> > Hudi puts the index bytes in the user meta space of the ORC and Parquet
> > files, while Delta Lake uses an extra file to support indexes. As a result,
> > I want it to be more flexible.
> > Indeed, it will double the number of files, but the index file itself will
> > not be touched often. Maybe later we can consider combining these index
> > files to reduce the pressure on the filesystem, but I think we can
> > implement it this way for now.
> >
> > (3) Correct. If you want to drop one column index (this does not happen
> > often), we just rewrite the index file: discard the corresponding bytes,
> > write it back to the file, and finally rewrite the DataFileMeta in the
> > ManifestEntry.
> >
> > Thanks again for the comments!
> >
> > Best,
> > Junhao
> >
> > > On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
> > >
> > > Hi, junhao
> > >
> > > It's nice to see the secondary index feature in Paimon. After reading the
> > > PIP, I have several questions here.
> > >
> > > (1) For the primary key table, we only push down the filter for the
> > > primary key, because we cannot filter on a value if the value should be
> > > merged with data from other levels. So will the primary key table benefit
> > > from the secondary column index? Or is the main improvement for the
> > > append table?
> > >
> > > (2) The storage of the index file, "one file for one datafile of one index
> > > type": will this bring too many extra files? Will an index type double the
> > > file number?
> > >
> > > (3) "While drop column index, for example, I have indexed column a and b, I
> > > don't want to index a anymore. I just need to drop the target index bytes
> > > from index file, and don't have to read the data file again."
> > >
> > > Do you mean we will have to rewrite the index file when dropping one
> > > column index in it?
> > >
> > > Best,
> > > Aitozi
> > >
> > > JUNHAO YE <[email protected]> wrote on Tue, Mar 19, 2024 at 19:26:
> > >
> > >> Hi, Zhang YiLong
> > >>
> > >> You are right, as I mentioned in PIP-17. We should have priorities for
> > >> different index types. We should also consider combining the results of
> > >> different index types.
> > >>
> > >> Best, junhao.
> > >>
> > >>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
> > >>>
> > >>> This is a big improvement, but I don't think it's for low-cardinality
> > >>> fields, because the index is at the file level, and a low-cardinality
> > >>> field (e.g. gender is only male and female) is in most cases (when the
> > >>> field is not sorted) present in all files.
> > >>>
> > >>> For specific businesses, we want a JSON index, bitmap index, reverse
> > >>> index, etc. to adapt to different query conditions.
> > >>> So we also need a priority, using different indexes for different
> > >>> query filters and finally combining the results (based on the actual
> > >>> filter criteria and/or).
> > >>>
> > >>> ________________________________
> > >>> From: yu zelin <[email protected]>
> > >>> Sent: Mar 15, 2024 14:43
> > >>> To: [email protected] <[email protected]>
> > >>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
> > >>>
> > >>> An exciting feature, +1.
> > >>>
> > >>> Best Regards,
> > >>> Zelin Yu
> > >>>
> > >>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
> > >>>
> > >>>> Hi, Paimon Devs, I'd like to start a discussion about PIP-17[1].
> > >>>>
> > >>>> Up to now, Paimon uses zorder, order, and hilbert sort compaction to
> > >>>> speed up queries. After sort compaction, files are sorted by the order
> > >>>> of the specified columns. But in some situations, for example, we have
> > >>>> tens of columns that should be used as filter columns; sometimes all of
> > >>>> them come up together, sometimes just a few of them. Zorder or order
> > >>>> compaction can't handle this situation, because too many columns reduce
> > >>>> the effect of sorting. So if the cardinality of these columns is small,
> > >>>> we can use bloomfilter or other indexes to speed up queries. That's why
> > >>>> this PIP comes up. I want to introduce an index framework to support
> > >>>> Paimon with a flexible index system.
> > >>>>
> > >>>> Looking forward to your questions and suggestions.
> > >>>>
> > >>>> Best, junhao
> > >>>>
> > >>>> [1]
> > >>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
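
The file-level pruning and the and/or combination of index results discussed in this thread can be sketched roughly as follows. This is only an illustrative Python sketch, not Paimon code and not the design in PIP-17: the names BloomFilter, build_file_index, might_match and files_to_read, the tuple-based predicate format, and the sample files are all assumptions made up for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_file_index(rows, columns):
    """Build one Bloom filter per indexed column for a single data file."""
    index = {col: BloomFilter() for col in columns}
    for row in rows:
        for col in columns:
            index[col].add(row[col])
    return index


def might_match(file_index, predicate):
    """Return False only if the file definitely has no matching row."""
    op = predicate[0]
    if op == "eq":
        _, col, value = predicate
        bloom = file_index.get(col)
        # A column without an index cannot be used to skip the file.
        return True if bloom is None else bloom.might_contain(value)
    if op == "and":
        # Skip the file as soon as one conjunct rules it out.
        return all(might_match(file_index, p) for p in predicate[1:])
    if op == "or":
        # Skip the file only if every disjunct rules it out.
        return any(might_match(file_index, p) for p in predicate[1:])
    raise ValueError(f"unsupported predicate: {op}")


# Hypothetical sample data: two data files of an append-only table.
files = {
    "data-1": [{"city": "Paris", "gender": "male"},
               {"city": "Rome", "gender": "female"}],
    "data-2": [{"city": "Tokyo", "gender": "female"},
               {"city": "Lima", "gender": "male"}],
}
indexes = {name: build_file_index(rows, ["city", "gender"])
           for name, rows in files.items()}


def files_to_read(predicate):
    """List the files whose index cannot rule out the predicate."""
    return [name for name, idx in indexes.items() if might_match(idx, predicate)]


print(files_to_read(("eq", "city", "Paris")))
print(files_to_read(("eq", "gender", "male")))
```

Because a Bloom filter has no false negatives, a file that really contains the value is always kept. The gender column appears with both values in every sample file, so a filter on it prunes nothing, which is the concern raised above about low-cardinality fields and file-level indexes.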
