Re: [DISCUSS] PIP-17: Introduce secondary column index

JUNHAO YE Tue, 14 May 2024 19:13:33 -0700

Hi Yong Fang,
I am still working on the feature. So far, the file index framework has been
added for columns.
Testing shows that the bloom filter does speed up queries, and the bloom-filter
index is ready.


There are several items I am still working on:
1. Procedures to add and remove a file index for an existing table.
2. Adding more index types (not sure which yet; maybe use Lucene to speed up
queries such as "where column_a like '%xxx%'").
3. Bringing this feature to StarRocks and other compute engines. StarRocks may
need a C++ index reader.
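
To illustrate how a file-level bloom-filter index can skip data files for an
equality filter, here is a minimal sketch in Python. It is not Paimon's
implementation: the class BloomFilter, the helpers build_file_index and
files_to_scan, the file names, and the column user_id are all hypothetical,
and the bit-array size and hash count are arbitrary.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; sizes are illustrative assumptions, not tuned values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_file_index(rows, column):
    """Build one bloom filter over a single column of a single data file."""
    bloom = BloomFilter()
    for row in rows:
        bloom.add(row[column])
    return bloom


def files_to_scan(file_indexes, value):
    """Keep only the files whose index cannot rule out `column = value`."""
    return [name for name, bloom in file_indexes.items()
            if bloom.might_contain(value)]


# Hypothetical data files, each a list of rows.
data_files = {
    "data-0.orc": [{"user_id": 1}, {"user_id": 2}],
    "data-1.orc": [{"user_id": 3}, {"user_id": 4}],
}
file_indexes = {name: build_file_index(rows, "user_id")
                for name, rows in data_files.items()}
candidates = files_to_scan(file_indexes, 3)
```

A bloom filter never produces false negatives, so the file that actually holds
the value is always kept in candidates; other files are skipped unless they
happen to hit a false positive.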


> On May 10, 2024, at 10:20 AM, Yong Fang <[email protected]> wrote:
> 
> Hi yejunhao,
> 
> I'd like to know what's the status of this feature, are you still working
> on it? Thanks
> 
> Best,
> Fang Yong
> 
> On Wed, Mar 20, 2024 at 12:14 PM Aitozi <[email protected]> wrote:
> 
>> Thanks for your inputs, I have no other questions, +1 for this.
>> Looking forward to this feature.
>> 
>> Best,
>> Aitozi.
>> 
>> On Wed, Mar 20, 2024 at 10:19 AM JUNHAO YE <[email protected]> wrote:
>> 
>>> Hi, Aitozi,
>>> 
>>> Thank you for the comments! I have read your questions and reply below:
>>> 
>>> (1) For now, the secondary index is mainly designed for append-only
>>> tables. More and more users are migrating from Hive and Hudi to Paimon,
>>> and their main table format is append-only. In the future, once deletion
>>> files are done, I think the secondary index will also be useful for
>>> primary-key tables with deletion files. See PIP-16 (
>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-16%3A+Paimon+position+delete+mode
>>> )
>>> But that is not in scope for this phase. I should add this to the PIP.
>>> 
>>> (2) The answer is yes. I referred to the approaches of Hudi and Delta
>>> Lake. Hudi puts the index bytes in the user metadata space of the ORC and
>>> Parquet files, while Delta Lake uses an extra file to hold the index, so
>>> I want it to be more flexible. Indeed, it will double the number of
>>> files, but the file itself will not be touched often. Later on, we can
>>> consider combining these index files to reduce the pressure on the
>>> filesystem, but I think we can implement it this way for now.
>>> 
>>> (3) Correct. If you want to drop one column's index (this does not
>>> happen often), we just rewrite the index file: discard the corresponding
>>> bytes, write the result back to a file, and then rewrite the
>>> DataFileMeta in the ManifestEntry.
>>> 
>>> Thanks again for the comments!
>>> 
>>> Best,
>>> Junhao
>>> 
>>> 
>>> 
>>> 
>>>> On Mar 19, 2024, at 11:07 PM, Aitozi <[email protected]> wrote:
>>>> 
>>>> Hi, junhao
>>>> 
>>>>   It's nice to see the secondary index feature in Paimon. After reading
>>>> the PIP, I have several questions.
>>>> 
>>>> (1) For the primary key table, we only push down filters on the primary
>>>> key, because we cannot filter on a value if it still has to be merged
>>>> with data from other levels. So will the primary key table benefit from
>>>> the secondary column index? Or is the main improvement for the append
>>>> table?
>>>> 
>>>> (2) About the storage of the index file, "one file for one datafile of
>>>> one index type": will this create too many extra files? Will each index
>>>> type double the number of files?
>>>> 
>>>> (3) "While drop column index, for example, I have indexed column a and
>>>> b, I don't want to index a anymore. I just need to drop the target index
>>>> bytes from index file, and don't have to read the data file again."
>>>> 
>>>> Do you mean we will have to rewrite the index file when dropping one
>>>> column index from it?
>>>> 
>>>> Best,
>>>> Aitozi
>>>> 
>>>> On Tue, Mar 19, 2024 at 7:26 PM JUNHAO YE <[email protected]> wrote:
>>>> 
>>>>> Hi, Zhang YiLong
>>>>> 
>>>>> You are right. As I mentioned in PIP-17, we should assign priorities to
>>>>> the different index types, and we should consider combining the results
>>>>> of different index types.
>>>>> 
>>>>> Best, junhao.
>>>>> 
>>>>> 
>>>>>> On Mar 18, 2024, at 10:49 AM, Zhang YiLong <[email protected]> wrote:
>>>>>> 
>>>>>> This is a big improvement, but I don't think it suits low-cardinality
>>>>>> fields: the index is at the file level, and a low-cardinality field
>>>>>> (e.g. gender, with only male and female) is in most cases (when the
>>>>>> field is not sorted) present in all files.
>>>>>> 
>>>>>> For specific business needs, we want a JSON index, bitmap index,
>>>>>> inverted index, etc. to suit different query conditions. So we also
>>>>>> need a priority, using different indexes for different query filters
>>>>>> and finally combining the results (based on the AND/OR of the actual
>>>>>> filter criteria).
>>>>>> 
>>>>>> ________________________________
>>>>>> From: yu zelin <[email protected]>
>>>>>> Sent: March 15, 2024 14:43
>>>>>> To: [email protected] <[email protected]>
>>>>>> Subject: Re: [DISCUSS] PIP-17: Introduce secondary column index
>>>>>> 
>>>>>> An exciting feature, +1.
>>>>>> 
>>>>>> Best Regards,
>>>>>> Zelin Yu
>>>>>> 
>>>>>> On Thu, Mar 14, 2024 at 5:53 PM yejunhao <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi, Paimon Devs, I’d like to start a discussion about PIP-17[1].
>>>>>>> 
>>>>>>> Up to now, Paimon has used zorder, order, and hilbert sort compaction
>>>>>>> to speed up queries. After sort compaction, files are sorted by the
>>>>>>> specified columns. But in some situations we have, for example, tens
>>>>>>> of columns that may appear in filters: sometimes all of them come up
>>>>>>> together, sometimes just a few of them. Zorder or order compaction
>>>>>>> can't handle this situation, because too many columns reduce the
>>>>>>> effect of sorting. So if the cardinality of these columns is small,
>>>>>>> we can use a bloom filter or other indexes to speed up queries. That's
>>>>>>> why this PIP comes up. I want to introduce an index framework to give
>>>>>>> Paimon a flexible index system.
>>>>>>> 
>>>>>>> I look forward to your questions and suggestions.
>>>>>>> 
>>>>>>> Best, junhao
>>>>>>> 
>>>>>>> [1]
>>>>>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-17%3A+Introduce+secondary+column+index
>>>>> 
>>>>> 
>>> 
>>> 
>>
