herefree commented on PR #4723: URL: https://github.com/apache/paimon/pull/4723#issuecomment-2581802892
> > > I think this dynamic bloom filter may have poor performance.
> > >
> > > Firstly, it creates a new BloomFilter64 whenever it reaches the required number of records, but it does not deduplicate the records. For example, if I have one file containing 8,000,000 records and the number of items per BloomFilter is 1,000,000, then I need to create 8 BloomFilters to hold these values. But what this ignores is that the 8,000,000 records might deduplicate to only 1,000,000 distinct values. So the first improvement, which may work, is to test each record before writing it to the dynamic bloom filter: if it already exists, we can skip it.
> > >
> > > Secondly, we store small byte arrays in the metadata. If we set the dynamic bloom filter's item size too large, we have to store it as a file even if there are only a few distinct values. But if we set it too small, we can end up with too many bloom filters in one dynamic bloom filter, which costs more time to query. Maybe we need to figure out this problem.
> > >
> > > Can you test it and give us some performance results?
> >
> > Thanks for your reply. For the first problem, data duplication, I will fix it. For the second problem: if the user sets a small items and a large max_items, then as the amount of data increases it will indeed produce many small bloom filters. Maybe we could consider adding a new parameter, weight, to dynamically increase the items per bloom filter. For example, if the number of items in the first bloom filter is 1000, the number of items in the second bloom filter is 1000 + weight, the third is 1000 + 2 * weight, and so on. What do you think of this?
>
> Actually, I have already implemented one version of what you said; please see #3115. But it was rejected by the Paimon community. We solved the first problem the same way. The key is the second problem: what coefficient should we use to expand the bloom filter? Or is there any plan better than the expansion approach?

The number of bloom filters is limited by max_items; it does not grow indefinitely.
Only when the user sets a particularly large max_items and a particularly small items will the number of bloom filters become particularly large. Perhaps we should let the user specify this coefficient, with the default being no expansion. There are two ways to expand. The first is linear growth, using coefficient * nums as the increment: nums, nums + coefficient * nums, nums + 2 * coefficient * nums, and so on. The second, as you wrote before, is exponential growth: nums, nums * coefficient, nums * coefficient * coefficient? As for other plans, maybe we can use multi-threading to speed up the query.
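As a rough illustration of the tradeoff, here is a minimal Java sketch (hypothetical, not Paimon code; `linearCapacities`, `exponentialCapacities`, and the coefficient parameter are illustrative names) that counts how many bloom filters each expansion strategy needs to cover a given max_items:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two expansion strategies: how many bloom filters are needed
// to cover maxItems values when each successive filter's capacity grows
// linearly or exponentially. Query cost scales with the number of filters.
public class GrowthSketch {

    // Linear growth: nums, nums + c*nums, nums + 2*c*nums, ...
    // (c = 0 reproduces the current fixed-size behavior.)
    static List<Long> linearCapacities(long nums, long maxItems, long c) {
        List<Long> caps = new ArrayList<>();
        long total = 0;
        for (long i = 0; total < maxItems; i++) {
            long cap = nums + i * c * nums;
            caps.add(cap);
            total += cap;
        }
        return caps;
    }

    // Exponential growth: nums, nums*c, nums*c*c, ...
    static List<Long> exponentialCapacities(long nums, long maxItems, long c) {
        List<Long> caps = new ArrayList<>();
        long total = 0;
        long cap = nums;
        while (total < maxItems) {
            caps.add(cap);
            total += cap;
            cap = cap * c;
        }
        return caps;
    }

    public static void main(String[] args) {
        long nums = 1_000_000L, maxItems = 8_000_000L;
        // Fixed size (no expansion): eight 1,000,000-item filters.
        System.out.println(linearCapacities(nums, maxItems, 0).size());      // 8
        // Linear, coefficient 1: 1M + 2M + 3M + 4M covers 8M in 4 filters.
        System.out.println(linearCapacities(nums, maxItems, 1).size());      // 4
        // Exponential, coefficient 2: 1M + 2M + 4M + 8M covers 8M in 4 filters.
        System.out.println(exponentialCapacities(nums, maxItems, 2).size()); // 4
    }
}
```

Both strategies cut the filter count (and hence query cost) versus the fixed-size scheme, but the later, larger filters consume more space even when the data turns out to be sparse, which is exactly the metadata-vs-file storage tension described above.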