Re: [PROPOSAL] Introduce Elastic Bloom Filter For Flink

Fabian Hueske Tue, 12 Jun 2018 00:44:32 -0700

Hi Sihua,

Sorry for not replying earlier.
I have one question left. If I understood the design of the linked
Bloomfilter nodes right, users would need to configure a TTL to be able to
remove a node.
When nodes are removed, we would need to insert every key into the current
node which would not be required if we don't remove nodes, right?


>From the small summary of approximated filters, cuckoo filters seem to be
most appropriate as they also support deletes.
Are you aware of any downsides compared to bloom filters (besides
potentially slower inserts)?

Best, Fabian



2018-06-12 9:29 GMT+02:00 sihua zhou <[email protected]>:

> Hi,
>
> no more feedbacks these days...I guess it's because you guys are too busy
> and since I didn't receive any negative feedbacks and there're already some
> positive feedbacks. So I want to implement this *Elastic Bloom Filter*
> based on the current design doc(because I have time to do it currently),
> even though I think the design can be improved definitely, but maybe we
> could discuss the improvement better base on the code, and I believe most
> of the code could be cherry picked for the "final implementation". Does
> anyone object this?
>
> Best, Sihua
>
>
> On 06/6/2018 22:02，sihua zhou<[email protected]> <[email protected]>
> wrote：
>
> Hi,
>
>
> Sorry, but pinging for more feedbacks on this proposal...
> Even the negative feedbacks is highly appreciated!
>
>
> Best, Sihua
>
>
>
>
>
>
> On 05/30/2018 13:19，sihua zhou<[email protected]> wrote：
> Hi,
>
>
> I did a survey of the variants of Bloom Filter and the Cuckoo filter these
> days. Finally, I found 3 of them maybe adaptable for our purpose.
>
>
> 1. standard bloom filter (which we have implemented base on this and used
> it on production with a good experience)
> 2. cuckoo filter, also a very good filter which is a space-efficient data
> structures and support fast query(even faster then BF, but the insert maybe
> a little slower than BF), addtional it support delete() operation.
> 3. count bloom filter, a variant of BF, it supports delete()operation, but
> need to cost 4-5x memory than the standard bloom filter(so, I'm not sure
> whether it's adaptable in practice).
>
>
> Anyway, these filters are just the smallest storage unit in this "Elastic
> Bloom Filter", we can define a general interface, and provide different
> implementation of "storage unit"  base on different filter if we want.
> Maybe I should change the PROPOSAL name to the "Introduce Elastic Filter
> For Flink", the ideal of approach that I outlined in the doc is very
> similar to the paper "Optimization and Applications of Dynamic Bloom
> Filters(http://ijarcs.info/index.php/Ijarcs/article/viewFile/826/814)"(compare
> to the paper, the approach I outlined could have a better query performance
> and also support the RELAXED TTL), maybe it can help to understand the
> desgin doc. Looking forward any feedback!
>
>
> Best, Sihua
> On 05/24/2018 10:36，sihua zhou<[email protected]> wrote：
> Hi,
> Thanks for your suggestions @Elias! I have a brief look at "Cuckoo Filter"
> and "Golumb Compressed Sequence", my first sensation is that maybe "Golumc
> Compressed Sequence" is not a good choose, because it seems to require
> non-constant lookup time, but Cuckoo Filter maybe a good choose, I should
> definitely have a deeper look at it.
>
>
> Beside, to me, all of this filters seems to a "variant" of the bloom
> filter(which is the smallest unit to store data in the current desgin), the
> main challenge for introducing BF into flink is the data skewed(which is
> common phenomenon on production) problem, could you maybe also have a look
> at the solution that I posted on the google doc https://docs.google.com/
> document/d/17UY5RZ1mq--hPzFx-LfBjCAw_kkoIrI9KHovXWkxNYY/edit?usp=sharing
> for this problem, It would be nice if you could give us some advice on that.
>
>
> Best, Sihua
>
>
> On 05/24/2018 07:21，Elias Levy<[email protected]> wrote：
> I would suggest you consider an alternative data structures: a Cuckoo
> Filter or a Golumb Compressed Sequence.
>
> The GCS data structure was introduced in Cache-, Hash- and Space-Efficient
> Bloom Filters
> <http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf> by
> F. Putze, P. Sanders, and J. Singler.  See section 4.
>
>
>
> We should discuss which exact implementation of bloom filters are the best
> fit.
> @Fabian: There are also implementations of bloom filters that use counting
> and therefore support
> deletes, but obviously this comes at the cost of a potentially higher
> space consumption.
>
> Am 23.05.2018 um 11:29 schrieb Fabian Hueske <[email protected]>:
> IMO, such a feature would be very interesting. However, my concerns with
> Bloom Filter
> is that they are insert-only data structures, i.e., it is not possible to
> remove keys once
> they were added. This might render the filter useless over time.
> In a different thread (see discussion in FLINK-8918 [1]), you mentioned
> that the Bloom
> Filters would be growing.
> If we keep them in memory, how can we prevent them from exceeding memory
> boundaries over
> time?
>
>
>

Re: [PROPOSAL] Introduce Elastic Bloom Filter For Flink

Reply via email to