Hi Stephan,


Thank you very much for the reply and very happy for that!


I'm not sure whether I understood your idea correctly. Does it means 1) we 
should add a new operator with the feature of Elastic Bloom Filter? or 2) we 
could support it as the current (version <= 1.5) InternalTimerServer as an 
independent managed keyed state?


- If that means to introduce a new operator for supporting EBF, I think I not 
sure whether it's easy to integerate the some of the current features of SQL 
with EBF since their were built on other operaters. E.g. the "stream join" and 
the "distinct".


- If that means to maintain the EBF as the current TimeServer(which's not 
integrated into state backend), this is the implementation which we are using 
on production currently.


Please let me know if I totally misunderstood...


Best, Sihua
On 07/5/2018 04:34,Stephan Ewen<se...@apache.org> wrote:
Hi Sihua!


Sorry for joining this discussion late.


I can see the benefit of such a feature and also see the technical merit. It is 
a nice piece of work and a good proposal.


I am wondering if there is a way to add such a technique as a "library 
operator", or whether it needs a deep integration into the runtime and the 
state backends.


The state backends have currently a few efforts going on, like state TTL, 
making timers part of the state backends, asychronous timer snapshots, scalable 
timers in RocksDB, avoiding small file fragmentation (and too much read/write 
amplification) for RocksDB incremental snapshots, faster state recovery 
efforts, etc.
This is a lot, and all these features are on the list for quite a while, with 
various users pushing for them.


If we can add such BloomFilters as an independent operator, quasi like a 
library or utility, then this is much easier to integrate, because it needs no 
coordination with the other State Backend work. It is also easier to review and 
merge, because it would be a new independent feature and not immediately affect 
all existing state functionality. If this interacts deeply with existing state 
backends, it touches some of the most critical and most active parts of the 
system, which needs a lot of time from the core developers of these parts, 
making it harder and take much longer.


What do you think about looking at whether the elastic bloom filters be added 
like a library operator?


Best,
Stephan




On Tue, Jun 12, 2018 at 4:35 PM, sihua zhou <summerle...@163.com> wrote:
Hi,


Maybe I would like to add more information concerning to the Linked Filter 
Nodes on each key group. The reason that we need to maintance a Linked Filter 
Nodes is that we need to handle data skew, data skew is also the most 
challenging problem that we need to overcome. Because we don't know how many 
records will fall into each key group, so we couldn't allocate a Final Filter 
Node at the begin time, so we need to allocate the Filter Node lazily, each 
time we only allocate a Small Filter Node
for the incoming records, once it filled we freeze it and allocate a new node 
for the future incoming records, so we get a Linked Filter Node on each key 
group and only the head Node is writable, the rest are immutable.


Best, Sihua

On 06/12/2018 16:22,sihua zhou<summerle...@163.com> wrote:
Hi Fabian,


Thanks a lot for your reply, you are right that users would need to configure a 
TTL for the Elastic Filter to recycle the memory resource.


For every Linked BloomFilter Nodes, only the head node is writable, the other 
nodes are all full, they are only immutable(only readable, we implement the 
relaxed ttl based on this feature). Even though we don't  need to remove the 
node, we still always need to insert the data into the current node(the head 
node), because the node is allocated lazily(to handle data skew), each node's a 
can only store "a part" of the data, once the current node is full, we allocate 
a new head node.


Concerning to the cuckoo filters, I also think it seem to be most appropriate 
in theroy. But there are some reasons that I prefer to implement this based on 
BF as the first interation.


- I didn't find a open source lib that provide the a "stable" cuckoo filter, 
maybe we need to implement it ourself, it's not a trivial work.


- The most attraction that cuckoo filter provided is that it support deletion, 
but since the cuckoo filter is a dense data structure, we can't store the time 
stamp with the record in cuckoo filter, we may need to depend on the "extra 
thing"(e.g. timer) to use it's deletion, the performance overhead may not cheap.


- No matter it's cuckoo filter or bloom fiter, they both seems as the "smallest 
storage unit" in the "Elastic Filter", after we provide a implementation base 
on Bloom Filter, it easily to extend to cuckoo filter.


How about to provide the Elastic Filter based on BF as the first iteration and 
provide the version that based on cuckoo filter as a second iteration? What do 
you think?


Best, Sihua
On 06/12/2018 15:43,Fabian Hueske<fhue...@apache.org> wrote:
Hi Sihua,


Sorry for not replying earlier.

I have one question left. If I understood the design of the linked Bloomfilter 
nodes right, users would need to configure a TTL to be able to remove a node.

When nodes are removed, we would need to insert every key into the current node 
which would not be required if we don't remove nodes, right?


From the small summary of approximated filters, cuckoo filters seem to be most 
appropriate as they also support deletes.
Are you aware of any downsides compared to bloom filters (besides potentially 
slower inserts)?


Best, Fabian







2018-06-12 9:29 GMT+02:00 sihua zhou <summerle...@163.com>:

Hi,


no more feedbacks these days...I guess it's because you guys are too busy and 
since I didn't receive any negative feedbacks and there're already some 
positive feedbacks. So I want to implement this *Elastic Bloom Filter* based on 
the current design doc(because I have time to do it currently), even though I 
think the design can be improved definitely, but maybe we could discuss the 
improvement better base on the code, and I believe most of the code could be 
cherry picked for the "final implementation". Does anyone object this?


Best, Sihua




On 06/6/2018 22:02,sihua zhou<summerle...@163.com> wrote:
Hi,


Sorry, but pinging for more feedbacks on this proposal...
Even the negative feedbacks is highly appreciated!


Best, Sihua






On 05/30/2018 13:19,sihua zhou<summerle...@163.com> wrote:
Hi,


I did a survey of the variants of Bloom Filter and the Cuckoo filter these 
days. Finally, I found 3 of them maybe adaptable for our purpose.


1. standard bloom filter (which we have implemented base on this and used it on 
production with a good experience)
2. cuckoo filter, also a very good filter which is a space-efficient data 
structures and support fast query(even faster then BF, but the insert maybe a 
little slower than BF), addtional it support delete() operation.
3. count bloom filter, a variant of BF, it supports delete()operation, but need 
to cost 4-5x memory than the standard bloom filter(so, I'm not sure whether 
it's adaptable in practice).


Anyway, these filters are just the smallest storage unit in this "Elastic Bloom 
Filter", we can define a general interface, and provide different 
implementation of "storage unit"  base on different filter if we want. Maybe I 
should change the PROPOSAL name to the "Introduce Elastic Filter For Flink", 
the ideal of approach that I outlined in the doc is very similar to the paper 
"Optimization and Applications of Dynamic Bloom 
Filters(http://ijarcs.info/index.php/Ijarcs/article/viewFile/826/814)"(compare 
to the paper, the approach I outlined could have a better query performance and 
also support the RELAXED TTL), maybe it can help to understand the desgin doc. 
Looking forward any feedback!


Best, Sihua
On 05/24/2018 10:36,sihua zhou<summerle...@163.com> wrote:
Hi,
Thanks for your suggestions @Elias! I have a brief look at "Cuckoo Filter" and 
"Golumb Compressed Sequence", my first sensation is that maybe "Golumc 
Compressed Sequence" is not a good choose, because it seems to require 
non-constant lookup time, but Cuckoo Filter maybe a good choose, I should 
definitely have a deeper look at it.


Beside, to me, all of this filters seems to a "variant" of the bloom 
filter(which is the smallest unit to store data in the current desgin), the 
main challenge for introducing BF into flink is the data skewed(which is common 
phenomenon on production) problem, could you maybe also have a look at the 
solution that I posted on the google doc 
https://docs.google.com/document/d/17UY5RZ1mq--hPzFx-LfBjCAw_kkoIrI9KHovXWkxNYY/edit?usp=sharing
 for this problem, It would be nice if you could give us some advice on that.


Best, Sihua


On 05/24/2018 07:21,Elias Levy<fearsome.lucid...@gmail.com> wrote:
I would suggest you consider an alternative data structures: a Cuckoo
Filter or a Golumb Compressed Sequence.

The GCS data structure was introduced in Cache-, Hash- and Space-Efficient
Bloom Filters
<http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf> by
F. Putze, P. Sanders, and J. Singler.  See section 4.



We should discuss which exact implementation of bloom filters are the best
fit.
@Fabian: There are also implementations of bloom filters that use counting
and therefore support
deletes, but obviously this comes at the cost of a potentially higher
space consumption.

Am 23.05.2018 um 11:29 schrieb Fabian Hueske <fhue...@gmail.com>:
IMO, such a feature would be very interesting. However, my concerns with
Bloom Filter
is that they are insert-only data structures, i.e., it is not possible to
remove keys once
they were added. This might render the filter useless over time.
In a different thread (see discussion in FLINK-8918 [1]), you mentioned
that the Bloom
Filters would be growing.
If we keep them in memory, how can we prevent them from exceeding memory
boundaries over
time?






Reply via email to