[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284427#comment-15284427
 ] 

Gabor Gevay commented on FLINK-2147:
------------------------------------

OK, we can certainly think about it.

It would probably make sense to apply this to windows, and not the entire 
stream. In this case, we have to figure out how to apply the algorithm already 
when Flink is building the windows. (I mean we can't just stuff the algorithm 
into a WindowFunction, because then Flink would buffer up all the elements of a 
window, which defeats the purpose.) Maybe it can be done with a fold, since I 
think that Flink is already doing preaggregation for folds.

Another issue is how do we want the user to specify the key? I mean it would 
probably not be elegant to apply this to the entire elements, so maybe the user 
should specify a field on which to apply the algorithm. But then maybe it would 
be good if we could fit this with keyBy, by maybe adding this as a method of 
KeyedStream? I'm not sure.

Note that, for example [~aljoscha] might have a more informed opinion on all 
this.

> Approximate calculation of frequencies in data streams
> ------------------------------------------------------
>
>                 Key: FLINK-2147
>                 URL: https://issues.apache.org/jira/browse/FLINK-2147
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Gabor Gevay
>              Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to