[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955036#comment-15955036
 ] 

Aljoscha Krettek commented on FLINK-2147:
-----------------------------------------

If I'm not mistaken, a count-min sketch or similar sketches serve to reduce the 
size of the state that you need to keep by keeping state for several keys in 
one approximate data structure. For example, assume we have an input stream 
with k different keys. A naive solution for determining frequencies would keep 
k * s (where s is the per-key state size) counters. These counters are 
separate, per-key, and the counter is incremented when we see an element for a 
given key. When using a sketch you can reduce the state size but you no longer 
have state that is separate per key, correct? If this is correct then there is 
no easy way of implementing this in Flink because window state (which is 
regular keyed state) is per-key. Keeping state that is not per-key is quite 
tricky when it comes to changing the parallelism of a topology, which you can 
do with savepoints.

> Approximate calculation of frequencies in data streams
> ------------------------------------------------------
>
>                 Key: FLINK-2147
>                 URL: https://issues.apache.org/jira/browse/FLINK-2147
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataStream API
>            Reporter: Gabor Gevay
>              Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to