[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284467#comment-15284467
 ] 

Gabor Gevay commented on FLINK-2147:
------------------------------------

In my opinion, the semantics should be to calculate the statistic for each 
window separately. When to emit is handled by the triggers (as with other 
windowing calculations in Flink). Note that the windows can be quite large, 
such as weekly or monthly.
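
To illustrate the per-window semantics, here is a minimal sketch in plain 
Python. It is not the Flink API: the function name, the (timestamp, element) 
event shape and the tumbling-window assignment are assumptions made for 
illustration, exact counts stand in for the approximate statistic, and all 
windows are emitted at the end instead of being driven by triggers.

```python
from collections import Counter, defaultdict


def counts_per_tumbling_window(events, window_size):
    """Count element frequencies separately for each tumbling window.

    events: iterable of (timestamp, element) pairs (hypothetical shape).
    Returns a dict mapping window start -> Counter of elements in that window.
    """
    windows = defaultdict(Counter)
    for timestamp, element in events:
        # Assign the event to the window that contains its timestamp.
        window_start = timestamp - (timestamp % window_size)
        windows[window_start][element] += 1
    return dict(windows)


events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "c")]
result = counts_per_tumbling_window(events, window_size=10)
print(result[0]["a"], result[10]["a"])  # prints "2 1"
```

Each window keeps its own state, so nothing depends on when the program 
happened to start.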

I think that a statistic about the entire stream is rarely what the user 
actually wants. Flink programs are designed to run for a long time, possibly 
indefinitely, and the starting point of a stream is just when the user 
happened to start the Flink program, which might have no real semantic 
meaning if the Flink program is analyzing some external system.

> Approximate calculation of frequencies in data streams
> ------------------------------------------------------
>
>                 Key: FLINK-2147
>                 URL: https://issues.apache.org/jira/browse/FLINK-2147
>             Project: Flink
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Gabor Gevay
>              Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error bounds also differ from those of the 
> Count-Min sketch algorithm.
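
The Count-Min sketch mentioned in the issue description can be sketched in 
Python roughly as follows. This is an illustration under stated assumptions, 
not a proposed Flink implementation: the class and method names are 
hypothetical, and salted BLAKE2b hashes stand in for the pairwise-independent 
hash family used in the paper. The merge step relies on the fact that two 
sketches built with the same dimensions and hash functions can be combined by 
adding their tables cell by cell, which is what makes a distributed 
implementation convenient.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Standard sizing: overestimate by at most epsilon * N
        # with probability at least 1 - delta (N = total count added).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        # Assumption: one salted BLAKE2b hash per row replaces the
        # pairwise-independent hash functions of the paper.
        salt = row.to_bytes(8, "little")
        data = repr(item).encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        # Valid only when both sketches share dimensions and hash functions.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


left, right = CountMinSketch(), CountMinSketch()
for word in ["flink", "flink", "storm"]:
    left.add(word)
for word in ["flink", "spark"]:
    right.add(word)
left.merge(right)
print(left.estimate("flink"))  # at least 3, the true combined count
```

In a per-window setting, one such sketch would be kept per window (and per 
parallel instance), with the partial sketches merged before the window result 
is emitted.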



