subject:"\[jira\] \[Commented\] \(FLINK\-2147\) Approximate calculation of frequencies in data streams"

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2021-04-29 Thread Flink Jira Bot (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17336921#comment-17336921
 ] 

Flink Jira Bot commented on FLINK-2147:
---

This issue was labeled "stale-major" 7 ago and has not received any updates so 
it is being deprioritized. If this ticket is actually Major, please raise the 
priority and ask a committer to assign you the issue or revive the public 
discussion.


> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: API / DataStream
>Reporter: Gábor Gévay
>Priority: Major
>  Labels: approximate, stale-major, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2021-04-22 Thread Flink Jira Bot (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328984#comment-17328984
 ] 

Flink Jira Bot commented on FLINK-2147:
---

This major issue is unassigned and itself and all of its Sub-Tasks have not 
been updated for 30 days. So, it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update. Afterwards, 
please remove the label. In 7 days the issue will be deprioritized.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: API / DataStream
>Reporter: Gábor Gévay
>Priority: Major
>  Labels: approximate, stale-major, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2019-04-24 Thread Felipe Oliveira Gutierrez (JIRA)



[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825067#comment-16825067
 ] 

Felipe Oliveira Gutierrez commented on FLINK-2147:
--

Hi [~aljoscha], [~till.rohrmann], and @all from this thread.

I would like to learn the necessary skills to help on this issue. I have 
implemented the CountMinSketch [1] using a ProcessWindowFunction [2]. However, 
what I would like to have is a function that I can call after the window like 
sum(), min(), etc. I have implemented a custom operator to do the Combiner [4], 
but I believe that this issue is a bit more complex since it has to compute 
after the window. So, I was planning to base my implementation on a custom 
JoinFuntion as [~till.rohrmann] did in this example [3].

Where do you think I can start? Do you have any sample code that I can start to 
build up my own custom window operator and afterward strive to implement the 
CountMin-Sketch for Flink?

[1] 
[https://github.com/felipegutierrez/explore-flink/blob/master/src/main/java/org/sense/flink/util/CountMinSketch.java]

[2] 
[https://github.com/felipegutierrez/explore-flink/blob/master/src/main/java/org/sense/flink/examples/stream/MultiSensorMultiStationsReadingMqtt2.java#L69]

[3] [https://github.com/tillrohrmann/custom-join]

[4] 
[https://github.com/felipegutierrez/explore-flink/blob/master/src/main/java/org/sense/flink/examples/stream/MqttSensorDataCombinerByKeySkewedDAG.java#L93]

Kind Regards,
Felipe

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: API / DataStream
>Reporter: Gabor Gevay
>Priority: Major
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-24 Thread JIRA


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981414#comment-15981414
 ] 

Gábor Hermann commented on FLINK-2147:
--

I would prefer emitting updates on a window basis, as a Flink have quite rich 
options for triggering. With overpartitioning, there could be _count-min sketch 
partitions_ (CMS-partitions), more than the number of partitions (i.e. 
subtasks). We could assign the CMS-partition to the input data (based on 
hashing) and keyBy on CMS-partition. Then, we could fold over the CMS-partition 
(a compact array), which is (AFAIK) internally stored as a keyed state. This 
way, we would not keep state for every key separately (saving memory), while 
allowing scaling operators (inc./dec. parallelism). Does that make sense?

Using windows makes easier to define _when to delete old data_ and _when to 
emit results_ and deal with _out-of-orderness_. However, with windows there's 
slightly more memory overhead compared to e.g. storing one count-min sketch 
array per partition.

A question is then *what API should we provide?* The user could specify the 
key, window assigner, trigger, evictor, allowedLateness, and the count-min 
sketch properties (size, hash functions). Then, the window could be translated 
into a another window keyed by the CMS-partition (as I described). But should 
it be a simply function that takes a DataStream as input and returns a 
DataStream with the results? Or should we add DataStream a special 
countMinSketch function to KeyedDataStream?

Alternatively, we could implement count-min sketch without windows. The user 
would specify two streams: one queries and the other writes the count-min 
sketch. So the "triggering" is done by a stream. The problem is then how do we 
specify when to delete old data and how to deal with out-of-orderness?

Another question is *where could we place the API?* In flink-streaming-java 
module? Or flink-streaming-contrib? This, of course, highly depends on what API 
we would provide.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-18 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972600#comment-15972600
 ] 

Aljoscha Krettek commented on FLINK-2147:
-

The problem is now: when do you emit results from this? I.e. on a window basis, 
or with timeouts or do you emit an updated count whenever an event for a given 
key arrives?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-18 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972596#comment-15972596
 ] 

Aljoscha Krettek commented on FLINK-2147:
-

Yes, I think over partitioning according to the number of key groups 
(maxParallelism) could work.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955166#comment-15955166
 ] 

Gabor Gevay commented on FLINK-2147:


Maybe we could over-partition the sketch into maxParallelism parts. (similarly, 
as we have more key-groups than actual partitions)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955164#comment-15955164
 ] 

Stavros Kontopoulos commented on FLINK-2147:


You just pick one of the sketches merge it with another one kill the task (3 
down to 2 case).
For 1 to N. Just split the stream and create N count-min sketched. Wouldn't 
that work?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955147#comment-15955147
 ] 

Aljoscha Krettek commented on FLINK-2147:
-

Yes, but what I'm saying is that it is not easy to deal with these task-local 
states when you change parallelism. For example, assume that you have 
parallelism 3. You have three task-local states. Now, the parallelism is 
changed to 2. How do you redistribute the sketch state? Keep in mind that Flink 
uses a (more or less) fixed partitioner for deciding where to send keyed 
elements. We have this to ensure that elements go to the parallel operator that 
is responsible for a key and that has the correct state.

The reverse problem is even harder, I think. For example. when you want to 
scale from parallelism 1 to a higher parallelism.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955103#comment-15955103
 ] 

Stavros Kontopoulos commented on FLINK-2147:


I think Count-min sketch can be implemented in way that each task keeps a local 
count-min sketch as state, and as a next step it emits the frequencies after an 
aggregation of count-mi sketches. This could be windows based and would involve 
to implement custom operators. This is a high level description and may not fit 
exactly to the internals.

A distributed implementation here:
https://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
https://github.com/apache/spark/pull/10911/files

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955036#comment-15955036
 ] 

Aljoscha Krettek commented on FLINK-2147:
-

If I'm not mistaken, a count-min sketch or similar sketches serve to reduce the 
size of the state that you need to keep by keeping state for several keys in 
one approximate data structure. For example, assume we have an input stream 
with k different keys. A naive solution for determining frequencies would keep 
k * s (where s is the per-key state size) counters. These counters are 
separate, per-key, and the counter is incremented when we see an element for a 
given key. When using a sketch you can reduce the state size but you no longer 
have state that is separate per key, correct? If this is correct then there is 
no easy way of implementing this in Flink because window state (which is 
regular keyed state) is per-key. Keeping state that is not per-key is quite 
tricky when it comes to changing the parallelism of a topology, which you can 
do with savepoints.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-03 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953422#comment-15953422
 ] 

Stavros Kontopoulos commented on FLINK-2147:


[~aljoscha]What is your suggestion for this?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-19 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291006#comment-15291006
 ] 

Gabor Gevay commented on FLINK-2147:


> From a first look, something like StreamGroupedFold would be enough right?

Sorry, I'm not sure. I suggest you ask on the mailing list, and then probably 
someone who knows streaming better than me will respond. Unfortunately I don't 
have enough time now to delve deep into this.

By the way, maybe you could start with this Jira: 
https://issues.apache.org/jira/browse/FLINK-2144
There are some similarities to this one, but it is more straightforward to 
implement.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-18 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289498#comment-15289498
 ] 

Stavros Kontopoulos commented on FLINK-2147:


How do you want to move on? collaborate on a branch on a fork?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-18 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288771#comment-15288771
 ] 

Stavros Kontopoulos commented on FLINK-2147:


>From a first look, something like StreamGroupedFold 
>https://github.com/eBay/Flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/StreamGroupedFold.java
>  ,would be enough right? Define our own operator to keep the value updated.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284484#comment-15284484
 ] 

Stavros Kontopoulos commented on FLINK-2147:


ok i will have a look as well to get familiar with it.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284479#comment-15284479
 ] 

Gabor Gevay commented on FLINK-2147:


Yes. We should probably look in the direction of FoldFunction, since Flink does 
preaggregation for these, if I'm not mistaken.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284476#comment-15284476
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Ok i agree then we calculate statistics per window in isolated manner like sum, 
mean etc without the aggregation in buffer. Ok so lets see how we avoid that 
correct?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284468#comment-15284468
 ] 

Gabor Gevay commented on FLINK-2147:


(Note that the out-of-order problem that I mentioned for the incremental median 
stuff was about the situation that I wanted to re-use some part of the 
calculations across different windows (hence the word "incremental"). This is 
probably not necessary here, we can just work on separate windows separately.)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284467#comment-15284467
 ] 

Gabor Gevay commented on FLINK-2147:


In my opinion, the semantics would be to calculate the statistic only about 
each window separately. When to emit is handled by the triggers (as with other 
windowing calculations in Flink.) (Note that the windows can be quite large, 
like weekly or monthly.)

I think that having a statistic about the entire stream is rarely what the user 
actually wants. Flink programs are designed to run indefinitely for a long 
time, and the starting point of a stream is just when the user happened to 
start the Flink program, which might have no real semantic meaning if the Flink 
program is analyzing some external system.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284446#comment-15284446
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Ok so if there multiple windows evaluated at different times in parallel since 
data comes out of order, what kind of statistic is computable in this model? 
What are the correct semantics here? Emit a statistic update only when ordering 
is reconstructed (appropriate windows are calculated) and delay future results? 
What about count min?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284440#comment-15284440
 ] 

Stavros Kontopoulos commented on FLINK-2147:


For sure if you apply it per window then you need to avoid keeping any data 
after you update your algorithm/structure. If you have window results i guess 
you can update a statistic about the whole stream when this is valid, depending 
on the statistic.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284437#comment-15284437
 ] 

Gabor Gevay commented on FLINK-2147:


I'm actually now closing the incremental median and similar stuff, as it was 
based on the assumption of the old (pre-0.10) windowing API that events come in 
ordered by time, so it doesn't fit into the new API. (See 
https://github.com/apache/flink/pull/684#issuecomment-195402038)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284431#comment-15284431
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Ok... btw I was looking your old PR for median etc, i am wondering what is the 
status of memory management for window buffering in master.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284427#comment-15284427
 ] 

Gabor Gevay commented on FLINK-2147:


OK, we can certainly think about it.

It would probably make sense to apply this to windows, and not the entire 
stream. In this case, we have to figure out how to apply the algorithm already 
when Flink is building the windows. (I mean we can't just stuff the algorithm 
into a WindowFunction, because then Flink would buffer up all the elements of a 
window, which defeats the purpose.) Maybe it can be done with a fold, since I 
think that Flink is already doing preaggregation for folds.

Another issue is how do we want the user to specify the key? I mean it would 
probably not be elegant to apply this to the entire elements, so maybe the user 
should specify a field on which to apply the algorithm. But then maybe it would 
be good if we could fit this with keyBy, by maybe adding this as a method of 
KeyedStream? I'm not sure.

Note that, for example [~aljoscha] might have a more informed opinion on all 
this.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284410#comment-15284410
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Yes i agree the api is the hard part, but we could work on this if you want or 
at least check if it is a mature task now considering stream api stability etc.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284406#comment-15284406
 ] 

Gabor Gevay commented on FLINK-2147:


Yes, I think it would be very good if you worked on this, and starting with a 
design document is a good idea. However, if you haven't yet contributed to 
Flink, then I would suggest starting with some simpler tasks first. Note, that 
the hard part here will not be implementing the algorithm that is described in 
the linked paper, but figuring out how it fits into the API and internals of 
Flink.

(This was a sub-task of my Google Summer of Code project last summer, but then 
I haven't really worked on this, because the streaming API was changing a lot 
at that time. I will convert this sub-task to a stand-alone issue now.)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284360#comment-15284360
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Would be ok to give it a shot in the long term, start with a short design 
document etc (although im newbie with Flink)? 

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284312#comment-15284312
 ] 

Gabor Gevay commented on FLINK-2147:


Unfortunately, no.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-15 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283961#comment-15283961
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Anyone working on this?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

30 matches

Site Navigation

Mail list logo

Footer information