Re: buffering in operators, implementing statistics

2016-05-31 Thread Stavros Kontopoulos
Hi Stephan, An external project would be possible, and it could maybe be merged in the future if it makes sense. Just wanted to point out that in general there is a need, but I understand priorities and may also try to work on these. Best, Stavros On Thu, May 26, 2016 at 10:00 PM, Stephan Ewen wrote: > H

Re: buffering in operators, implementing statistics

2016-05-26 Thread Stephan Ewen
Hi Stavros! I think what Aljoscha wants to say is that the community is a bit hard pressed reviewing new and complex things right now. There are a lot of threads going on already. If you want to work on this, why not make your own GitHub project "Approximate algorithms on Apache Flink" or so? Gr

Re: buffering in operators, implementing statistics

2016-05-25 Thread Aljoscha Krettek
Hi, that link was interesting, thanks! As I said though, it's probably not a good fit for Flink right now. The things that I feel are important right now are: - dynamic scaling: the ability of a streaming pipeline to adapt to changes in the amount of incoming data. This is tricky with stateful o

Re: buffering in operators, implementing statistics

2016-05-23 Thread Stavros Kontopoulos
Hey Aljoscha, Thanks for the useful comments. I have recently looked at Spark sketches: http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics So there must be value in this effort. In my experience counting in general is a common need

Re: buffering in operators, implementing statistics

2016-05-23 Thread Aljoscha Krettek
Hi, no such changes are planned right now. The separation between the keys is very strict in order to make the windowing state re-partitionable so that we can implement dynamic rescaling of the parallelism of a program. The WindowAll is only used for specific cases where you need a Trigger that see

Re: buffering in operators, implementing statistics

2016-05-20 Thread Stavros Kontopoulos
Hi, thanks for the feedback. So there is a limitation due to the parallel windows implementation. Are there no intentions to change that somehow to accommodate similar estimations? Is WindowAll in practice used as a step in the pipeline? I mean, since it's inherently not parallel, it cannot scale, correct? Although there

Re: buffering in operators, implementing statistics

2016-05-20 Thread Aljoscha Krettek
Hi, with how the window API currently works this can only be done for non-parallel windows. For keyed windows everything that happens is scoped to the key of the elements: window contents are kept in per-key state, triggers fire on a per-key basis. Therefore a count-min sketch cannot be used becaus
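For readers unfamiliar with the data structure named in the message above, the following is a minimal count-min sketch in plain Python. It is an illustration only, not Flink code and not what the thread participants wrote: the class name `CountMinSketch`, the `width` and `depth` defaults, and the BLAKE2b-based hashing are all assumptions chosen for this example. It shows that one sketch summarizes counts for many distinct keys at once, and that two sketches with the same dimensions can be merged by adding their tables.

```python
import hashlib


class CountMinSketch:
    """Hypothetical, minimal count-min sketch (illustration only)."""

    def __init__(self, width=256, depth=4):
        self.width = width   # counters per row (assumed default)
        self.depth = depth   # number of hash rows (assumed default)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum over the rows; collisions can only inflate a counter,
        # so the estimate never falls below the true count.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Combine two sketches of identical shape by element-wise addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions differ")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


sketch = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.estimate("a"))  # at least 3; the estimate never undercounts
```

The sketch is only useful when elements with different keys are added to the same instance, which is the property at stake when window state is scoped per key.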

buffering in operators, implementing statistics

2016-05-19 Thread Stavros Kontopoulos
Hi guys, I would like to push forward the work here: https://issues.apache.org/jira/browse/FLINK-2147 Can anyone more familiar with the streaming API verify if this could be a mature task? The intention is to summarize data over a window like in the case of StreamGroupedFold. Specifically implement c