Hello everyone and Happy New Year! Regarding the Heavy Hitter tracking...I wanna do it in a distributed manner.
Thus, 1 -- Round Robin the input stream to a number of parallel map instances (say p = env.parallelism) 2 -- Each one of the p mappers maintains approximately the HH of its corresponding portion of the input, utilizing an algorithm like Space Saving, Misha-Gries etc etc. 3 -- Every now and then I would like to concatenate the state of all the p mappers into one, thus producing the global Space Saving summary for the entire input stream. 4 -- Due to the fact that I wanna balance out things given to the p mappers in the beginning, I wanna use rebalance(), i.e. round robin algorithm --> Thus, its is not possible to use Keyed State. 5 -- So, I am going to use ListCheckpointed state as described in [1]. 6 -- When the "every now and then" happens, I wanna merge the partial summaries and I will emit them through a side output, as described in [2]. The question is the following: [1] shows an example of state-redistribution. So...can I change the parallelism of the p instance parallel .map() from within the operator, and merge the summaries for the HH there just before emitting them to the side output??? Essentially, how should I implement the 6th bullet is my question. Any advice, on it or on the general guideline implementation for getting the aforementioned thing done, is more than welcome. Cheer, Max [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/ [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/side_output.html -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/