Hello and thanks for the subscription! I am using Streaming API to develop a ML algorithm and i would like your opinions regarding the following issues:
*1)* The input is read from a big size file with d-dimensional points, and i want to perform a parallel count window. In each parallel count window, i want to perform a function that maintains a list of buckets in memory in order to be checkpointed(exploiting state feature) . For every parallel count window some(0*)! of the buckets will be updated or deleted. *My thoughts: * As there is no logical key and there is no parallel countWinowAll, the correct way is to perform a parallel flatmap operator? But then i assume that i must implement a custom buffering of input data using ListState to implement the countwindow? Also i could use again another ListState to maintain the list of buckets in memory. But then every time i want to update a specific buffer of the listState i must clear the ListState and reinsert all buffers again(not Optimal for big buffers)? The other way is to use a deterministic pseudo-key and use keyby.countwindow. The number of different keys will be the number of parallelism. In order to update some of the buckets for every key (parallel instance) i am considering the use of mapState(UK=Bucket index,UV= Bucket elements). In that case i think the use of pseudo-key is not the best technique? and also i am going to use unnecessary data shuffle (keyby)? *What is the best way? Or is there another way to solve the previous issues?* *2)*When there is no more input data (EOF) or when a user “asks” for a part-evaluation of the ML algorithm through an external source, i want to collect the list of buckets from the parallel operator instances to another reduce-style operator with parallelism 1 to find the final list (classic scenario of map-reduce). When there is no user query or EOF, I don't want the parallel operator instances to emit anything. *My thoughts:* *I don't know how the user will “ask” the flink parallel operator instances (parallel count window) to emit their results to the downstream operator of parallelism 1.* I don't know how the operator instances will know that the file ended (if i use keyby.countwindow i can use a custom trigger with timer? Else in flatmap case? ) The concept is that the list of buckets in each parallel operator instance is a local Sketch and i want to collect the local sketches when the user “asks” to calculate the final Sketch (*See the following diagram*). Any thoughts are very much appreciated!!! Thank you in advance. <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1474/Untitled.png> -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/