ML in Streaming API

Thodoris Bitsakis Wed, 30 May 2018 06:08:02 -0700

Hello and thanks for the subscription!

I am using Streaming API to develop a ML algorithm and i would like your
opinions regarding the following issues:

*1)* The input is read from a big size file with d-dimensional points, and i
want to perform a parallel count window. In each parallel count window, i
want to perform a function that maintains a list of buckets in memory in
order to be checkpointed(exploiting state feature) . For every parallel
count window some(0*)! of the buckets will be updated or deleted.

*My thoughts: *
As there is no logical key and there is no parallel countWinowAll, the
correct way is to perform a parallel flatmap operator? But then i assume
that i must implement a custom buffering of input data using ListState to
implement the countwindow? Also i could use again another ListState to
maintain the list of buckets in memory. But then every time i want to update
a specific buffer of the listState i must clear the ListState and reinsert
all buffers again(not Optimal for big buffers)?

The other way is to use a deterministic pseudo-key and use
keyby.countwindow. The number of different keys will be the number of
parallelism. In order to update some of the buckets for every key (parallel
instance) i am considering the use of mapState(UK=Bucket index,UV= Bucket
elements). In that case i think the use of pseudo-key is not the best
technique? and also i am going to use unnecessary data shuffle (keyby)?

*What is the best way? Or is there another way to solve the previous
issues?*

*2)*When there is no more input data (EOF) or when a user “asks” for a
part-evaluation of the ML algorithm through an external source, i want to
collect the list of buckets from the parallel operator instances to another
reduce-style operator with parallelism 1 to find the final list (classic
scenario of map-reduce). When there is no user query or EOF, I don't want
the parallel operator instances to emit anything.

*My thoughts:* *I don't know how the user will “ask” the flink parallel
operator instances (parallel count window) to emit their results to the
downstream operator of parallelism 1.* I don't know how the operator
instances will know that the file ended (if i use keyby.countwindow i can
use a custom trigger with timer? Else in flatmap case? )

The concept is that the list of buckets in each parallel operator instance
is a local Sketch and i want to collect the local sketches when the user
“asks” to calculate the final Sketch (*See the following diagram*).

Any thoughts are very much appreciated!!! Thank you in advance.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1474/Untitled.png>

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

ML in Streaming API

Reply via email to