I think if the concept makes sense we can add it and runners can reject unsupported pipelines. Starting from the use case is a good way to go.
Kenn On Wed, Apr 25, 2018, 18:23 Reuven Lax <[email protected]> wrote: > A few questions: > > Do you want the state to still be associated with the window? In Beam the > window is per key only. While some window types (e.g. fixed windows) assign > the same windows to all keys giving the illusion that there a single window > across keys, in actuality each key has its own separate set of windows. > > Given that execution of a ParDo is spread across many workers and every > worker can read and write state, how would you prevent state from being > corrupted? Beam runners today ensure that a single key is processed on a > single worker, so at any point in time there is only one writer. > > How do you imagine this to be implemented? The current Beam runners (Spark > Flink Dataflow) all support per-key state natively. Flink has extremely > limited support for operator state, but it's only useful in certain cases. > I'm not sure any of current runners can easily model this. > > What exactly are the use cases people are trying to code up? Simply > porting another system's programming model onto Beam probably won't work > very well. It would be better to understand what problems people are trying > to solve, and to understand how to solve those inside the Beam model. > > Reuven > > On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <[email protected]> wrote: > >> Hi, >> >> I am working on adding the stateful ParDo to the upcoming BEAM Samza >> runner, and realized that the state for each ParDo processElement() is not >> only associated with the window of the element, but also the key of the >> element. Chatted with Kenneth over email about this design decision, which >> has the following benefits for keyed state: >> >> 1) No synchronization >> 2) Simple programming model >> 3) No communication between works >> >> The current design doesn't support accessing the state across different >> keys, which seems to be a more general use case. This use case is also very >> common inside LinkedIn where the users have access to the entire state of >> an operator/task, and performing lookups and computations on top of it. >> It's quite hard to make every user here aware that the state is also >> tightly associated with key of the element.. From the stateful ParDo API >> the state looks pretty general too. I am wondering is it possible to extend >> the current API to support both keyed and non-keyed state? Even internally >> BEAM assigns a dummy key for to associate the state with all the elements. >> It will be very beneficial to existing Samza users and help them adopt BEAM. >> >> Thanks, >> Xinyu >> >
