Re: Support non-keyed stateful ParDo

Reuven Lax Wed, 25 Apr 2018 18:31:57 -0700

Do you want execution of a single operator to be distributed across workers,
as is the case for Beam? Or do you imagine a single operator existing on a
single worker?


On Wed, Apr 25, 2018 at 6:28 PM Xinyu Liu <[email protected]> wrote:

> @Robert: for your questions:
>
> 1) A side input won't work for us since it returns the whole collection.
> We use RocksDB, and the state is usually too big to fit in memory.
>
> 2) One way to achieve our use cases is to assign a single key to all the
> elements so they will be associated with the same keyed state. The state
> will still belong to the element's window, as it does today. Kenneth
> mentioned this solution too. It does meet our use case, but it's not very
> convenient for our users.
>
> 3) Sorry if I wasn't clear about the use case. For our usage, it's pretty
> common to store the elements in state, look them up later, and do some
> computation. The elements will be in the same window, but don't need to
> have the same key.
>
> Thanks,
> Xinyu
>
> On Wed, Apr 25, 2018 at 6:02 PM, Robert Bradshaw <[email protected]>
> wrote:
>
>> On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <[email protected]> wrote:
>>
>> > Hi,
>>
>> > I am working on adding the stateful ParDo to the upcoming Beam Samza
>> runner, and realized that the state for each ParDo processElement() is not
>> only associated with the window of the element, but also the key of the
>> element. I chatted with Kenneth over email about this design decision,
>> which has the following benefits for keyed state:
>>
>> > 1) No synchronization
>> > 2) Simple programming model
>> > 3) No communication between workers
>>
>> > The current design doesn't support accessing the state across different
>> keys, which seems to be a more general use case. This use case is also
>> very common inside LinkedIn, where users have access to the entire state
>> of an operator/task and perform lookups and computations on top of it.
>> It's quite hard to make every user here aware that the state is also
>> tightly associated with the key of the element.
>>
>> Would side inputs be applicable here? (They're read-only, but other than
>> that seem to fit the need.)
>>
>> > From the stateful ParDo API, the state looks pretty general too. I am
>> wondering whether it is possible to extend the current API to support
>> both keyed and non-keyed state. Internally, Beam could even assign a
>> dummy key to associate the state with all the elements. It would be very
>> beneficial to existing Samza users and help them adopt Beam.
>>
>> Could you clarify how you would use this dummy key? You could manually add
>> a random key, but in that case it's unlikely that any state stored would
>> get observed again. Across what scope were you thinking state would be
>> stored? The lifetime of the bundle? The worker? The job? How are
>> conflicting writes resolved?
>>
>> Perhaps rather than describing the mechanism (state) that you're trying to
>> use, it'd be helpful to describe the kinds of computations you're trying
>> to perform, to figure out how the model should be adapted/extended if it
>> doesn't meet those needs.
>>
>
>
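
For readers following the thread, the scoping question can be sketched with
a toy model in plain Python. This is not Beam's or Samza's API:
KeyedStateStore, run_stateful, and the sample elements are hypothetical names
made up for illustration. It assumes only what the thread states, namely that
state is scoped per (key, window), and compares natural keys, the single
constant key from point 2, and the random key Robert mentions.

```python
import random
from collections import defaultdict


class KeyedStateStore:
    """Toy model (not a Beam API): one state cell per (key, window) pair."""

    def __init__(self):
        self._cells = defaultdict(list)

    def cell(self, key, window):
        # State is looked up by both key and window, as described in the thread.
        return self._cells[(key, window)]


def run_stateful(elements, key_fn, store):
    """Process (window, value) pairs in order.

    Returns, for each element, how many earlier elements were visible in
    the state cell it was given.
    """
    seen_counts = []
    for window, value in elements:
        state = store.cell(key_fn(value), window)
        seen_counts.append(len(state))
        state.append(value)
    return seen_counts


elements = [("w1", "a"), ("w1", "b"), ("w1", "c"), ("w2", "d")]

# Natural keys: every value is its own key, so no element sees another.
per_key = run_stateful(elements, lambda v: v, KeyedStateStore())

# Single constant key: all elements in a window share one state cell.
single_key = run_stateful(elements, lambda v: "all", KeyedStateStore())

# Random key per element: stored state is effectively never observed again.
rng = random.Random(0)
random_key = run_stateful(
    elements, lambda v: rng.getrandbits(64), KeyedStateStore()
)

print(per_key, single_key, random_key)
```

In this model only the constant key lets later elements in a window see
earlier ones. The likely cost, under the same assumptions, is that all
elements of a window funnel through one state cell, which bears on the
question of whether a single operator is distributed across workers.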
