To be clear, the intent was always that ValueState would be not usable in
merging pipelines. So no danger of clobbering, but also limited
functionality. Is there a runner than accepts it and clobbers? The whole
idea of the new DoFn is that it is easy to do the construction-time
analysis and reject the invalid pipeline. It is actually runner independent
and I think already implemented in ParDo's validation, no?

Kenn

On Fri, Apr 26, 2019 at 10:14 AM Lukasz Cwik <lc...@google.com> wrote:

> I am in the camp where we should only support merging state (either
> naturally via things like bags or via combiners). I believe that having the
> wrapper that Brian suggests is useful for users. As for the @OnMerge
> method, I believe combiners should have the ability to look at the window
> information and we should treat @OnMerge as syntactic sugar over a combiner
> if the combiner API is too cumbersome.
>
> I believe using combiners can also extend to side inputs and help us deal
> with singleton and map like side inputs when multiple firings occur. I also
> like treating everything like a combiner because it will give us a lot
> reuse of combiner implementations across all the places they could be used
> and will be especially useful when we start exposing APIs related to
> retractions on combiners.
>
> On Fri, Apr 26, 2019 at 9:43 AM Brian Hulette <bhule...@google.com> wrote:
>
>> Yeah the danger with out of order processing concerns me more than the
>> merging as well. As a new Beam user, I immediately gravitated towards
>> ValueState since it was easy to think about and I just assumed there wasn't
>> anything to be concerned about. So it was shocking to learn that there is
>> this dangerous edge-case.
>>
>> What if ValueState were just implemented as a wrapper of CombiningState
>> with a LatestCombineFn and documented as such (and perhaps we encourage
>> users to consider using a CombiningState explicitly if at all possible)?
>>
>> Brian
>>
>>
>>
>> On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles <k...@apache.org> wrote:
>>> >
>>> > You could use a CombiningState with a CombineFn that returns the
>>> minimum for this case.
>>>
>>> We've also wanted to be able to set data when setting a timer that
>>> would be returned when the timer fires. (It's in the FnAPI, but not
>>> the SDKs yet.)
>>>
>>> The metadata is an interesting usecase, do you have some more specific
>>> examples? Might boil down to not having a rich enough (single) state
>>> type.
>>>
>>> > But I've come to feel there is a mismatch. On the one hand,
>>> ParDo(<stateful DoFn>) is a way to drop to a lower level and write logic
>>> that does not fit a more general computational pattern, really taking fine
>>> control. On the other hand, automatically merging state via CombiningState
>>> or BagState is more of a no-knobs higher level of programming. To me there
>>> seems to be a bit of a philosophical conflict.
>>> >
>>> > These days, I feel like an @OnMerge method would be more natural. If
>>> you are using state and timers, you probably often want more direct control
>>> over how state from windows gets merged. An of course we don't even have a
>>> design for timers - you would need some kind of timestamp CombineFn but I
>>> think setting/unsetting timers manually makes more sense. Especially
>>> considering the trickiness around merging windows in the absence of
>>> retractions, you really need this callback, so you can issue retractions
>>> manually for any output your stateful DoFn emitted in windows that no
>>> longer exist.
>>>
>>> I agree we'll probably need an @OnMerge. On the other hand, I like
>>> being able to have good defaults. The high/low level thing is a
>>> continuum (the indexing example falling towards the high end).
>>>
>>> Actually, the merging questions bother me less than how easy it is to
>>> accidentally clobber previous values. It looks so easy (like the
>>> easiest state to use) but is actually the most dangerous. If one wants
>>> this behavior, I would rather an explicit AnyCombineFn or
>>> LatestCombineFn which makes you think about the semantics.
>>>
>>> - Robert
>>>
>>> > On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni <r...@google.com> wrote:
>>> >>
>>> >> +1 on the metadata use case.
>>> >>
>>> >> For performance reasons the Timer API does not support a read()
>>> operation, which for the  vast majority of use cases is not a required
>>> feature. In the small set of use cases where it is needed, for example when
>>> you need to set a Timer in EventTime based on the smallest timestamp seen
>>> in the elements within a DoFn, we can make use of a ValueState object to
>>> keep track of the value.
>>> >>
>>> >> On Fri, 26 Apr 2019 at 00:38, Reuven Lax <re...@google.com> wrote:
>>> >>>
>>> >>> I see examples of people using ValueState that I think are not
>>> captured CombiningState. For example, one common one is users who set a
>>> timer and then record the timestamp of that timer in a ValueState. In
>>> general when you store state that is metadata about other state you store,
>>> then ValueState will usually make more sense than CombiningState.
>>> >>>
>>> >>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette <bhule...@google.com>
>>> wrote:
>>> >>>>
>>> >>>> Currently the Python SDK does not make ValueState available to
>>> users. My initial inclination was to go ahead and implement it there to be
>>> consistent with Java, but Robert brings up a great point here that
>>> ValueState has an inherent race condition for out of order data, and a lot
>>> of it's use cases can actually be implemented with a CombiningState instead.
>>> >>>>
>>> >>>> It seems to me that at the very least we should discourage the use
>>> of ValueState by noting the danger in the documentation and preferring
>>> CombiningState in examples, and perhaps we should go further and deprecate
>>> it in Java and not implement it in python. Either way I think we should be
>>> consistent between Java and Python.
>>> >>>>
>>> >>>> I'm curious what people think about this, are there use cases that
>>> we really need to keep ValueState around for?
>>> >>>>
>>> >>>> Brian
>>> >>>>
>>> >>>> ---------- Forwarded message ---------
>>> >>>> From: Robert Bradshaw <rober...@google.com>
>>> >>>> Date: Thu, Apr 25, 2019, 08:31
>>> >>>> Subject: Re: [docs] Python State & Timers
>>> >>>> To: dev <dev@beam.apache.org>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels <m...@apache.org>
>>> wrote:
>>> >>>>>
>>> >>>>> Completely agree that CombiningState is nicer in this example.
>>> Users may
>>> >>>>> still want to use ValueState when there is nothing to combine.
>>> >>>>
>>> >>>>
>>> >>>> I've always had trouble coming up with any good examples of this.
>>> >>>>
>>> >>>>> Also,
>>> >>>>> users already know ValueState from the Java SDK.
>>> >>>>
>>> >>>>
>>> >>>> Maybe we should deprecate that :)
>>> >>>>
>>> >>>>
>>> >>>>> On 25.04.19 17:12, Robert Bradshaw wrote:
>>> >>>>> > On Thu, Apr 25, 2019 at 4:58 PM Maximilian Michels <
>>> m...@apache.org> wrote:
>>> >>>>> >>
>>> >>>>> >> I forgot to give an example, just to clarify for others:
>>> >>>>> >>
>>> >>>>> >>> What was the specific example that was less natural?
>>> >>>>> >>
>>> >>>>> >> Basically every time we use ListState to express ValueState,
>>> e.g.
>>> >>>>> >>
>>> >>>>> >>     next_index, = list(state.read()) or [0]
>>> >>>>> >>
>>> >>>>> >> Taken from:
>>> >>>>> >>
>>> https://github.com/apache/beam/pull/8363/files#diff-ba1a2aed98079ccce869cd660ca9d97dR301
>>> >>>>> >
>>> >>>>> > Yes, ListState is much less natural here. I think generally
>>> >>>>> > CombiningValue is often a better replacement. E.g. the Java
>>> example
>>> >>>>> > reads
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > public void processElement(
>>> >>>>> >        ProcessContext context, @StateId("index")
>>> ValueState<Integer> index) {
>>> >>>>> >      int current = firstNonNull(index.read(), 0);
>>> >>>>> >      context.output(KV.of(current, context.element()));
>>> >>>>> >      index.write(current+1);
>>> >>>>> > }
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > which is replaced with bag state
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > def process(self, element, state=DoFn.StateParam(INDEX_STATE)):
>>> >>>>> >      next_index, = list(state.read()) or [0]
>>> >>>>> >      yield (element, next_index)
>>> >>>>> >      state.clear()
>>> >>>>> >      state.add(next_index + 1)
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > whereas CombiningState would be more natural (than ListState, and
>>> >>>>> > arguably than even ValueState), giving
>>> >>>>> >
>>> >>>>> >
>>> >>>>> > def process(self, element, index=DoFn.StateParam(INDEX_STATE)):
>>> >>>>> >      yield element, index.read()
>>> >>>>> >      index.add(1)
>>> >>>>> >
>>> >>>>> >
>>> >>>>> >
>>> >>>>> >
>>> >>>>> >>
>>> >>>>> >> -Max
>>> >>>>> >>
>>> >>>>> >> On 25.04.19 16:40, Robert Bradshaw wrote:
>>> >>>>> >>> https://github.com/apache/beam/pull/8402
>>> >>>>> >>>
>>> >>>>> >>> On Thu, Apr 25, 2019 at 4:26 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >>>>> >>>>
>>> >>>>> >>>> Oh, this is for the indexing example.
>>> >>>>> >>>>
>>> >>>>> >>>> I actually think using CombiningState is more cleaner than
>>> ValueState.
>>> >>>>> >>>>
>>> >>>>> >>>>
>>> https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262
>>> >>>>> >>>>
>>> >>>>> >>>> (The fact that one must specify the accumulator coder is,
>>> however,
>>> >>>>> >>>> unfortunate. We should probably infer that if we can.)
>>> >>>>> >>>>
>>> >>>>> >>>> On Thu, Apr 25, 2019 at 4:19 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >>>>> >>>>>
>>> >>>>> >>>>> The desire was to avoid the implicit disallowed combination
>>> wart in
>>> >>>>> >>>>> Python (until we could make sense of it), and also
>>> ValueState could be
>>> >>>>> >>>>> surprising with respect to older values overwriting newer
>>> ones. What
>>> >>>>> >>>>> was the specific example that was less natural?
>>> >>>>> >>>>>
>>> >>>>> >>>>> On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels <
>>> m...@apache.org> wrote:
>>> >>>>> >>>>>>
>>> >>>>> >>>>>> @Pablo: Thanks for following up with the PR! :)
>>> >>>>> >>>>>>
>>> >>>>> >>>>>> @Brian: I was wondering about this as well. It makes the
>>> Python state
>>> >>>>> >>>>>> code a bit unnatural. I'd suggest to add a ValueState
>>> wrapper around
>>> >>>>> >>>>>> ListState/CombiningState.
>>> >>>>> >>>>>>
>>> >>>>> >>>>>> @Robert: Like Reuven pointed out, we can disallow
>>> ValueState for merging
>>> >>>>> >>>>>> windows with state.
>>> >>>>> >>>>>>
>>> >>>>> >>>>>> @Reza: Great. Let's make sure it has Python examples out of
>>> the box.
>>> >>>>> >>>>>> Either Pablo or me could help there.
>>> >>>>> >>>>>>
>>> >>>>> >>>>>> Thanks,
>>> >>>>> >>>>>> Max
>>> >>>>> >>>>>>
>>> >>>>> >>>>>> On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
>>> >>>>> >>>>>>> Pablo, Kenneth and I have a new blog ready for publication
>>> which covers
>>> >>>>> >>>>>>> how to create a "looping timer" it allows for default
>>> values to be
>>> >>>>> >>>>>>> created in a window when no incoming elements exists. We
>>> just need to
>>> >>>>> >>>>>>> clear a few bits before publication, but would be great to
>>> have that
>>> >>>>> >>>>>>> also include a python example, I wrote it in java...
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>> Cheers
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>> Reza
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>> On Thu, 25 Apr 2019 at 04:34, Reuven Lax <re...@google.com
>>> >>>>> >>>>>>> <mailto:re...@google.com>> wrote:
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>>       Well state is still not implemented for merging
>>> windows even for
>>> >>>>> >>>>>>>       Java (though I believe the idea was to disallow
>>> ValueState there).
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>>       On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw <
>>> rober...@google.com
>>> >>>>> >>>>>>>       <mailto:rober...@google.com>> wrote:
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>>           It was unclear what the semantics were for
>>> ValueState for merging
>>> >>>>> >>>>>>>           windows. (It's also a bit weird as it's
>>> inherently a race condition
>>> >>>>> >>>>>>>           wrt element ordering, unlike Bag and
>>> CombineState, though you can
>>> >>>>> >>>>>>>           always implement it as a CombineState that
>>> always returns the latest
>>> >>>>> >>>>>>>           value which is a bit more explicit about the
>>> dangers here.)
>>> >>>>> >>>>>>>
>>> >>>>> >>>>>>>           On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette
>>> >>>>> >>>>>>>           <bhule...@google.com <mailto:bhule...@google.com>>
>>> wrote:
>>> >>>>> >>>>>>>            >
>>> >>>>> >>>>>>>            > That's a great idea! I thought about this too
>>> after those
>>> >>>>> >>>>>>>           posts came up on the list recently. I started to
>>> look into it,
>>> >>>>> >>>>>>>           but I noticed that there's actually no
>>> implementation of
>>> >>>>> >>>>>>>           ValueState in userstate. Is there a reason for
>>> that? I started
>>> >>>>> >>>>>>>           to work on a patch to add it but I was just
>>> curious if there was
>>> >>>>> >>>>>>>           some reason it was omitted that I should be
>>> aware of.
>>> >>>>> >>>>>>>            >
>>> >>>>> >>>>>>>            > We could certainly replicate the example
>>> without ValueState
>>> >>>>> >>>>>>>           by using BagState and clearing it before each
>>> write, but it
>>> >>>>> >>>>>>>           would be nice if we could draw a direct parallel.
>>> >>>>> >>>>>>>            >
>>> >>>>> >>>>>>>            > Brian
>>> >>>>> >>>>>>>            >
>>> >>>>> >>>>>>>            > On Fri, Apr 12, 2019 at 7:05 AM Maximilian
>>> Michels
>>> >>>>> >>>>>>>           <m...@apache.org <mailto:m...@apache.org>> wrote:
>>> >>>>> >>>>>>>            >>
>>> >>>>> >>>>>>>            >> > It would probably be pretty easy to add
>>> the corresponding
>>> >>>>> >>>>>>>           code snippets to the docs as well.
>>> >>>>> >>>>>>>            >>
>>> >>>>> >>>>>>>            >> It's probably a bit more work because there
>>> is no section
>>> >>>>> >>>>>>>           dedicated to
>>> >>>>> >>>>>>>            >> state/timer yet in the documentation.
>>> Tracked here:
>>> >>>>> >>>>>>>            >>
>>> https://jira.apache.org/jira/browse/BEAM-2472
>>> >>>>> >>>>>>>            >>
>>> >>>>> >>>>>>>            >> > I've been going over this topic a bit.
>>> I'll add the
>>> >>>>> >>>>>>>           snippets next week, if that's fine by y'all.
>>> >>>>> >>>>>>>            >>
>>> >>>>> >>>>>>>            >> That would be great. The blog posts are a
>>> great way to get
>>> >>>>> >>>>>>>           started with
>>> >>>>> >>>>>>>            >> state/timers.
>>> >>>>> >>>>>>>            >>
>>> >>>>> >>>>>>>            >> Thanks,
>>> >>>>> >>>>>>>            >> Max
>>> >>>>> >>>>>>>            >>
>>> >>>>> >>>>>>>            >> On 11.04.19 20:21, Pablo Estrada wrote:
>>> >>>>> >>>>>>>            >> > I've been going over this topic a bit.
>>> I'll add the
>>> >>>>> >>>>>>>           snippets next week,
>>> >>>>> >>>>>>>            >> > if that's fine by y'all.
>>> >>>>> >>>>>>>            >> > Best
>>> >>>>> >>>>>>>            >> > -P.
>>> >>>>> >>>>>>>            >> >
>>> >>>>> >>>>>>>            >> > On Thu, Apr 11, 2019 at 5:27 AM Robert
>>> Bradshaw
>>> >>>>> >>>>>>>           <rober...@google.com <mailto:rober...@google.com
>>> >
>>> >>>>> >>>>>>>            >> > <mailto:rober...@google.com <mailto:
>>> rober...@google.com>>>
>>> >>>>> >>>>>>>           wrote:
>>> >>>>> >>>>>>>            >> >
>>> >>>>> >>>>>>>            >> >     That's a great idea! It would probably
>>> be pretty easy
>>> >>>>> >>>>>>>           to add the
>>> >>>>> >>>>>>>            >> >     corresponding code snippets to the
>>> docs as well.
>>> >>>>> >>>>>>>            >> >
>>> >>>>> >>>>>>>            >> >     On Thu, Apr 11, 2019 at 2:00 PM
>>> Maximilian Michels
>>> >>>>> >>>>>>>           <m...@apache.org <mailto:m...@apache.org>
>>> >>>>> >>>>>>>            >> >     <mailto:m...@apache.org <mailto:
>>> m...@apache.org>>> wrote:
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>            >> >      > Hi everyone,
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>            >> >      > The Python SDK still lacks
>>> documentation on state
>>> >>>>> >>>>>>>           and timers.
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>            >> >      > As a first step, what do you think
>>> about updating
>>> >>>>> >>>>>>>           these two blog
>>> >>>>> >>>>>>>            >> >     posts
>>> >>>>> >>>>>>>            >> >      > with the corresponding Python code?
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>
>>> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>
>>> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>>> >>>>> >>>>>>>            >> >      >
>>> >>>>> >>>>>>>            >> >      > Thanks,
>>> >>>>> >>>>>>>            >> >      > Max
>>> >>>>> >>>>>>>            >> >
>>> >>>>> >>>>>>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >>
>>> >> This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>> >>
>>> >> The above terms reflect a potential business arrangement, are
>>> provided solely as a basis for further discussion, and are not intended to
>>> be and do not constitute a legally binding obligation. No legally binding
>>> obligations will be created, implied, or inferred until an agreement in
>>> final form is executed in writing by all parties involved.
>>>
>>

Reply via email to