Yeah the danger with out of order processing concerns me more than the
merging as well. As a new Beam user, I immediately gravitated towards
ValueState since it was easy to think about and I just assumed there wasn't
anything to be concerned about. So it was shocking to learn that there is
this dangerous edge-case.

What if ValueState were just implemented as a wrapper of CombiningState
with a LatestCombineFn and documented as such (and perhaps we encourage
users to consider using a CombiningState explicitly if at all possible)?

Brian



On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > You could use a CombiningState with a CombineFn that returns the minimum
> for this case.
>
> We've also wanted to be able to set data when setting a timer that
> would be returned when the timer fires. (It's in the FnAPI, but not
> the SDKs yet.)
>
> The metadata is an interesting usecase, do you have some more specific
> examples? Might boil down to not having a rich enough (single) state
> type.
>
> > But I've come to feel there is a mismatch. On the one hand,
> ParDo(<stateful DoFn>) is a way to drop to a lower level and write logic
> that does not fit a more general computational pattern, really taking fine
> control. On the other hand, automatically merging state via CombiningState
> or BagState is more of a no-knobs higher level of programming. To me there
> seems to be a bit of a philosophical conflict.
> >
> > These days, I feel like an @OnMerge method would be more natural. If you
> are using state and timers, you probably often want more direct control
> over how state from windows gets merged. An of course we don't even have a
> design for timers - you would need some kind of timestamp CombineFn but I
> think setting/unsetting timers manually makes more sense. Especially
> considering the trickiness around merging windows in the absence of
> retractions, you really need this callback, so you can issue retractions
> manually for any output your stateful DoFn emitted in windows that no
> longer exist.
>
> I agree we'll probably need an @OnMerge. On the other hand, I like
> being able to have good defaults. The high/low level thing is a
> continuum (the indexing example falling towards the high end).
>
> Actually, the merging questions bother me less than how easy it is to
> accidentally clobber previous values. It looks so easy (like the
> easiest state to use) but is actually the most dangerous. If one wants
> this behavior, I would rather an explicit AnyCombineFn or
> LatestCombineFn which makes you think about the semantics.
>
> - Robert
>
> > On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni <r...@google.com> wrote:
> >>
> >> +1 on the metadata use case.
> >>
> >> For performance reasons the Timer API does not support a read()
> operation, which for the  vast majority of use cases is not a required
> feature. In the small set of use cases where it is needed, for example when
> you need to set a Timer in EventTime based on the smallest timestamp seen
> in the elements within a DoFn, we can make use of a ValueState object to
> keep track of the value.
> >>
> >> On Fri, 26 Apr 2019 at 00:38, Reuven Lax <re...@google.com> wrote:
> >>>
> >>> I see examples of people using ValueState that I think are not
> captured CombiningState. For example, one common one is users who set a
> timer and then record the timestamp of that timer in a ValueState. In
> general when you store state that is metadata about other state you store,
> then ValueState will usually make more sense than CombiningState.
> >>>
> >>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette <bhule...@google.com>
> wrote:
> >>>>
> >>>> Currently the Python SDK does not make ValueState available to users.
> My initial inclination was to go ahead and implement it there to be
> consistent with Java, but Robert brings up a great point here that
> ValueState has an inherent race condition for out of order data, and a lot
> of it's use cases can actually be implemented with a CombiningState instead.
> >>>>
> >>>> It seems to me that at the very least we should discourage the use of
> ValueState by noting the danger in the documentation and preferring
> CombiningState in examples, and perhaps we should go further and deprecate
> it in Java and not implement it in python. Either way I think we should be
> consistent between Java and Python.
> >>>>
> >>>> I'm curious what people think about this, are there use cases that we
> really need to keep ValueState around for?
> >>>>
> >>>> Brian
> >>>>
> >>>> ---------- Forwarded message ---------
> >>>> From: Robert Bradshaw <rober...@google.com>
> >>>> Date: Thu, Apr 25, 2019, 08:31
> >>>> Subject: Re: [docs] Python State & Timers
> >>>> To: dev <dev@beam.apache.org>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels <m...@apache.org>
> wrote:
> >>>>>
> >>>>> Completely agree that CombiningState is nicer in this example. Users
> may
> >>>>> still want to use ValueState when there is nothing to combine.
> >>>>
> >>>>
> >>>> I've always had trouble coming up with any good examples of this.
> >>>>
> >>>>> Also,
> >>>>> users already know ValueState from the Java SDK.
> >>>>
> >>>>
> >>>> Maybe we should deprecate that :)
> >>>>
> >>>>
> >>>>> On 25.04.19 17:12, Robert Bradshaw wrote:
> >>>>> > On Thu, Apr 25, 2019 at 4:58 PM Maximilian Michels <m...@apache.org>
> wrote:
> >>>>> >>
> >>>>> >> I forgot to give an example, just to clarify for others:
> >>>>> >>
> >>>>> >>> What was the specific example that was less natural?
> >>>>> >>
> >>>>> >> Basically every time we use ListState to express ValueState, e.g.
> >>>>> >>
> >>>>> >>     next_index, = list(state.read()) or [0]
> >>>>> >>
> >>>>> >> Taken from:
> >>>>> >>
> https://github.com/apache/beam/pull/8363/files#diff-ba1a2aed98079ccce869cd660ca9d97dR301
> >>>>> >
> >>>>> > Yes, ListState is much less natural here. I think generally
> >>>>> > CombiningValue is often a better replacement. E.g. the Java example
> >>>>> > reads
> >>>>> >
> >>>>> >
> >>>>> > public void processElement(
> >>>>> >        ProcessContext context, @StateId("index")
> ValueState<Integer> index) {
> >>>>> >      int current = firstNonNull(index.read(), 0);
> >>>>> >      context.output(KV.of(current, context.element()));
> >>>>> >      index.write(current+1);
> >>>>> > }
> >>>>> >
> >>>>> >
> >>>>> > which is replaced with bag state
> >>>>> >
> >>>>> >
> >>>>> > def process(self, element, state=DoFn.StateParam(INDEX_STATE)):
> >>>>> >      next_index, = list(state.read()) or [0]
> >>>>> >      yield (element, next_index)
> >>>>> >      state.clear()
> >>>>> >      state.add(next_index + 1)
> >>>>> >
> >>>>> >
> >>>>> > whereas CombiningState would be more natural (than ListState, and
> >>>>> > arguably than even ValueState), giving
> >>>>> >
> >>>>> >
> >>>>> > def process(self, element, index=DoFn.StateParam(INDEX_STATE)):
> >>>>> >      yield element, index.read()
> >>>>> >      index.add(1)
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >>
> >>>>> >> -Max
> >>>>> >>
> >>>>> >> On 25.04.19 16:40, Robert Bradshaw wrote:
> >>>>> >>> https://github.com/apache/beam/pull/8402
> >>>>> >>>
> >>>>> >>> On Thu, Apr 25, 2019 at 4:26 PM Robert Bradshaw <
> rober...@google.com> wrote:
> >>>>> >>>>
> >>>>> >>>> Oh, this is for the indexing example.
> >>>>> >>>>
> >>>>> >>>> I actually think using CombiningState is more cleaner than
> ValueState.
> >>>>> >>>>
> >>>>> >>>>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262
> >>>>> >>>>
> >>>>> >>>> (The fact that one must specify the accumulator coder is,
> however,
> >>>>> >>>> unfortunate. We should probably infer that if we can.)
> >>>>> >>>>
> >>>>> >>>> On Thu, Apr 25, 2019 at 4:19 PM Robert Bradshaw <
> rober...@google.com> wrote:
> >>>>> >>>>>
> >>>>> >>>>> The desire was to avoid the implicit disallowed combination
> wart in
> >>>>> >>>>> Python (until we could make sense of it), and also ValueState
> could be
> >>>>> >>>>> surprising with respect to older values overwriting newer
> ones. What
> >>>>> >>>>> was the specific example that was less natural?
> >>>>> >>>>>
> >>>>> >>>>> On Thu, Apr 25, 2019 at 3:01 PM Maximilian Michels <
> m...@apache.org> wrote:
> >>>>> >>>>>>
> >>>>> >>>>>> @Pablo: Thanks for following up with the PR! :)
> >>>>> >>>>>>
> >>>>> >>>>>> @Brian: I was wondering about this as well. It makes the
> Python state
> >>>>> >>>>>> code a bit unnatural. I'd suggest to add a ValueState wrapper
> around
> >>>>> >>>>>> ListState/CombiningState.
> >>>>> >>>>>>
> >>>>> >>>>>> @Robert: Like Reuven pointed out, we can disallow ValueState
> for merging
> >>>>> >>>>>> windows with state.
> >>>>> >>>>>>
> >>>>> >>>>>> @Reza: Great. Let's make sure it has Python examples out of
> the box.
> >>>>> >>>>>> Either Pablo or me could help there.
> >>>>> >>>>>>
> >>>>> >>>>>> Thanks,
> >>>>> >>>>>> Max
> >>>>> >>>>>>
> >>>>> >>>>>> On 25.04.19 04:14, Reza Ardeshir Rokni wrote:
> >>>>> >>>>>>> Pablo, Kenneth and I have a new blog ready for publication
> which covers
> >>>>> >>>>>>> how to create a "looping timer" it allows for default values
> to be
> >>>>> >>>>>>> created in a window when no incoming elements exists. We
> just need to
> >>>>> >>>>>>> clear a few bits before publication, but would be great to
> have that
> >>>>> >>>>>>> also include a python example, I wrote it in java...
> >>>>> >>>>>>>
> >>>>> >>>>>>> Cheers
> >>>>> >>>>>>>
> >>>>> >>>>>>> Reza
> >>>>> >>>>>>>
> >>>>> >>>>>>> On Thu, 25 Apr 2019 at 04:34, Reuven Lax <re...@google.com
> >>>>> >>>>>>> <mailto:re...@google.com>> wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>>       Well state is still not implemented for merging
> windows even for
> >>>>> >>>>>>>       Java (though I believe the idea was to disallow
> ValueState there).
> >>>>> >>>>>>>
> >>>>> >>>>>>>       On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw <
> rober...@google.com
> >>>>> >>>>>>>       <mailto:rober...@google.com>> wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>>           It was unclear what the semantics were for
> ValueState for merging
> >>>>> >>>>>>>           windows. (It's also a bit weird as it's inherently
> a race condition
> >>>>> >>>>>>>           wrt element ordering, unlike Bag and CombineState,
> though you can
> >>>>> >>>>>>>           always implement it as a CombineState that always
> returns the latest
> >>>>> >>>>>>>           value which is a bit more explicit about the
> dangers here.)
> >>>>> >>>>>>>
> >>>>> >>>>>>>           On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette
> >>>>> >>>>>>>           <bhule...@google.com <mailto:bhule...@google.com>>
> wrote:
> >>>>> >>>>>>>            >
> >>>>> >>>>>>>            > That's a great idea! I thought about this too
> after those
> >>>>> >>>>>>>           posts came up on the list recently. I started to
> look into it,
> >>>>> >>>>>>>           but I noticed that there's actually no
> implementation of
> >>>>> >>>>>>>           ValueState in userstate. Is there a reason for
> that? I started
> >>>>> >>>>>>>           to work on a patch to add it but I was just
> curious if there was
> >>>>> >>>>>>>           some reason it was omitted that I should be aware
> of.
> >>>>> >>>>>>>            >
> >>>>> >>>>>>>            > We could certainly replicate the example
> without ValueState
> >>>>> >>>>>>>           by using BagState and clearing it before each
> write, but it
> >>>>> >>>>>>>           would be nice if we could draw a direct parallel.
> >>>>> >>>>>>>            >
> >>>>> >>>>>>>            > Brian
> >>>>> >>>>>>>            >
> >>>>> >>>>>>>            > On Fri, Apr 12, 2019 at 7:05 AM Maximilian
> Michels
> >>>>> >>>>>>>           <m...@apache.org <mailto:m...@apache.org>> wrote:
> >>>>> >>>>>>>            >>
> >>>>> >>>>>>>            >> > It would probably be pretty easy to add the
> corresponding
> >>>>> >>>>>>>           code snippets to the docs as well.
> >>>>> >>>>>>>            >>
> >>>>> >>>>>>>            >> It's probably a bit more work because there is
> no section
> >>>>> >>>>>>>           dedicated to
> >>>>> >>>>>>>            >> state/timer yet in the documentation. Tracked
> here:
> >>>>> >>>>>>>            >> https://jira.apache.org/jira/browse/BEAM-2472
> >>>>> >>>>>>>            >>
> >>>>> >>>>>>>            >> > I've been going over this topic a bit. I'll
> add the
> >>>>> >>>>>>>           snippets next week, if that's fine by y'all.
> >>>>> >>>>>>>            >>
> >>>>> >>>>>>>            >> That would be great. The blog posts are a
> great way to get
> >>>>> >>>>>>>           started with
> >>>>> >>>>>>>            >> state/timers.
> >>>>> >>>>>>>            >>
> >>>>> >>>>>>>            >> Thanks,
> >>>>> >>>>>>>            >> Max
> >>>>> >>>>>>>            >>
> >>>>> >>>>>>>            >> On 11.04.19 20:21, Pablo Estrada wrote:
> >>>>> >>>>>>>            >> > I've been going over this topic a bit. I'll
> add the
> >>>>> >>>>>>>           snippets next week,
> >>>>> >>>>>>>            >> > if that's fine by y'all.
> >>>>> >>>>>>>            >> > Best
> >>>>> >>>>>>>            >> > -P.
> >>>>> >>>>>>>            >> >
> >>>>> >>>>>>>            >> > On Thu, Apr 11, 2019 at 5:27 AM Robert
> Bradshaw
> >>>>> >>>>>>>           <rober...@google.com <mailto:rober...@google.com>
> >>>>> >>>>>>>            >> > <mailto:rober...@google.com <mailto:
> rober...@google.com>>>
> >>>>> >>>>>>>           wrote:
> >>>>> >>>>>>>            >> >
> >>>>> >>>>>>>            >> >     That's a great idea! It would probably
> be pretty easy
> >>>>> >>>>>>>           to add the
> >>>>> >>>>>>>            >> >     corresponding code snippets to the docs
> as well.
> >>>>> >>>>>>>            >> >
> >>>>> >>>>>>>            >> >     On Thu, Apr 11, 2019 at 2:00 PM
> Maximilian Michels
> >>>>> >>>>>>>           <m...@apache.org <mailto:m...@apache.org>
> >>>>> >>>>>>>            >> >     <mailto:m...@apache.org <mailto:
> m...@apache.org>>> wrote:
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>            >> >      > Hi everyone,
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>            >> >      > The Python SDK still lacks
> documentation on state
> >>>>> >>>>>>>           and timers.
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>            >> >      > As a first step, what do you think
> about updating
> >>>>> >>>>>>>           these two blog
> >>>>> >>>>>>>            >> >     posts
> >>>>> >>>>>>>            >> >      > with the corresponding Python code?
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>
> https://beam.apache.org/blog/2017/08/28/timely-processing.html
> >>>>> >>>>>>>            >> >      >
> >>>>> >>>>>>>            >> >      > Thanks,
> >>>>> >>>>>>>            >> >      > Max
> >>>>> >>>>>>>            >> >
> >>>>> >>>>>>>
> >>
> >>
> >>
> >> --
> >>
> >> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
> >>
> >> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>

Reply via email to