Brian, I just want to follow up. Let me know if you are working on this. Otherwise, I can implement ReadModifyWriteState.
On Mon, May 6, 2019 at 4:52 PM Reza Rokni <[email protected]> wrote: > When used as metadata I think the ReadModifyWrite naming is very accurate > for the majority of cases. > > The only case that does not follow that pattern is if its being used as a > Boolean to indicate that something should be done in the OnTimer call based > on an event that has been seen in the OnProcess code. But even then calling > it ReadModifyWrite would still be ok and easily understood. > > *From: *Kenneth Knowles <[email protected]> > *Date: *Fri, 3 May 2019 at 10:58 > *To: *dev > > Agree with all of your points about the drawbacks of ValueState. It is >> definitely a pro/con weighing sort of situation. Considering the number of >> users who are new to the orthogonality of event time and processing time, >> ValueState could certainly lead to confusion about why things are not in >> any particular order. Perhaps a middle ground is to have a bad of "written >> values" with sequence numbers, and document some patterns/examples of how a >> user may care to resolve these sequences numbers on read, or perhaps some >> combiners like you describe. Overall, I'd defer to Reza and Reuven as they >> seem very familiar with real-world use cases here. >> >> Kenn >> >> On Thu, May 2, 2019 at 2:53 AM Robert Bradshaw <[email protected]> >> wrote: >> >>> On Wed, May 1, 2019 at 8:09 PM Kenneth Knowles <[email protected]> wrote: >>> > >>> > On Wed, May 1, 2019 at 8:51 AM Reuven Lax <[email protected]> wrote: >>> >> >>> >> ValueState is not necessarily racy if you're doing a >>> read-modify-write. It's only racy if you're doing something like writing >>> last element seen. >>> > >>> > Race conditions are not inherently a problem. They are neither >>> necessary nor sufficient for correctness. In this case, it is not the >>> classic sense of race condition anyhow, it is simply a nondeterministic >>> result, which may often be perfectly fine. >>> >>> One can write correct code with ValueState, but it's harder to do. >>> This is exacerbated by the fact that at first glance it looks easier >>> to use. >>> >>> >>>>> On Wed, May 1, 2019 at 8:30 AM Lukasz Cwik <[email protected]> >>> wrote: >>> >>>>>> >>> >>>>>> Isn't a value state just a bag state with at most one element and >>> the usage pattern would be? >>> >>>>>> 1) value_state.get == bag_state.read.next() (both have to handle >>> the case when neither have been set) >>> >>>>>> 2) user logic on what to do with current state + additional >>> information to produce new state >>> >>>>>> 3) value_state.set == bag_state.clear + bag_state.append? (note >>> that Runners should optimize clear + append to become a single >>> transaction/write) >>> > >>> > Your unpacking is accurate, but "X is just a Y" is not accurate. In >>> this case you've demonstrated that value state *can be implemented using* >>> bag state / has a workaround. But it is not subsumed by bag state. One >>> important feature of ValueState is that it is statically determined that >>> the transform cannot be used with merging windows. >>> >>> The flip side is that it makes it easy to write code that cannot be >>> used with merging windows. Which hurts composition (especially if >>> these operations are used as part of larger composite operations). >>> >>> > Another feature is that it is impossible to accidentally write more >>> than one value. And a third important feature is that it declares what it >>> is so that the code is more readable. >>> >>> +1. Which is why I think CombiningState is a better substitute than >>> BagState where it makes sense (and often does, and often can even be >>> an improvement over ValueState for performance and readability). >>> >>> Perhaps instead we could call it ReadModifyWrite state. It could make >>> sense, as well as a read() and write() operation, that we even offer a >>> modify(I, (I, S) -> S) operation. >>> >>> (Also, yes, when I said Latest I too meant a hypothetical "throw away >>> everything else when a new element is written" one, not the specific >>> one in the code. Sorry for the confusion.) >>> >>> >>>>>> For example, the blog post with the counter example would be: >>> >>>>>> @StateId("buffer") >>> >>>>>> private final StateSpec<BagState<Event>> bufferedEvents = >>> StateSpecs.bag(); >>> >>>>>> >>> >>>>>> @StateId("count") >>> >>>>>> private final StateSpec<BagState<Integer>> countState = >>> StateSpecs.bag(); >>> >>>>>> >>> >>>>>> @ProcessElement >>> >>>>>> public void process( >>> >>>>>> ProcessContext context, >>> >>>>>> @StateId("buffer") BagState<Event> bufferState, >>> >>>>>> @StateId("count") BagState<Integer> countState) { >>> >>>>>> >>> >>>>>> int count = Iterables.getFirst(countState.read(), 0); >>> >>>>>> count = count + 1; >>> >>>>>> countState.clear(); >>> >>>>>> countState.append(count); >>> >>>>>> bufferState.add(context.element()); >>> >>>>>> >>> >>>>>> if (count > MAX_BUFFER_SIZE) { >>> >>>>>> for (EnrichedEvent enrichedEvent : >>> enrichEvents(bufferState.read())) { >>> >>>>>> context.output(enrichedEvent); >>> >>>>>> } >>> >>>>>> bufferState.clear(); >>> >>>>>> countState.clear(); >>> >>>>>> } >>> >>>>>> } >>> >>>>>> >>> >>>>>> On Tue, Apr 30, 2019 at 5:39 PM Kenneth Knowles <[email protected]> >>> wrote: >>> >>>>>>> >>> >>>>>>> Anything where the state evolves serially but arbitrarily - the >>> toy example is the integer counter in my blog post - needs ValueState. You >>> can't do it with AnyCombineFn. And I think LatestCombineFn is dangerous, >>> especially when it comes to CombingState. ValueState is more explicit, and >>> I still maintain that it is status quo, modulo unimplemented features in >>> the Python SDK. The reads and writes are explicitly serial per (key, >>> window), unlike for a CombiningState. Because of a CombineFn's requirement >>> to be associative and commutative, I would interpret it that the runner is >>> allowed to reorder inputs even after they are "written" by the user's DoFn >>> - for example doing blind writes to an unordered bag, and only later >>> reading the elements out and combining them in arbitrary order. Or any >>> other strategy mixing addInput and mergeAccumulators. It would be a >>> violation of the meaning of CombineFn to overspecify CombiningState further >>> than this. So you cannot actually implement a consistent >>> CombiningState(LatestCombineFn). >>> >>>>>>> >>> >>>>>>> Kenn >>> >>>>>>> >>> >>>>>>> On Tue, Apr 30, 2019 at 5:19 PM Robert Bradshaw < >>> [email protected]> wrote: >>> >>>>>>>> >>> >>>>>>>> On Wed, May 1, 2019 at 1:55 AM Brian Hulette < >>> [email protected]> wrote: >>> >>>>>>>> > >>> >>>>>>>> > Reza - you're definitely not derailing, that's exactly what I >>> was looking for! >>> >>>>>>>> > >>> >>>>>>>> > I've actually recently encountered an additional use case >>> where I'd like to use ValueState in the Python SDK. I'm experimenting with >>> an ArrowBatchingDoFn that uses state and timers to batch up python >>> dictionaries into arrow record batches (actually my entire purpose for >>> jumping down this python state rabbit hole). >>> >>>>>>>> > >>> >>>>>>>> > At first blush it seems like the best way to do this would be >>> to just replicate the batching approach in the timely processing post [1], >>> but when the bag is full combine the elements into an arrow record batch, >>> rather than enriching all of the elements and writing them out separately. >>> However, if possible I'd like to pre-allocate buffers for each column and >>> populate them as elements arrive (at least for columns with a fixed size >>> type), so a bag state wouldn't be ideal. >>> >>>>>>>> >>> >>>>>>>> It seems it'd be preferable to do the conversion from a bag of >>> >>>>>>>> elements to a single arrow frame all at once, when emitting, >>> rather >>> >>>>>>>> than repeatedly reading and writing the partial batch to and >>> from >>> >>>>>>>> state with every element that comes in. (Bag state has blind >>> append.) >>> >>>>>>>> >>> >>>>>>>> > Also, a CombiningValueState is not ideal because I'd need to >>> implement a merge_accumulators function that combines several in-progress >>> batches. I could certainly implement that, but I'd prefer that it never be >>> called unless absolutely necessary, which doesn't seem to be the case for >>> CombiningValueState. (As an aside, maybe there's some room there for a >>> middle ground between ValueState and CombiningValueState >>> >>>>>>>> >>> >>>>>>>> This does actually feel natural (to me), because you're >>> repeatedly >>> >>>>>>>> adding elements to build something up. merge_accumulators would >>> >>>>>>>> probably be pretty easy (concatenation) but unless your windows >>> are >>> >>>>>>>> merging could just throw a not implemented error to really guard >>> >>>>>>>> against it being used. >>> >>>>>>>> >>> >>>>>>>> > I suppose you could argue that this is a pretty low-level >>> optimization we should be able to shield our users from, but right now I >>> just wish I had ValueState in python so I didn't have to hack it up with a >>> BagState :) >>> >>>>>>>> > >>> >>>>>>>> > Anyway, in light of this and all the other use-cases >>> mentioned here, I think the resolution is to just implement ValueState in >>> python, and document the danger with ValueState in both Python and Java. >>> Just to be clear, the danger I'm referring to is that users might easily >>> forget that data can be out of order, and use ValueState in a way that >>> assumes it's been populated with data from the most recent element in event >>> time, then in practice out of order data clobbers their state. I'm happy to >>> write up a PR for this - are there any objections to that? >>> >>>>>>>> >>> >>>>>>>> I still haven't seen a good case for it (though I haven't >>> looked at >>> >>>>>>>> Reza's BiTemporalStream yet). Much harder to remove things once >>> >>>>>>>> they're in. Can we just add a Any and/or LatestCombineFn and >>> use (and >>> >>>>>>>> point to) that instead? With the comment that if you're doing >>> >>>>>>>> read-modify-write, an add_input may be better. >>> >>>>>>>> >>> >>>>>>>> > [1] >>> https://beam.apache.org/blog/2017/08/28/timely-processing.html >>> >>>>>>>> > >>> >>>>>>>> > On Mon, Apr 29, 2019 at 12:23 AM Robert Bradshaw < >>> [email protected]> wrote: >>> >>>>>>>> >> >>> >>>>>>>> >> On Mon, Apr 29, 2019 at 3:43 AM Reza Rokni <[email protected]> >>> wrote: >>> >>>>>>>> >> > >>> >>>>>>>> >> > @Robert Bradshaw Some examples, mostly built out from >>> cases around Timeseries data, don't want to derail this thread so at a hi >>> level : >>> >>>>>>>> >> >>> >>>>>>>> >> Thanks. Perfectly on-topic for the thread. >>> >>>>>>>> >> >>> >>>>>>>> >> > Looping timers, a timer which allows for creation of a >>> value within a window when no external input has been seen. Requires >>> metadata like "is timer set". >>> >>>>>>>> >> > >>> >>>>>>>> >> > BiTemporalStream join, where we need to match >>> leftCol.timestamp to a value == (max(rightCol.timestamp) where >>> rightCol.timestamp <= leftCol.timestamp)) , this if for a application >>> matching trades to quotes. >>> >>>>>>>> >> >>> >>>>>>>> >> I'd be interested in seeing the code here. The fact that you >>> have a >>> >>>>>>>> >> max here makes me wonder if combining would be applicable. >>> >>>>>>>> >> >>> >>>>>>>> >> (FWIW, I've long thought it would be useful to do this kind >>> of thing >>> >>>>>>>> >> with Windows. Basically, it'd be like session windows with >>> one side >>> >>>>>>>> >> being the window from the timestamp forward into the future, >>> and the >>> >>>>>>>> >> other side being from the timestamp back a certain amount in >>> the past. >>> >>>>>>>> >> This seems a common join pattern.) >>> >>>>>>>> >> >>> >>>>>>>> >> > Metadata is used for >>> >>>>>>>> >> > >>> >>>>>>>> >> > Taking the Key from the KV for use within the OnTimer >>> call. >>> >>>>>>>> >> > Knowing where we are in watermarks for GC of objects in >>> state. >>> >>>>>>>> >> > More timer metadata (min timer ..) >>> >>>>>>>> >> > >>> >>>>>>>> >> > It could be argued that what we are using state for mostly >>> workarounds for things that could eventually end up in the API itself. For >>> example >>> >>>>>>>> >> > >>> >>>>>>>> >> > There is a Jira for OnTimer Context to have Key. >>> >>>>>>>> >> > The GC needs are mostly due to not having a Map State >>> object in all runners yet. >>> >>>>>>>> >> >>> >>>>>>>> >> Yeah. GC could probably be done with a max combine. The Key >>> (which >>> >>>>>>>> >> should be in the API) could be an AnyCombine for now (safe to >>> >>>>>>>> >> overwrite because it's always the same). >>> >>>>>>>> >> >>> >>>>>>>> >> > However I think as folks explore Beam there will always be >>> little things that require Metadata and so having access to something which >>> gives us fine grain control ( as Kenneth mentioned) is useful. >>> >>>>>>>> >> >>> >>>>>>>> >> Likely. I guess in line with making easy things easy, I'd >>> like to make >>> >>>>>>>> >> dangerous things hard(er). As Kenn says, we'll probably need >>> some kind >>> >>>>>>>> >> of lower-level thing, especially if we introduce OnMerge. >>> >>>>>>>> >> >>> >>>>>>>> >> > Cheers >>> >>>>>>>> >> > >>> >>>>>>>> >> > Reza >>> >>>>>>>> >> > >>> >>>>>>>> >> > On Sat, 27 Apr 2019 at 02:59, Kenneth Knowles < >>> [email protected]> wrote: >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> To be clear, the intent was always that ValueState would >>> be not usable in merging pipelines. So no danger of clobbering, but also >>> limited functionality. Is there a runner than accepts it and clobbers? The >>> whole idea of the new DoFn is that it is easy to do the construction-time >>> analysis and reject the invalid pipeline. It is actually runner independent >>> and I think already implemented in ParDo's validation, no? >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> Kenn >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> On Fri, Apr 26, 2019 at 10:14 AM Lukasz Cwik < >>> [email protected]> wrote: >>> >>>>>>>> >> >>> >>> >>>>>>>> >> >>> I am in the camp where we should only support merging >>> state (either naturally via things like bags or via combiners). I believe >>> that having the wrapper that Brian suggests is useful for users. As for the >>> @OnMerge method, I believe combiners should have the ability to look at the >>> window information and we should treat @OnMerge as syntactic sugar over a >>> combiner if the combiner API is too cumbersome. >>> >>>>>>>> >> >>> >>> >>>>>>>> >> >>> I believe using combiners can also extend to side inputs >>> and help us deal with singleton and map like side inputs when multiple >>> firings occur. I also like treating everything like a combiner because it >>> will give us a lot reuse of combiner implementations across all the places >>> they could be used and will be especially useful when we start exposing >>> APIs related to retractions on combiners. >>> >>>>>>>> >> >>> >>> >>>>>>>> >> >>> On Fri, Apr 26, 2019 at 9:43 AM Brian Hulette < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>> >>> >>>>>>>> >> >>>> Yeah the danger with out of order processing concerns >>> me more than the merging as well. As a new Beam user, I immediately >>> gravitated towards ValueState since it was easy to think about and I just >>> assumed there wasn't anything to be concerned about. So it was shocking to >>> learn that there is this dangerous edge-case. >>> >>>>>>>> >> >>>> >>> >>>>>>>> >> >>>> What if ValueState were just implemented as a wrapper >>> of CombiningState with a LatestCombineFn and documented as such (and >>> perhaps we encourage users to consider using a CombiningState explicitly if >>> at all possible)? >>> >>>>>>>> >> >>>> >>> >>>>>>>> >> >>>> Brian >>> >>>>>>>> >> >>>> >>> >>>>>>>> >> >>>> >>> >>>>>>>> >> >>>> >>> >>>>>>>> >> >>>> On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>>> > >>> >>>>>>>> >> >>>>> > You could use a CombiningState with a CombineFn that >>> returns the minimum for this case. >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> We've also wanted to be able to set data when setting >>> a timer that >>> >>>>>>>> >> >>>>> would be returned when the timer fires. (It's in the >>> FnAPI, but not >>> >>>>>>>> >> >>>>> the SDKs yet.) >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> The metadata is an interesting usecase, do you have >>> some more specific >>> >>>>>>>> >> >>>>> examples? Might boil down to not having a rich enough >>> (single) state >>> >>>>>>>> >> >>>>> type. >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> > But I've come to feel there is a mismatch. On the >>> one hand, ParDo(<stateful DoFn>) is a way to drop to a lower level and >>> write logic that does not fit a more general computational pattern, really >>> taking fine control. On the other hand, automatically merging state via >>> CombiningState or BagState is more of a no-knobs higher level of >>> programming. To me there seems to be a bit of a philosophical conflict. >>> >>>>>>>> >> >>>>> > >>> >>>>>>>> >> >>>>> > These days, I feel like an @OnMerge method would be >>> more natural. If you are using state and timers, you probably often want >>> more direct control over how state from windows gets merged. An of course >>> we don't even have a design for timers - you would need some kind of >>> timestamp CombineFn but I think setting/unsetting timers manually makes >>> more sense. Especially considering the trickiness around merging windows in >>> the absence of retractions, you really need this callback, so you can issue >>> retractions manually for any output your stateful DoFn emitted in windows >>> that no longer exist. >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> I agree we'll probably need an @OnMerge. On the other >>> hand, I like >>> >>>>>>>> >> >>>>> being able to have good defaults. The high/low level >>> thing is a >>> >>>>>>>> >> >>>>> continuum (the indexing example falling towards the >>> high end). >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> Actually, the merging questions bother me less than >>> how easy it is to >>> >>>>>>>> >> >>>>> accidentally clobber previous values. It looks so easy >>> (like the >>> >>>>>>>> >> >>>>> easiest state to use) but is actually the most >>> dangerous. If one wants >>> >>>>>>>> >> >>>>> this behavior, I would rather an explicit AnyCombineFn >>> or >>> >>>>>>>> >> >>>>> LatestCombineFn which makes you think about the >>> semantics. >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> - Robert >>> >>>>>>>> >> >>>>> >>> >>>>>>>> >> >>>>> > On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> +1 on the metadata use case. >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> For performance reasons the Timer API does not >>> support a read() operation, which for the vast majority of use cases is >>> not a required feature. In the small set of use cases where it is needed, >>> for example when you need to set a Timer in EventTime based on the smallest >>> timestamp seen in the elements within a DoFn, we can make use of a >>> ValueState object to keep track of the value. >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> On Fri, 26 Apr 2019 at 00:38, Reuven Lax < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>>> >>> >>> >>>>>>>> >> >>>>> >>> I see examples of people using ValueState that I >>> think are not captured CombiningState. For example, one common one is users >>> who set a timer and then record the timestamp of that timer in a >>> ValueState. In general when you store state that is metadata about other >>> state you store, then ValueState will usually make more sense than >>> CombiningState. >>> >>>>>>>> >> >>>>> >>> >>> >>>>>>>> >> >>>>> >>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> Currently the Python SDK does not make ValueState >>> available to users. My initial inclination was to go ahead and implement it >>> there to be consistent with Java, but Robert brings up a great point here >>> that ValueState has an inherent race condition for out of order data, and a >>> lot of it's use cases can actually be implemented with a CombiningState >>> instead. >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> It seems to me that at the very least we should >>> discourage the use of ValueState by noting the danger in the documentation >>> and preferring CombiningState in examples, and perhaps we should go further >>> and deprecate it in Java and not implement it in python. Either way I think >>> we should be consistent between Java and Python. >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> I'm curious what people think about this, are >>> there use cases that we really need to keep ValueState around for? >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> Brian >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> ---------- Forwarded message --------- >>> >>>>>>>> >> >>>>> >>>> From: Robert Bradshaw <[email protected]> >>> >>>>>>>> >> >>>>> >>>> Date: Thu, Apr 25, 2019, 08:31 >>> >>>>>>>> >> >>>>> >>>> Subject: Re: [docs] Python State & Timers >>> >>>>>>>> >> >>>>> >>>> To: dev <[email protected]> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels < >>> [email protected]> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>> >>>>>>>> >> >>>>> >>>>> Completely agree that CombiningState is nicer in >>> this example. Users may >>> >>>>>>>> >> >>>>> >>>>> still want to use ValueState when there is >>> nothing to combine. >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> I've always had trouble coming up with any good >>> examples of this. >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> Also, >>> >>>>>>>> >> >>>>> >>>>> users already know ValueState from the Java SDK. >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> Maybe we should deprecate that :) >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> On 25.04.19 17:12, Robert Bradshaw wrote: >>> >>>>>>>> >> >>>>> >>>>> > On Thu, Apr 25, 2019 at 4:58 PM Maximilian >>> Michels <[email protected]> wrote: >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >> I forgot to give an example, just to clarify >>> for others: >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>> What was the specific example that was less >>> natural? >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >> Basically every time we use ListState to >>> express ValueState, e.g. >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >> next_index, = list(state.read()) or [0] >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >> Taken from: >>> >>>>>>>> >> >>>>> >>>>> >> >>> https://github.com/apache/beam/pull/8363/files#diff-ba1a2aed98079ccce869cd660ca9d97dR301 >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > Yes, ListState is much less natural here. I >>> think generally >>> >>>>>>>> >> >>>>> >>>>> > CombiningValue is often a better replacement. >>> E.g. the Java example >>> >>>>>>>> >> >>>>> >>>>> > reads >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > public void processElement( >>> >>>>>>>> >> >>>>> >>>>> > ProcessContext context, >>> @StateId("index") ValueState<Integer> index) { >>> >>>>>>>> >> >>>>> >>>>> > int current = firstNonNull(index.read(), >>> 0); >>> >>>>>>>> >> >>>>> >>>>> > context.output(KV.of(current, >>> context.element())); >>> >>>>>>>> >> >>>>> >>>>> > index.write(current+1); >>> >>>>>>>> >> >>>>> >>>>> > } >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > which is replaced with bag state >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > def process(self, element, >>> state=DoFn.StateParam(INDEX_STATE)): >>> >>>>>>>> >> >>>>> >>>>> > next_index, = list(state.read()) or [0] >>> >>>>>>>> >> >>>>> >>>>> > yield (element, next_index) >>> >>>>>>>> >> >>>>> >>>>> > state.clear() >>> >>>>>>>> >> >>>>> >>>>> > state.add(next_index + 1) >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > whereas CombiningState would be more natural >>> (than ListState, and >>> >>>>>>>> >> >>>>> >>>>> > arguably than even ValueState), giving >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > def process(self, element, >>> index=DoFn.StateParam(INDEX_STATE)): >>> >>>>>>>> >> >>>>> >>>>> > yield element, index.read() >>> >>>>>>>> >> >>>>> >>>>> > index.add(1) >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> > >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >> -Max >>> >>>>>>>> >> >>>>> >>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >> On 25.04.19 16:40, Robert Bradshaw wrote: >>> >>>>>>>> >> >>>>> >>>>> >>> https://github.com/apache/beam/pull/8402 >>> >>>>>>>> >> >>>>> >>>>> >>> >>> >>>>>>>> >> >>>>> >>>>> >>> On Thu, Apr 25, 2019 at 4:26 PM Robert >>> Bradshaw <[email protected]> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> >>>> Oh, this is for the indexing example. >>> >>>>>>>> >> >>>>> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> >>>> I actually think using CombiningState is >>> more cleaner than ValueState. >>> >>>>>>>> >> >>>>> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> >>>> >>> https://github.com/apache/beam/blob/release-2.12.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L262 >>> >>>>>>>> >> >>>>> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> >>>> (The fact that one must specify the >>> accumulator coder is, however, >>> >>>>>>>> >> >>>>> >>>>> >>>> unfortunate. We should probably infer that >>> if we can.) >>> >>>>>>>> >> >>>>> >>>>> >>>> >>> >>>>>>>> >> >>>>> >>>>> >>>> On Thu, Apr 25, 2019 at 4:19 PM Robert >>> Bradshaw <[email protected]> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>> The desire was to avoid the implicit >>> disallowed combination wart in >>> >>>>>>>> >> >>>>> >>>>> >>>>> Python (until we could make sense of it), >>> and also ValueState could be >>> >>>>>>>> >> >>>>> >>>>> >>>>> surprising with respect to older values >>> overwriting newer ones. What >>> >>>>>>>> >> >>>>> >>>>> >>>>> was the specific example that was less >>> natural? >>> >>>>>>>> >> >>>>> >>>>> >>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>> On Thu, Apr 25, 2019 at 3:01 PM Maximilian >>> Michels <[email protected]> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>> @Pablo: Thanks for following up with the >>> PR! :) >>> >>>>>>>> >> >>>>> >>>>> >>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>> @Brian: I was wondering about this as >>> well. It makes the Python state >>> >>>>>>>> >> >>>>> >>>>> >>>>>> code a bit unnatural. I'd suggest to add >>> a ValueState wrapper around >>> >>>>>>>> >> >>>>> >>>>> >>>>>> ListState/CombiningState. >>> >>>>>>>> >> >>>>> >>>>> >>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>> @Robert: Like Reuven pointed out, we can >>> disallow ValueState for merging >>> >>>>>>>> >> >>>>> >>>>> >>>>>> windows with state. >>> >>>>>>>> >> >>>>> >>>>> >>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>> @Reza: Great. Let's make sure it has >>> Python examples out of the box. >>> >>>>>>>> >> >>>>> >>>>> >>>>>> Either Pablo or me could help there. >>> >>>>>>>> >> >>>>> >>>>> >>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>> Thanks, >>> >>>>>>>> >> >>>>> >>>>> >>>>>> Max >>> >>>>>>>> >> >>>>> >>>>> >>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>> On 25.04.19 04:14, Reza Ardeshir Rokni >>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> Pablo, Kenneth and I have a new blog >>> ready for publication which covers >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> how to create a "looping timer" it >>> allows for default values to be >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> created in a window when no incoming >>> elements exists. We just need to >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> clear a few bits before publication, but >>> would be great to have that >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> also include a python example, I wrote >>> it in java... >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> Cheers >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> Reza >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> On Thu, 25 Apr 2019 at 04:34, Reuven Lax >>> <[email protected] >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> <mailto:[email protected]>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> Well state is still not >>> implemented for merging windows even for >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> Java (though I believe the idea >>> was to disallow ValueState there). >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> On Wed, Apr 24, 2019 at 1:11 PM >>> Robert Bradshaw <[email protected] >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> <mailto:[email protected]>> >>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> It was unclear what the >>> semantics were for ValueState for merging >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> windows. (It's also a bit >>> weird as it's inherently a race condition >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> wrt element ordering, unlike >>> Bag and CombineState, though you can >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> always implement it as a >>> CombineState that always returns the latest >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> value which is a bit more >>> explicit about the dangers here.) >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> On Wed, Apr 24, 2019 at 10:08 >>> PM Brian Hulette >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> <[email protected] <mailto: >>> [email protected]>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > That's a great idea! I >>> thought about this too after those >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> posts came up on the list >>> recently. I started to look into it, >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> but I noticed that there's >>> actually no implementation of >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> ValueState in userstate. Is >>> there a reason for that? I started >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> to work on a patch to add it >>> but I was just curious if there was >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> some reason it was omitted >>> that I should be aware of. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > We could certainly >>> replicate the example without ValueState >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> by using BagState and clearing >>> it before each write, but it >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> would be nice if we could draw >>> a direct parallel. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > Brian >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> > On Fri, Apr 12, 2019 at >>> 7:05 AM Maximilian Michels >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> <[email protected] <mailto: >>> [email protected]>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > It would probably be >>> pretty easy to add the corresponding >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> code snippets to the docs as >>> well. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> It's probably a bit more >>> work because there is no section >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> dedicated to >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> state/timer yet in the >>> documentation. Tracked here: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> https://jira.apache.org/jira/browse/BEAM-2472 >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > I've been going over >>> this topic a bit. I'll add the >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> snippets next week, if that's >>> fine by y'all. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> That would be great. The >>> blog posts are a great way to get >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> started with >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> state/timers. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> Thanks, >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> Max >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> On 11.04.19 20:21, Pablo >>> Estrada wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > I've been going over >>> this topic a bit. I'll add the >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> snippets next week, >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > if that's fine by y'all. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > Best >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > -P. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > On Thu, Apr 11, 2019 at >>> 5:27 AM Robert Bradshaw >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> <[email protected] <mailto: >>> [email protected]> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > <mailto: >>> [email protected] <mailto:[email protected]>>> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > That's a great idea! >>> It would probably be pretty easy >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> to add the >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > corresponding code >>> snippets to the docs as well. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > On Thu, Apr 11, 2019 >>> at 2:00 PM Maximilian Michels >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> <[email protected] <mailto: >>> [email protected]> >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > <mailto: >>> [email protected] <mailto:[email protected]>>> wrote: >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > Hi everyone, >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > The Python SDK >>> still lacks documentation on state >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> and timers. >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > As a first step, >>> what do you think about updating >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> these two blog >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > posts >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > with the >>> corresponding Python code? >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> https://beam.apache.org/blog/2017/02/13/stateful-processing.html >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> https://beam.apache.org/blog/2017/08/28/timely-processing.html >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > Thanks, >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > > Max >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >> > >>> >>>>>>>> >> >>>>> >>>>> >>>>>>> >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> -- >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> This email may be confidential and privileged. If >>> you received this communication by mistake, please don't forward it to >>> anyone else, please erase all copies and attachments, and please let me >>> know that it has gone to the wrong person. >>> >>>>>>>> >> >>>>> >> >>> >>>>>>>> >> >>>>> >> The above terms reflect a potential business >>> arrangement, are provided solely as a basis for further discussion, and are >>> not intended to be and do not constitute a legally binding obligation. No >>> legally binding obligations will be created, implied, or inferred until an >>> agreement in final form is executed in writing by all parties involved. >>> >>>>>>>> >> > >>> >>>>>>>> >> > >>> >>>>>>>> >> > >>> >>>>>>>> >> > -- >>> >>>>>>>> >> > >>> >>>>>>>> >> > This email may be confidential and privileged. If you >>> received this communication by mistake, please don't forward it to anyone >>> else, please erase all copies and attachments, and please let me know that >>> it has gone to the wrong person. >>> >>>>>>>> >> > >>> >>>>>>>> >> > The above terms reflect a potential business arrangement, >>> are provided solely as a basis for further discussion, and are not intended >>> to be and do not constitute a legally binding obligation. No legally >>> binding obligations will be created, implied, or inferred until an >>> agreement in final form is executed in writing by all parties involved. >>> >> > > -- > > This email may be confidential and privileged. If you received this > communication by mistake, please don't forward it to anyone else, please > erase all copies and attachments, and please let me know that it has gone > to the wrong person. > > The above terms reflect a potential business arrangement, are provided > solely as a basis for further discussion, and are not intended to be and do > not constitute a legally binding obligation. No legally binding obligations > will be created, implied, or inferred until an agreement in final form is > executed in writing by all parties involved. >
