Re: Existing transactionality inconsistency in the Beam Java State API

Lukasz Cwik Fri, 25 May 2018 10:57:06 -0700

Great, I was confused in the description that was provided and then the
follow up by Ben. I think its worthwhile to describe the differences with
actual examples of what happens.


On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <[email protected]> wrote:

> I think the return value of read() should always be an immutable value.
>
> Kenn
>
> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <[email protected]> wrote:
>
>> Kenn, in the second example where we are creating views whenever read()
>> is called, is it that the view's underlying data is immutable. For example:
>> Iterable<String> values = state.read();
>> state.append("newValue");
>> If I iterate over values, does values now contain "newValues"?
>>
>>
>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> I see what you mean but I don't agree that futures imply anything other
>>> than "it is a value that you have to force", with deliberately many
>>> possible implementations. When at the point in 1 and you've got
>>>
>>>     interface ReadableState<T> {
>>>         T read()
>>>     }
>>>
>>> and you want to improve performance, both approaches "void readLater()"
>>> and "StateFuture<T> read()" are natural evolutions. They both gain the same
>>> 10x and they both support the "unchanging committed state plus buffered
>>> mutations" implementation well. And snapshots are essentially free for that
>>> implementation if the buffered mutations are stored in a decent data
>>> structure.
>>>
>>> My recollection was that futures were seen as more cumbersome. They
>>> affect the types even for simple uses. The only appealing future API was
>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>> for these to be used in complex ways, so the usability & perf benefits of a
>>> future-based API weren't really realized anyhow.
>>>
>>> The only reason I belabor this is that if we ever wanted to support more
>>> complex use cases, such as concurrent use of state, then my preference
>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>> under the hood of a future-based API.
>>>
>>> Since there is really only one implementation in mind for this, maybe
>>> only one that works reasonably, we should just document it as such. The
>>> docs on ReadableState do make it sound like writes will invalidate the
>>> usefulness of readLater, even though that isn't the case for the intended
>>> implementation strategy.
>>>
>>> Kenn
>>>
>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <[email protected]>
>>> wrote:
>>>
>>>> I think Kenn's second option accurately reflects my memory of the
>>>> original intentions:
>>>>
>>>> 1. I remember we we considered either using the Future interface or
>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>> future implies asynchrony and that the value returned by `get` won't change
>>>> over multiple calls, but we want the latest value each time". So, I
>>>> remember us explicitly considering and rejecting Future, thus the name
>>>> "ReadableState".
>>>>
>>>> 2. The intuition behind the implementation was analogous to a
>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>> pointing at the same cell, they all get the same value regardless of when
>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>> point in time.
>>>>
>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching the
>>>> associated value. This was added after benchmarks showed 10x (if I remember
>>>> correctly) performance improvements in things like GroupAlsoByWindows by
>>>> minimizing round-trips asking for more state. The intuition being -- if we
>>>> need to make an RPC to load one state value, we are better off making an
>>>> RPC to load all the values we need.
>>>>
>>>> Overall, I too lean towards maintaining the second interpretation since
>>>> it seems to be consistent and I believe we had additional reasons for
>>>> preferring it over futures.
>>>>
>>>> Given the confusion, I think strengthening the class documentation
>>>> makes sense -- I note the only hint of the current behavior is that
>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>> should emphasize that and perhaps even mention that the ReadableState
>>>> should be understood as just a reference or handle to the underlying state,
>>>> and thus its value will reflect the latest write.
>>>>
>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>> something like:
>>>>
>>>> interface ReadableMapState<K, V> {
>>>>    ReadableState<V> get(K key);
>>>>    ReadableState<Iterable<V>> getIterable();
>>>>    ReadableState<Map<K, V>> get();
>>>>    // ... more things ...
>>>> }
>>>>
>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>
>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <[email protected]> wrote:
>>>>
>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>>> so we don't have confusing read after write behavior).
>>>>>
>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>> positions.
>>>>>
>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>
>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>>>>> than long doc comment threads.
>>>>>>>
>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>> Sorry for that. I stated this position:
>>>>>>>
>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>  - its methods like isEmpty() or contents() should return immutable
>>>>>>> future values (implicitly means their contents are semantically frozen 
>>>>>>> when
>>>>>>> they are created)
>>>>>>>  - the fact that you created the future is a hint that all necessary
>>>>>>> fetching/computation should be kicked off
>>>>>>>  - later forced with get()
>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>
>>>>>>> I see now that the actual position of some of its original designers
>>>>>>> is:
>>>>>>>
>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>  - its methods return new views on that mutable location
>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>> should be kicked off
>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>> additional local info to give the current value
>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>> naive straight-line coding + autoscaling
>>>>>>>
>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>> second perspective better than the one I have been promoting. There are
>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>> implementation style, and futures are decades old so you can get good 
>>>>>>> APIs
>>>>>>> and performance without inventing anything. But I still like the 
>>>>>>> non-future
>>>>>>> version a little better.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <[email protected]> wrote:
>>>>>>>
>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>> transactionality inconsistencies in the existing Beam Java State API 
>>>>>>>> (these
>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>> therefore having a discussion about this API which can have 
>>>>>>>> implications
>>>>>>>> for its future development in all Beam languages:
>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>
>>>>>>>> If you have an opinion on the possible design approaches, it would
>>>>>>>> be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Reply via email to