Re: DISCUSS: Sorted MapState API

Kenneth Knowles Fri, 24 May 2019 09:58:45 -0700

On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles <[email protected]> wrote:


>
>
> On Fri, May 24, 2019 at 8:14 AM Reuven Lax <[email protected]> wrote:
>
>> Some great comments!
>>
>> *Aljoscha*: absolutely this would have to be implemented by runners to
>> be efficient. We can of course provide a default (inefficient)
>> implementation, but ideally runners would provide better ones.
>>
>> *Jan* Exactly. I think MapState can be dropped or backed by this. E.g.
>>
>> *Robert* Great point about standard coders not satisfying this. That's
>> why I suggested that we provide a way to tag the coders that do preserve
>> order, and only accept those as key coders Alternatively we could present a
>> more limited API - e.g. only allowing a hard-coded set of types to be used
>> as keys - but that seems counter to the direction Beam usually goes.
>>
>
> I think we got it right with GroupByKey: the encoded form of a key is
> authoritative/portable.
>
> Instead of seeing the in-language type as the "real" value and the coder a
> way to serialize it, the portable encoded bytestring is the "real" value
> and the representation in a particular SDK is the responsibility of the SDK.
>
> This is very important, because many in-language representations,
> especially low-level representations, do not have the desired equality. For
> example, Java arrays. Coder.structuralValue(...) is required to have
> equality that matches the equality of the encoded form. It can be a noop if
> the in-language equality already matches. Or it can be a full encoding if
> there is not a more efficient option. I think we could add another method
> "lexicalValue" or add the requirement that structuralValue also sort
> equivalently to the wire format.
>
> Now, since many of our wire formats do not sort in the natural
> mathematical order, SDKs should help users avoid the pitfall of using
> these, as we do for GBK by checking "determinism" of the coder. Note that I
> am *not* referring to the order they would sort in any particular
> programming language implementation.
>

Another related features request. Today we do this:

(1) infer any coder for a type
(2) crash if it is not suitable for GBK

I have proposed in the past that instead we:

(1) notice that a PCollection is input to a GBK
(2) infer an appropriate coder for a type that will work with GBK

This generalizes to the idea of registering multiple coders for different
purposes, particularly inferring a coder that has good lexical sorting.

I don't recall that there was any objection, but neither I nor anyone else
has gotten around to working on this. It does have the problem that it
could change the coders and impact pipeline update. I suggest that update
is fragile enough that we should develop pipeline options that allow opt-in
to improvements without a major version bump.

Kenn


>
> Kenn
>
>
>> So users will have two ways .of creating multimap state specs:
>>
>>    private final StateSpec<MultimapState<Long, String>> state =
>> StateSpecs.multimap(VarLongCoder.of(), StringUtf8Coder.of());
>>
>> or
>>    private final StateSpec<MultimapState<Long, String>> state =
>> StateSpecs.orderedMultimap(VarLongCoder.of(), StringUtf8Coder.of());
>>
>> The second one will validate that the key coder preserves order, and
>> fails otherwise (similar to coder determinism checking in GroupByKey). (BTW
>> we would also have versions of these functions that use coder inference to
>> "guess" the coder, but those will do the same checking)
>>
>> Also the API I proposed did support random access! We could separate out
>> OrderedBagState again if we think the use cases are fundamentally
>> different. I merged the proposal into that of MultimapState because there
>> seemed be 99% overlap.
>>
>> Reuven
>>
>> On Fri, May 24, 2019 at 6:19 AM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> On Fri, May 24, 2019 at 5:32 AM Reuven Lax <[email protected]> wrote:
>>> >
>>> > On Thu, May 23, 2019 at 1:53 PM Ahmet Altay <[email protected]> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik <[email protected]> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, May 23, 2019 at 11:37 AM Rui Wang <[email protected]> wrote:
>>> >>>>>
>>> >>>>> A few obvious problems with this code:
>>> >>>>>   1. Removing the elements already processed from the bag requires
>>> clearing and rewriting the entire bag. This is O(n^2) in the number of
>>> input trades.
>>> >>>>
>>> >>>> why it's not O(2 * n) to clearing and rewriting trade state?
>>> >>>>
>>> >>>>>
>>> >>>>> public interface SortedMultimapState<K, V> extends State {
>>> >>>>>   // Add a value to the map.
>>> >>>>>   void put(K key, V value);
>>> >>>>>   // Get all values for a given key.
>>> >>>>>   ReadableState<Iterable<V>> get(K key);
>>> >>>>>  // Return all entries in the map.
>>> >>>>>   ReadableState<Iterable<KV<K, V>>> allEntries();
>>> >>>>>   // Return all entries in the map with keys <= limit. returned
>>> elements are sorted by the key.
>>> >>>>>   ReadableState<Iterable<KV<K, V>>> entriesUntil(K limit);
>>> >>>>>
>>> >>>>>  // Remove all values with the given key;
>>> >>>>>   void remove(K key);
>>> >>>>>  // Remove all entries in the map with keys <= limit.
>>> >>>>>   void removeUntil(K limit);
>>> >>>>
>>> >>>> Will removeUntilExcl(K limit) also useful? It will remove all
>>> entries in the map with keys < limit.
>>> >>>>
>>> >>>>>
>>> >>>>> Runners will sort based on the encoded value of the key. In order
>>> to make this easier for users, I propose that we introduce a new tag on
>>> Coders PreservesOrder. A Coder that contains this tag guarantees that the
>>> encoded value preserves the same ordering as the base Java type.
>>> >>>>
>>> >>>>
>>> >>>> Could you clarify what is  "encoded value preserves the same
>>> ordering as the base Java type"?
>>> >>>
>>> >>>
>>> >>> Lets say A and B represent two different instances of the same Java
>>> type like a double, then A < B (using the languages comparison operator)
>>> iff encode(A) < encode(B) (note the encoded versions are compared
>>> lexicographically)
>>> >>
>>> >>
>>> >> Since coders are shared across SDKs, do we expect A < B iff e(A) <
>>> e(P) property to hold for all languages we support? What happens A, B sort
>>> differently in different languages?
>>> >
>>> >
>>> > That would have to be the property of the coder (which means that this
>>> property probably needs to be represented in the portability representation
>>> of the coder). I imagine the common use cases will be for simple coders
>>> like int, long, string, etc., which are likely to sort the same in most
>>> languages.
>>>
>>> The standard coders for both double and integral types do not respect
>>> the natural ordering (consider negative values). KV coders violate the
>>> "natural" lexicographic ordering on components as well. I think
>>> implicitly sorting on encoded value would yield many surprises. (The
>>> state, of course, could take a order-preserving, bytes
>>> (string?)-producing callable as a parameter of course). (As for
>>> naming, I'd probably call this OrderedBagState or something like
>>> that...rather than Map which tends to imply random access.)
>>>
>>

Re: DISCUSS: Sorted MapState API

Reply via email to