On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles <k...@apache.org> wrote:
> > > On Fri, May 24, 2019 at 8:14 AM Reuven Lax <re...@google.com> wrote: > >> Some great comments! >> >> *Aljoscha*: absolutely this would have to be implemented by runners to >> be efficient. We can of course provide a default (inefficient) >> implementation, but ideally runners would provide better ones. >> >> *Jan* Exactly. I think MapState can be dropped or backed by this. E.g. >> >> *Robert* Great point about standard coders not satisfying this. That's >> why I suggested that we provide a way to tag the coders that do preserve >> order, and only accept those as key coders Alternatively we could present a >> more limited API - e.g. only allowing a hard-coded set of types to be used >> as keys - but that seems counter to the direction Beam usually goes. >> > > I think we got it right with GroupByKey: the encoded form of a key is > authoritative/portable. > > Instead of seeing the in-language type as the "real" value and the coder a > way to serialize it, the portable encoded bytestring is the "real" value > and the representation in a particular SDK is the responsibility of the SDK. > > This is very important, because many in-language representations, > especially low-level representations, do not have the desired equality. For > example, Java arrays. Coder.structuralValue(...) is required to have > equality that matches the equality of the encoded form. It can be a noop if > the in-language equality already matches. Or it can be a full encoding if > there is not a more efficient option. I think we could add another method > "lexicalValue" or add the requirement that structuralValue also sort > equivalently to the wire format. > > Now, since many of our wire formats do not sort in the natural > mathematical order, SDKs should help users avoid the pitfall of using > these, as we do for GBK by checking "determinism" of the coder. Note that I > am *not* referring to the order they would sort in any particular > programming language implementation. > Another related features request. Today we do this: (1) infer any coder for a type (2) crash if it is not suitable for GBK I have proposed in the past that instead we: (1) notice that a PCollection is input to a GBK (2) infer an appropriate coder for a type that will work with GBK This generalizes to the idea of registering multiple coders for different purposes, particularly inferring a coder that has good lexical sorting. I don't recall that there was any objection, but neither I nor anyone else has gotten around to working on this. It does have the problem that it could change the coders and impact pipeline update. I suggest that update is fragile enough that we should develop pipeline options that allow opt-in to improvements without a major version bump. Kenn > > Kenn > > >> So users will have two ways .of creating multimap state specs: >> >> private final StateSpec<MultimapState<Long, String>> state = >> StateSpecs.multimap(VarLongCoder.of(), StringUtf8Coder.of()); >> >> or >> private final StateSpec<MultimapState<Long, String>> state = >> StateSpecs.orderedMultimap(VarLongCoder.of(), StringUtf8Coder.of()); >> >> The second one will validate that the key coder preserves order, and >> fails otherwise (similar to coder determinism checking in GroupByKey). (BTW >> we would also have versions of these functions that use coder inference to >> "guess" the coder, but those will do the same checking) >> >> Also the API I proposed did support random access! We could separate out >> OrderedBagState again if we think the use cases are fundamentally >> different. I merged the proposal into that of MultimapState because there >> seemed be 99% overlap. >> >> Reuven >> >> On Fri, May 24, 2019 at 6:19 AM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> On Fri, May 24, 2019 at 5:32 AM Reuven Lax <re...@google.com> wrote: >>> > >>> > On Thu, May 23, 2019 at 1:53 PM Ahmet Altay <al...@google.com> wrote: >>> >> >>> >> >>> >> >>> >> On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik <lc...@google.com> wrote: >>> >>> >>> >>> >>> >>> >>> >>> On Thu, May 23, 2019 at 11:37 AM Rui Wang <ruw...@google.com> wrote: >>> >>>>> >>> >>>>> A few obvious problems with this code: >>> >>>>> 1. Removing the elements already processed from the bag requires >>> clearing and rewriting the entire bag. This is O(n^2) in the number of >>> input trades. >>> >>>> >>> >>>> why it's not O(2 * n) to clearing and rewriting trade state? >>> >>>> >>> >>>>> >>> >>>>> public interface SortedMultimapState<K, V> extends State { >>> >>>>> // Add a value to the map. >>> >>>>> void put(K key, V value); >>> >>>>> // Get all values for a given key. >>> >>>>> ReadableState<Iterable<V>> get(K key); >>> >>>>> // Return all entries in the map. >>> >>>>> ReadableState<Iterable<KV<K, V>>> allEntries(); >>> >>>>> // Return all entries in the map with keys <= limit. returned >>> elements are sorted by the key. >>> >>>>> ReadableState<Iterable<KV<K, V>>> entriesUntil(K limit); >>> >>>>> >>> >>>>> // Remove all values with the given key; >>> >>>>> void remove(K key); >>> >>>>> // Remove all entries in the map with keys <= limit. >>> >>>>> void removeUntil(K limit); >>> >>>> >>> >>>> Will removeUntilExcl(K limit) also useful? It will remove all >>> entries in the map with keys < limit. >>> >>>> >>> >>>>> >>> >>>>> Runners will sort based on the encoded value of the key. In order >>> to make this easier for users, I propose that we introduce a new tag on >>> Coders PreservesOrder. A Coder that contains this tag guarantees that the >>> encoded value preserves the same ordering as the base Java type. >>> >>>> >>> >>>> >>> >>>> Could you clarify what is "encoded value preserves the same >>> ordering as the base Java type"? >>> >>> >>> >>> >>> >>> Lets say A and B represent two different instances of the same Java >>> type like a double, then A < B (using the languages comparison operator) >>> iff encode(A) < encode(B) (note the encoded versions are compared >>> lexicographically) >>> >> >>> >> >>> >> Since coders are shared across SDKs, do we expect A < B iff e(A) < >>> e(P) property to hold for all languages we support? What happens A, B sort >>> differently in different languages? >>> > >>> > >>> > That would have to be the property of the coder (which means that this >>> property probably needs to be represented in the portability representation >>> of the coder). I imagine the common use cases will be for simple coders >>> like int, long, string, etc., which are likely to sort the same in most >>> languages. >>> >>> The standard coders for both double and integral types do not respect >>> the natural ordering (consider negative values). KV coders violate the >>> "natural" lexicographic ordering on components as well. I think >>> implicitly sorting on encoded value would yield many surprises. (The >>> state, of course, could take a order-preserving, bytes >>> (string?)-producing callable as a parameter of course). (As for >>> naming, I'd probably call this OrderedBagState or something like >>> that...rather than Map which tends to imply random access.) >>> >>