On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles <k...@apache.org> wrote:
>
> On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles <k...@apache.org> wrote:
>>
>> On Fri, May 24, 2019 at 8:14 AM Reuven Lax <re...@google.com> wrote:
>>>
>>> Some great comments!
>>>
>>> Aljoscha: absolutely this would have to be implemented by runners to be
>>> efficient. We can of course provide a default (inefficient) implementation,
>>> but ideally runners would provide better ones.
>>>
>>> Jan: Exactly. I think MapState can be dropped or backed by this. E.g.
>>>
>>> Robert: Great point about standard coders not satisfying this. That's why I
>>> suggested that we provide a way to tag the coders that do preserve order,
>>> and only accept those as key coders. Alternatively we could present a more
>>> limited API - e.g. only allowing a hard-coded set of types to be used as
>>> keys - but that seems counter to the direction Beam usually goes.
>>
>> I think we got it right with GroupByKey: the encoded form of a key is
>> authoritative/portable.
>>
>> Instead of seeing the in-language type as the "real" value and the coder as
>> a way to serialize it, the portable encoded bytestring is the "real" value
>> and the representation in a particular SDK is the responsibility of the SDK.
>>
>> This is very important, because many in-language representations, especially
>> low-level representations, do not have the desired equality. For example,
>> Java arrays. Coder.structuralValue(...) is required to have equality that
>> matches the equality of the encoded form. It can be a no-op if the
>> in-language equality already matches, or it can be a full encoding if there
>> is not a more efficient option. I think we could add another method
>> "lexicalValue", or add the requirement that structuralValue also sort
>> equivalently to the wire format.
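The structuralValue contract described above can be illustrated outside Beam. A minimal Python sketch (not the Beam API; `IntArray`, `IntArrayCoder`, and `structural_value` are hypothetical names invented for the example) of a type whose in-language equality is identity-based, like Java arrays, where the encoded form and the structural value agree:

```python
import struct

class IntArray:
    """In-language value with identity equality (analogous to Java arrays)."""
    def __init__(self, values):
        self.values = list(values)
    # Deliberately no __eq__: two IntArrays with identical contents compare unequal.

class IntArrayCoder:
    """Hypothetical coder: the encoded bytestring is the authoritative value."""
    def encode(self, arr):
        out = struct.pack(">I", len(arr.values))  # element count
        for v in arr.values:
            out += struct.pack(">i", v)           # fixed-width big-endian ints
        return out

    def structural_value(self, arr):
        # Required property: equality of structural values matches equality
        # of encoded forms. A tuple of contents suffices here; a full
        # encode() would also work, just less efficiently.
        return tuple(arr.values)

a, b = IntArray([1, 2, 3]), IntArray([1, 2, 3])
coder = IntArrayCoder()
assert a != b                                    # in-language equality is wrong for grouping
assert coder.encode(a) == coder.encode(b)        # encoded forms agree
assert coder.structural_value(a) == coder.structural_value(b)
```

The proposed "lexicalValue" would strengthen this from equality to ordering: comparing structural values would have to agree with byte-wise comparison of encodings.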
>>
>> Now, since many of our wire formats do not sort in the natural mathematical
>> order, SDKs should help users avoid the pitfall of using these, as we do for
>> GBK by checking "determinism" of the coder. Note that I am *not* referring
>> to the order they would sort in any particular programming language
>> implementation.
>
> Another related feature request. Today we do this:
>
> (1) infer any coder for a type
> (2) crash if it is not suitable for GBK
>
> I have proposed in the past that instead we:
>
> (1) notice that a PCollection is input to a GBK
> (2) infer an appropriate coder for a type that will work with GBK
>
> This generalizes to the idea of registering multiple coders for different
> purposes, particularly inferring a coder that has good lexical sorting.
>
> I don't recall that there was any objection, but neither I nor anyone else
> has gotten around to working on this. It does have the problem that it could
> change the coders and impact pipeline update. I suggest that update is
> fragile enough that we should develop pipeline options that allow opt-in to
> improvements without a major version bump.
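To make the ordering pitfall concrete: a small Python sketch (not Beam code) contrasting a fixed-width two's-complement encoding, whose bytes do not sort in natural numeric order, with a hypothetical "lexical" upgrade that flips the sign bit so byte-wise order matches numeric order:

```python
import struct

def twos_complement_be(v):
    """Fixed-width big-endian two's complement, as in many wire formats."""
    return struct.pack(">i", v)

def order_preserving(v):
    """Hypothetical 'lexical' upgrade: biasing by 2**31 (i.e. flipping the
    sign bit) makes unsigned byte-wise comparison agree with numeric order."""
    return struct.pack(">I", (v + 2**31) & 0xFFFFFFFF)

# -1 encodes as ff ff ff ff, which sorts *after* 1's 00 00 00 01:
assert twos_complement_be(-1) > twos_complement_be(1)

# After the sign-bit flip, sorting by encoded bytes matches numeric order:
vals = [-5, -1, 0, 1, 42]
assert sorted(vals, key=order_preserving) == vals
```

A coder tagged as order-preserving in the sense discussed above would have to guarantee this property for its whole value domain, which is exactly what most standard varint/two's-complement encodings fail to do.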
It also has the difficulty that coders cannot be inferred until one knows all
their consumers (which breaks a transform being able to inspect or act on the
coder(s) of its input(s)). Possibly, if we move to something more generic, like
specifying element types/schemas on collections and then doing
whole-pipeline analysis to infer all the coders, this would be solvable (but
potentially backwards incompatible, and fraught with difficulties if users (aka
transform authors) specify their own coders, e.g. when inference fails, or as
part of inference code).

Another way to look at this is that there are certain transformations
("upgrades") one can do to coders when they need certain properties, which
change the encoded form but preserve the types. Upgrading a coder C to its
length-prefixed one is one such operation. Upgrading a coder to a deterministic
version, or to a natural-order-preserving one, are two other possible
transformations (which should be idempotent, may be the identity, and may be an
error).
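The "upgrade" idea can be sketched as a function over encoders. `length_prefixed` below is a hypothetical Python illustration (not a Beam API): it changes the wire format without changing the element type, and it demonstrates the idempotence requirement by returning an already-upgraded encoder unchanged:

```python
import struct

def length_prefixed(encode):
    """Hypothetical coder upgrade: prefix every encoded value with its byte
    length, making the encoding self-delimiting in a concatenated stream."""
    if getattr(encode, "_length_prefixed", False):
        return encode  # idempotent: upgrading twice is the identity
    def wrapped(value):
        payload = encode(value)
        return struct.pack(">I", len(payload)) + payload
    wrapped._length_prefixed = True
    return wrapped

encode_utf8 = lambda s: s.encode("utf-8")
lp = length_prefixed(encode_utf8)
assert lp("hi") == b"\x00\x00\x00\x02hi"
assert length_prefixed(lp) is lp  # re-upgrading is a no-op
```

A deterministic or order-preserving upgrade would have the same shape, except that for some input coders it would be the identity (the coder already has the property) and for others an error (no such encoding of that type exists).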