If we want to have a pipeline-level registry, we could add it to Components [1]:

message Components {
  ...
  map<string, LogicalType> logical_types;
}

And in FieldType, reference the logical types by id:

oneof field_type {
  AtomicType atomic_type;
  ArrayType array_type;
  ...
  string logical_type_id; // was LogicalType logical_type;
}

I'm not sure I like this idea though. The reason we started discussing a "registry" was to separate the SDK-specific bits from the representation type, and this doesn't accomplish that; it just de-dupes logical types used across the pipeline. I think instead I'd rather just come back to the message we have now in the doc, used directly in FieldType's oneof:

message LogicalType {
  FieldType representation = 1;
  string logical_urn = 2;
  bytes logical_payload = 3;
}

We can have a URN for SDK-specific types (user type aliases), like "beam:logical:javasdk", and the logical_payload could itself be a protobuf with attributes of 1) a serialized class and 2/3) to/from functions. For truly portable types it would instead have a well-known URN and optionally a logical_payload with some agreed-upon representation of parameters.

It seems like maybe SdkFunctionSpec/Environment should be used for this somehow, but I can't find a good example of this in the Runner API to use as a model. What we're trying to accomplish is basically the same as Java custom coders vs. standard coders, but that is accomplished with a magic "javasdk" URN, as I suggested here, not with Environment [2,3]. There is a "TODO: standardize such things" where that URN is defined; is it possible that Environment is that standard and just hasn't been utilized for custom coders yet?
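For concreteness, a payload for that "beam:logical:javasdk" URN might look something like the sketch below. This is purely illustrative; the message and field names are made up, not part of any agreed proto, and the function encoding is left as opaque bytes:

```protobuf
syntax = "proto3";

// Hypothetical payload for logical_urn = "beam:logical:javasdk".
// All names here are illustrative placeholders, not a settled design.
message JavaSdkLogicalTypePayload {
  // 1) The serialized Java class of the user type.
  bytes serialized_class = 1;
  // 2/3) UDFs converting between the user type and its representation,
  // encoded however the Java SDK serializes functions today.
  bytes to_representation_fn = 2;
  bytes from_representation_fn = 3;
}
```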
Brian

[1] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
[2] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
[3] https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121

On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette <bhule...@google.com> wrote: > Yeah that's what I meant. It does seem reasonable to scope any > registry by pipeline and not by PCollection. Then it seems we would want > the entire LogicalType (including the `FieldType representation` field) as > the value type, and not just LogicalTypeConversion. Otherwise we're > separating the representations from the conversions, and duplicating the > representations. You did say a "registry of logical types", so maybe that > is what you meant. > > Brian > > On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax <re...@google.com> wrote: > >> >> >> On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <bhule...@google.com> wrote: >> >>> >>> >>> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <re...@google.com> wrote: >>> >>>> >>>> >>>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <bhule...@google.com> >>>> wrote: >>>> >>>>> > It has to go into the proto somewhere (since that's the only way >>>>> the SDK can get it), but I'm not sure they should be considered integral >>>>> parts of the type. >>>>> Are you just advocating for an approach where any SDK-specific >>>>> information is stored outside of the Schema message itself so that Schema >>>>> really does just represent the type? That seems reasonable to me, and >>>>> alleviates my concerns about how this applies to columnar encodings a bit >>>>> as well. >>>>> >>>> >>>> Yes, that's exactly what I'm advocating. 
>>>> >>>> >>>>> >>>>> We could lift all of the LogicalTypeConversion messages out of the >>>>> Schema and the LogicalType like this: >>>>> >>>>> message SchemaCoder { >>>>> Schema schema = 1; >>>>> LogicalTypeConversion root_conversion = 2; >>>>> map<string, LogicalTypeConversion> attribute_conversions = 3; // >>>>> only necessary for user type aliases, portable logical types by definition >>>>> have nothing SDK-specific >>>>> } >>>>> >>>> >>>> I'm not sure what the map is for? I think we have status quo without it. >>>> >>> >>> My intention was that the SDK-specific information (to/from functions) >>> for any nested fields that are themselves user type aliases would be stored >>> in this map. That was the motivation for my next question: if we don't >>> allow user types to be nested within other user types we may not need it. >>> >> >> Oh, is this meant to contain the ids of all the logical types in this >> schema? If so I don't think SchemaCoder is the right place for this. Any >> "registry" of logical types should be global to the pipeline, not scoped to >> a single PCollection IMO. >> >> >>> I may be missing your meaning - but I think we currently only have >>> status quo without this map in the Java SDK because Schema.LogicalType is >>> just an interface that must be implemented. It's appropriate for just >>> portable logical types, not user-type aliases. Note I've adopted Kenn's >>> terminology where portable logical type is a type that can be identified by >>> just a URN and maybe some parameters, while a user type alias needs some >>> SDK specific information, like a class and to/from UDFs. >>> >>> >>>> >>>>> I think a critical question (that has implications for the above >>>>> proposal) is how/if the two different concepts Kenn mentioned are allowed >>>>> to nest. 
For example, you could argue it's redundant to have a user type >>>>> alias that has a Row representation with a field that is itself a user >>>>> type >>>>> alias, because instead you could just have a single top-level type alias >>>>> with to/from functions that pack and unpack the entire hierarchy. On the >>>>> other hand, I think it does make sense for a user type alias or a truly >>>>> portable logical type to have a field that is itself a truly portable >>>>> logical type (e.g. a user type alias or portable type with a DateTime). >>>>> >>>>> I've been assuming that user-type aliases could be nested, but should >>>>> we disallow that? Or should we go the other way and require that logical >>>>> types define at most one "level"? >>>>> >>>> >>>> No I think it's useful to allow things to be nested (though of course >>>> the nesting must terminate). >>>> >>> >>>> >>>>> >>>>> Brian >>>>> >>>>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <k...@apache.org> >>>>> wrote: >>>>> >>>>>> >>>>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <re...@google.com> wrote: >>>>>> >>>>>>> So I feel a bit leery about making the to/from functions a >>>>>>> fundamental part of the portability representation. In my mind, that is >>>>>>> very tied to a specific SDK/language. A SDK (say the Java SDK) wants to >>>>>>> allow users to use a wide variety of native types with schemas, and >>>>>>> under >>>>>>> the covers uses the to/from functions to implement that. However from >>>>>>> the >>>>>>> portable Beam perspective, the schema itself should be the real "type" >>>>>>> of >>>>>>> the PCollection; the to/from methods are simply a way that a particular >>>>>>> SDK >>>>>>> makes schemas easier to use. It has to go into the proto somewhere >>>>>>> (since >>>>>>> that's the only way the SDK can get it), but I'm not sure they should be >>>>>>> considered integral parts of the type. 
>>>>>>> >>>>>> >>>>>> On the doc in a couple places this distinction was made: >>>>>> >>>>>> * For truly portable logical types, no instructions for the SDK are >>>>>> needed. Instead, they require: >>>>>> - URN: a standardized identifier any SDK can recognize >>>>>> - A spec: what is the universe of values in this type? >>>>>> - A representation: how is it represented in built-in types? This >>>>>> is how SDKs who do not know/care about the URN will process it >>>>>> - (optional): SDKs choose preferred SDK-specific types to embed >>>>>> the values in. SDKs have to know about the URN and choose for themselves. >>>>>> >>>>>> * For user-level type aliases, written as convenience by the user in >>>>>> their pipeline, what Java schemas have today: >>>>>> - to/from UDFs: the code is SDK-specific >>>>>> - some representation of the intended type (like java class): also >>>>>> SDK specific >>>>>> - a representation >>>>>> - any "id" is just like other ids in the pipeline, just avoiding >>>>>> duplicating the proto >>>>>> - Luke points out that nesting these can give multiple SDKs a hint >>>>>> >>>>>> In my mind the remaining complexity is whether or not we need to be >>>>>> able to move between the two. Composite PTransforms, for example, do have >>>>>> fluidity between being strictly user-defined versus portable URN+payload. >>>>>> But it requires lots of engineering, namely the current work on expansion >>>>>> service. >>>>>> >>>>>> Kenn >>>>>> >>>>>> >>>>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <bhule...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Ah I see, I didn't realize that. Then I suppose we'll need to/from >>>>>>>> functions somewhere in the logical type conversion to preserve the >>>>>>>> current >>>>>>>> behavior. >>>>>>>> >>>>>>>> I'm still a little hesitant to make these functions an explicit >>>>>>>> part of LogicalTypeConversion for another reason. 
Down the road, >>>>>>>> schemas >>>>>>>> could give us an avenue to use a batched columnar format (presumably >>>>>>>> arrow, >>>>>>>> but of course others are possible). By making to/from an explicit part >>>>>>>> of >>>>>>>> logical types we add some element-wise logic to a schema representation >>>>>>>> that's otherwise ambivalent to element-wise vs. batched encodings. >>>>>>>> >>>>>>>> I suppose you could make an argument that to/from are only for >>>>>>>> custom types. There will also be some set of well-known types >>>>>>>> identified >>>>>>>> only by URN and some parameters, which could easily be translated to a >>>>>>>> columnar format. We could just not support custom types fully if we >>>>>>>> add a >>>>>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions >>>>>>>> when/if we get there. >>>>>>>> >>>>>>>> What about something like this that makes the two different types >>>>>>>> of logical types explicit? >>>>>>>> >>>>>>>> // Describes a logical type and how to convert between it and its >>>>>>>> representation (e.g. Row). >>>>>>>> message LogicalTypeConversion { >>>>>>>> oneof conversion { >>>>>>>> Standard standard = 1; >>>>>>>> Custom custom = 2; >>>>>>>> } >>>>>>>> >>>>>>>> message Standard { >>>>>>>> string urn = 1; >>>>>>>> repeated string args = 2; // could also be a map >>>>>>>> } >>>>>>>> >>>>>>>> message Custom { >>>>>>>> FunctionSpec(?) toRepresentation = 1; >>>>>>>> FunctionSpec(?) fromRepresentation = 2; >>>>>>>> bytes type = 3; // e.g. serialized class for Java >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> And LogicalType and Schema become: >>>>>>>> >>>>>>>> message LogicalType { >>>>>>>> FieldType representation = 1; >>>>>>>> LogicalTypeConversion conversion = 2; >>>>>>>> } >>>>>>>> >>>>>>>> message Schema { >>>>>>>> ... 
>>>>>>>> repeated Field fields = 1; >>>>>>>> LogicalTypeConversion conversion = 2; // implied that >>>>>>>> representation is Row >>>>>>>> } >>>>>>>> >>>>>>>> Brian >>>>>>>> >>>>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <re...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Keep in mind that right now the SchemaRegistry is only assumed to >>>>>>>>> exist at graph-construction time, not at execution time; all >>>>>>>>> information in >>>>>>>>> the schema registry is embedded in the SchemaCoder, which is the only >>>>>>>>> thing >>>>>>>>> we keep around when the pipeline is actually running. We could look >>>>>>>>> into >>>>>>>>> changing this, but it would potentially be a very big change, and I do >>>>>>>>> think we should start getting users actively using schemas soon. >>>>>>>>> >>>>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <bhule...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> > Can you propose what the protos would look like in this case? >>>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>>> functions in >>>>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>>>> >>>>>>>>>> Maybe. Right now the proposed LogicalType message is pretty >>>>>>>>>> simple/generic: >>>>>>>>>> message LogicalType { >>>>>>>>>> FieldType representation = 1; >>>>>>>>>> string logical_urn = 2; >>>>>>>>>> bytes logical_payload = 3; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> If we keep just logical_urn and logical_payload, the >>>>>>>>>> logical_payload could itself be a protobuf with attributes of 1) a >>>>>>>>>> serialized class and 2/3) to/from functions. Or, alternatively, we >>>>>>>>>> could >>>>>>>>>> have a generalization of the SchemaRegistry for logical types. >>>>>>>>>> Implementations for standard types and user-defined types would be >>>>>>>>>> registered by URN, and the SDK could look them up given just a URN. >>>>>>>>>> I put a >>>>>>>>>> brief section about this alternative in the doc last week [1]. 
What I >>>>>>>>>> suggested there included removing the logical_payload field, which is >>>>>>>>>> probably overkill. The critical piece is just relying on a registry >>>>>>>>>> in the >>>>>>>>>> SDK to look up types and to/from functions rather than storing them >>>>>>>>>> in the >>>>>>>>>> portable schema itself. >>>>>>>>>> >>>>>>>>>> I kind of like keeping the LogicalType message generic for now, >>>>>>>>>> since it gives us a way to try out these various approaches, but >>>>>>>>>> maybe >>>>>>>>>> that's just a cop out. >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy >>>>>>>>>> >>>>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette < >>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette < >>>>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base >>>>>>>>>>>>>> type of Row and we should think about that. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm a little concerned that the current proposals for a >>>>>>>>>>>>>> portable representation don't actually fully represent Schemas. >>>>>>>>>>>>>> It seems to >>>>>>>>>>>>>> me that the current java-only Schemas are made up three concepts >>>>>>>>>>>>>> that are >>>>>>>>>>>>>> intertwined: >>>>>>>>>>>>>> (a) The Java SDK specific code for schema inference, type >>>>>>>>>>>>>> coercion, and "schema-aware" transforms. >>>>>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a >>>>>>>>>>>>>> particular Schema[3]. 
>>>>>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a >>>>>>>>>>>>>> particular schema, and functions for converting Rows with that >>>>>>>>>>>>>> schema >>>>>>>>>>>>>> to/from a Java type T. Those functions and the RowCoder are then >>>>>>>>>>>>>> composed >>>>>>>>>>>>>> to provide a Coder for the type T. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> RowCoder is currently just an internal implementation detail; >>>>>>>>>>>>> it can be eliminated. SchemaCoder is the only thing that >>>>>>>>>>>>> determines a >>>>>>>>>>>>> schema today. >>>>>>>>>>>>> >>>>>>>>>>>> Why not keep it around? I think it would make sense to have a >>>>>>>>>>>> RowCoder implementation in every SDK, as well as something like >>>>>>>>>>>> SchemaCoder >>>>>>>>>>>> that defines a conversion from that SDK's "Row" to the language >>>>>>>>>>>> type. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> The point is that from a programmer's perspective, there is >>>>>>>>>>> nothing much special about Row. Any type can have a schema, and the >>>>>>>>>>> only >>>>>>>>>>> special thing about Row is that it's always guaranteed to exist. >>>>>>>>>>> From that >>>>>>>>>>> standpoint, Row is nearly an implementation detail. Today RowCoder >>>>>>>>>>> is never >>>>>>>>>>> set on _any_ PCollection, it's literally just used as a helper >>>>>>>>>>> library, so >>>>>>>>>>> there's no real need for it to exist as a "Coder." >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> We're not concerned with (a) at this time since that's >>>>>>>>>>>>>> specific to the SDK, not the interface between them. My >>>>>>>>>>>>>> understanding is we >>>>>>>>>>>>>> just want to define a portable representation for (b) and/or (c). >>>>>>>>>>>>>> >>>>>>>>>>>>>> What has been discussed so far is really just a portable >>>>>>>>>>>>>> representation for (b), the RowCoder, since the discussion is >>>>>>>>>>>>>> only around >>>>>>>>>>>>>> how to represent the schema itself and not the to/from functions. 
>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Correct. The to/from functions are actually related to a). One >>>>>>>>>>>>> of the big goals of schemas was that users should not be forced >>>>>>>>>>>>> to operate >>>>>>>>>>>>> on rows to get schemas. A user can create >>>>>>>>>>>>> PCollection<MyRandomType> and as >>>>>>>>>>>>> long as the SDK can infer a schema from MyRandomType, the user >>>>>>>>>>>>> never needs >>>>>>>>>>>>> to even see a Row object. The to/fromRow functions are what make >>>>>>>>>>>>> this work >>>>>>>>>>>>> today. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> One of the points I'd like to make is that this type coercion >>>>>>>>>>>> is a useful concept on its own, separate from schemas. It's >>>>>>>>>>>> especially >>>>>>>>>>>> useful for a type that has a schema and is encoded by RowCoder >>>>>>>>>>>> since that >>>>>>>>>>>> can represent many more types, but the type coercion doesn't have >>>>>>>>>>>> to be >>>>>>>>>>>> tied to just schemas and RowCoder. We could also do type coercion >>>>>>>>>>>> for types >>>>>>>>>>>> that are effectively wrappers around an integer or a string. It >>>>>>>>>>>> could just >>>>>>>>>>>> be a general way to map language types to base types (i.e. types >>>>>>>>>>>> that we >>>>>>>>>>>> have a coder for). Then it just becomes a general framework for >>>>>>>>>>>> extending >>>>>>>>>>>> coders to represent more language types. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Let's not tie those conversations. Maybe a similar concept will >>>>>>>>>>> hold true for general coders (or we might decide to get rid of >>>>>>>>>>> coders in >>>>>>>>>>> favor of schemas, in which case that becomes moot), but I don't >>>>>>>>>>> think we >>>>>>>>>>> should prematurely generalize. 
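The general coercion idea discussed just above (mapping a wrapper language type to a base type we already have a coder for, without involving Row at all) can be sketched in plain Java. This is an illustration only, not Beam's API; every name here is made up:

```java
import java.util.function.Function;

public class CoercionSketch {
    // A minimal stand-in for a logical-type conversion: a pair of to/from
    // functions between a language type T and a base type B we can encode.
    static final class Conversion<T, B> {
        final Function<T, B> toBase;
        final Function<B, T> fromBase;
        Conversion(Function<T, B> toBase, Function<B, T> fromBase) {
            this.toBase = toBase;
            this.fromBase = fromBase;
        }
    }

    // A hypothetical user type that is effectively a wrapper around a long.
    static final class UserId {
        final long value;
        UserId(long value) { this.value = value; }
    }

    public static void main(String[] args) {
        // Encode UserId as a plain long -- a base type we already have a
        // coder for -- with no schema or Row involved.
        Conversion<UserId, Long> conv =
            new Conversion<>(id -> id.value, UserId::new);

        long encoded = conv.toBase.apply(new UserId(42L));
        UserId decoded = conv.fromBase.apply(encoded);
        System.out.println(encoded + " " + decoded.value); // prints "42 42"
    }
}
```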
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> One of the outstanding questions for that schema >>>>>>>>>>>>>> representation is how to represent logical types, which may or >>>>>>>>>>>>>> may not have >>>>>>>>>>>>>> some language type in each SDK (the canonical example being a >>>>>>>>>>>>>> timestamp type with seconds and nanos and java.time.Instant). I >>>>>>>>>>>>>> think this >>>>>>>>>>>>>> question is critically important, because (c), the SchemaCoder, >>>>>>>>>>>>>> is actually >>>>>>>>>>>>>> *defining a logical type* with a language type T in the Java >>>>>>>>>>>>>> SDK. This >>>>>>>>>>>>>> becomes clear when you compare SchemaCoder[4] to the >>>>>>>>>>>>>> Schema.LogicalType >>>>>>>>>>>>>> interface[5] - both essentially have three attributes: a base >>>>>>>>>>>>>> type, and two >>>>>>>>>>>>>> functions for converting to/from that base type. The only >>>>>>>>>>>>>> difference is for >>>>>>>>>>>>>> SchemaCoder that base type must be a Row so it can be >>>>>>>>>>>>>> represented by a >>>>>>>>>>>>>> Schema alone, while LogicalType can have any base type that can >>>>>>>>>>>>>> be >>>>>>>>>>>>>> represented by FieldType, including a Row. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> This is not true actually. SchemaCoder can have any base type, >>>>>>>>>>>>> that's why (in Java) it's SchemaCoder<T>. This is why >>>>>>>>>>>>> PCollection<T> can >>>>>>>>>>>>> have a schema, even if T is not Row. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> I'm not sure I effectively communicated what I meant - When I >>>>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was >>>>>>>>>>>> referring to >>>>>>>>>>>> the base FieldType, whose coder we use for this type. I meant >>>>>>>>>>>> "base type" >>>>>>>>>>>> to be analogous to LogicalType's `getBaseType`, or what Kenn is >>>>>>>>>>>> suggesting >>>>>>>>>>>> we call "representation" in the portable beam schemas doc. 
To >>>>>>>>>>>> define some >>>>>>>>>>>> terms from my original message: >>>>>>>>>>>> base type = an instance of FieldType, crucially this is >>>>>>>>>>>> something that we have a coder for (be it VarIntCoder, Utf8Coder, >>>>>>>>>>>> RowCoder, >>>>>>>>>>>> ...) >>>>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java >>>>>>>>>>>> class (or something analogous in the other SDKs) that we may or >>>>>>>>>>>> may not >>>>>>>>>>>> have a coder for. It's possible to define functions for converting >>>>>>>>>>>> instances of the language type to/from the base type. >>>>>>>>>>>> >>>>>>>>>>>> I was just trying to make the case that SchemaCoder is really a >>>>>>>>>>>> special case of LogicalType, where `getBaseType` always returns a >>>>>>>>>>>> Row with >>>>>>>>>>>> the stored Schema. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Yeah, I think I got that point. >>>>>>>>>>> >>>>>>>>>>> Can you propose what the protos would look like in this case? >>>>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>>>> functions in >>>>>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to >>>>>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations of >>>>>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say we >>>>>>>>>>>> should >>>>>>>>>>>> or shouldn't do this, just using it to illustrate my point): >>>>>>>>>>>> >>>>>>>>>>>> class SchemaCoder<T> extends CustomCoder<T> implements >>>>>>>>>>>> Schema.LogicalType<T, Row> { >>>>>>>>>>>> ... 
>>>>>>>>>>>> >>>>>>>>>>>> @Override >>>>>>>>>>>> FieldType getBaseType() { >>>>>>>>>>>> return FieldType.row(getSchema()); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> @Override >>>>>>>>>>>> public Row toBaseType(T input) { >>>>>>>>>>>> return this.toRowFunction.apply(input); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> @Override >>>>>>>>>>>> public T toInputType(Row base) { >>>>>>>>>>>> return this.fromRowFunction.apply(base); >>>>>>>>>>>> } >>>>>>>>>>>> ... >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> I think it may make sense to fully embrace this duality, by >>>>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and >>>>>>>>>>>>>> renaming it to >>>>>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK >>>>>>>>>>>>>> schema-aware >>>>>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a >>>>>>>>>>>>>> Row base >>>>>>>>>>>>>> type. Perhaps some of the current schema logic could also be >>>>>>>>>>>>>> applied more >>>>>>>>>>>>>> generally to any logical type - for example, to provide type >>>>>>>>>>>>>> coercion for >>>>>>>>>>>>>> logical types with a base type other than Row, like int64 and a >>>>>>>>>>>>>> timestamp >>>>>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. >>>>>>>>>>>>>> And having a >>>>>>>>>>>>>> portable representation that represents those (non Row backed) >>>>>>>>>>>>>> logical >>>>>>>>>>>>>> types with some URN would also allow us to pass them to other >>>>>>>>>>>>>> languages >>>>>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use >>>>>>>>>>>>>> SchemaCoder. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I think the actual overlap here is between the to/from >>>>>>>>>>>>> functions in SchemaCoder (which is what allows SchemaCoder<T> >>>>>>>>>>>>> where T != >>>>>>>>>>>>> Row) and the equivalent functionality in LogicalType. However >>>>>>>>>>>>> making all of >>>>>>>>>>>>> schemas simply just a logical type feels a bit awkward and >>>>>>>>>>>>> circular to me. 
>>>>>>>>>>>>> Maybe we should refactor that part out into a >>>>>>>>>>>>> LogicalTypeConversion proto, >>>>>>>>>>>>> and reference that from both LogicalType and from SchemaCoder? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> LogicalType is already potentially circular though. A schema >>>>>>>>>>>> can have a field with a logical type, and that logical type can >>>>>>>>>>>> have a base >>>>>>>>>>>> type of Row with a field with a logical type (and on and on...). >>>>>>>>>>>> To me it >>>>>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is just >>>>>>>>>>>> a special >>>>>>>>>>>> case of this concept. >>>>>>>>>>>> >>>>>>>>>>>> Something like the LogicalTypeConversion proto would definitely >>>>>>>>>>>> be an improvement, but I would still prefer just using a top-level >>>>>>>>>>>> logical >>>>>>>>>>>> type :) >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I've added a section to the doc [6] to propose this >>>>>>>>>>>>>> alternative in the context of the portable representation but I >>>>>>>>>>>>>> wanted to >>>>>>>>>>>>>> bring it up here as well to solicit feedback. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41 >>>>>>>>>>>>>> [2] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59 >>>>>>>>>>>>>> [3] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48 >>>>>>>>>>>>>> [4] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33 >>>>>>>>>>>>>> [5] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489 >>>>>>>>>>>>>> [6] >>>>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette < >>>>>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Ah thanks! I added some language there. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *From: *Kenneth Knowles <k...@apache.org> >>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM >>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> *From: *Brian Hulette <bhule...@google.com> >>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM >>>>>>>>>>>>>>>> *To: * <dev@beam.apache.org> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam >>>>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason >>>>>>>>>>>>>>>>> not to this was >>>>>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam >>>>>>>>>>>>>>>>> schemas. But >>>>>>>>>>>>>>>>> given that large iterables aren't currently implemented, beam >>>>>>>>>>>>>>>>> schemas look >>>>>>>>>>>>>>>>> very similar to arrow schemas. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow >>>>>>>>>>>>>>>>> schemas where possible, and maybe even copy them outright. >>>>>>>>>>>>>>>>> Arrow already >>>>>>>>>>>>>>>>> has a portable (flatbuffers) schema representation [2], and >>>>>>>>>>>>>>>>> implementations >>>>>>>>>>>>>>>>> for it in many languages that we may be able to re-use as we >>>>>>>>>>>>>>>>> bring schemas >>>>>>>>>>>>>>>>> to more SDKs (the project has Python and Go implementations). >>>>>>>>>>>>>>>>> There are a >>>>>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for the >>>>>>>>>>>>>>>>> format and >>>>>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or >>>>>>>>>>>>>>>>> not they are >>>>>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness field), >>>>>>>>>>>>>>>>> but if you >>>>>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to >>>>>>>>>>>>>>>>> the beam proto >>>>>>>>>>>>>>>>> spec. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out what >>>>>>>>>>>>>>>> the differences are and why, and conversely what the interop >>>>>>>>>>>>>>>> opportunities >>>>>>>>>>>>>>>> may be. Such sections are some of my favorite sections of >>>>>>>>>>>>>>>> design docs. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Kenn >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Brian >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E >>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> *From: *Robert Bradshaw <rober...@google.com> >>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM >>>>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> From: Reuven Lax <re...@google.com> >>>>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM >>>>>>>>>>>>>>>>>> To: dev >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > Also in the future we might be able to do optimizations >>>>>>>>>>>>>>>>>> at the runner level if at the portability layer we >>>>>>>>>>>>>>>>>> understood schemes >>>>>>>>>>>>>>>>>> instead of just raw coders. This could be things like only >>>>>>>>>>>>>>>>>> parsing a subset >>>>>>>>>>>>>>>>>> of a row (if we know only a few fields are accessed) or >>>>>>>>>>>>>>>>>> using a columnar >>>>>>>>>>>>>>>>>> data structure like Arrow to encode batches of rows across >>>>>>>>>>>>>>>>>> portability. >>>>>>>>>>>>>>>>>> This doesn't affect data semantics of course, but having a >>>>>>>>>>>>>>>>>> richer, >>>>>>>>>>>>>>>>>> more-expressive type system opens up other opportunities. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we understood >>>>>>>>>>>>>>>>>> to designate >>>>>>>>>>>>>>>>>> the type(s), right? 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw < >>>>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space >>>>>>>>>>>>>>>>>> of Coders with >>>>>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to >>>>>>>>>>>>>>>>>> materialize to something >>>>>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing >>>>>>>>>>>>>>>>>> something big here...) >>>>>>>>>>>>>>>>>> >> This may make a backwards-compatible transition >>>>>>>>>>>>>>>>>> easier. (SDK-side, the >>>>>>>>>>>>>>>>>> >> ability to reason about and operate on such types is >>>>>>>>>>>>>>>>>> of course much >>>>>>>>>>>>>>>>>> >> richer than anything Coders offer right now.) >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> From: Reuven Lax <re...@google.com> >>>>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM >>>>>>>>>>>>>>>>>> >> To: dev >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no >>>>>>>>>>>>>>>>>> coders. We could define the entire model on top of schemas. >>>>>>>>>>>>>>>>>> Today's "Coder" >>>>>>>>>>>>>>>>>> is completely equivalent to a single-field schema with a >>>>>>>>>>>>>>>>>> logical-type field >>>>>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you >>>>>>>>>>>>>>>>>> aren't forced to >>>>>>>>>>>>>>>>>> serialize into bytes). >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort that >>>>>>>>>>>>>>>>>> would be involved in such a change, I think the practical >>>>>>>>>>>>>>>>>> decision should >>>>>>>>>>>>>>>>>> be for schemas and coders to coexist for the time being. >>>>>>>>>>>>>>>>>> However when we >>>>>>>>>>>>>>>>>> start planning Beam 3.0, deprecating coders is something I >>>>>>>>>>>>>>>>>> would like to >>>>>>>>>>>>>>>>>> suggest. 
>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw < >>>>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org> >>>>>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM >>>>>>>>>>>>>>>>>> >> >> To: dev >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because I >>>>>>>>>>>>>>>>>> can be more compact. >>>>>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges >>>>>>>>>>>>>>>>>> this needs a design doc with goals and alternatives. It is an >>>>>>>>>>>>>>>>>> extraordinarily consequential model change. So in the spirit >>>>>>>>>>>>>>>>>> of doing the >>>>>>>>>>>>>>>>>> work / bias towards action, I created a quick draft at >>>>>>>>>>>>>>>>>> https://s.apache.org/beam-schemas and added everyone on >>>>>>>>>>>>>>>>>> this thread as editors. I am still in the process of writing >>>>>>>>>>>>>>>>>> this to match >>>>>>>>>>>>>>>>>> the thread. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> Thanks! Added some comments there. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use >>>>>>>>>>>>>>>>>> logcial types to represent nanos the same way Java and proto >>>>>>>>>>>>>>>>>> do. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value >>>>>>>>>>>>>>>>>> in supporting >>>>>>>>>>>>>>>>>> >> >> multiple timestamp resolutions is high enough to >>>>>>>>>>>>>>>>>> outweigh the cost. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domain of values >>>>>>>>>>>>>>>>>> for these types are different. For a language with one "int" >>>>>>>>>>>>>>>>>> or "number" >>>>>>>>>>>>>>>>>> type, that's another domain of values. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> What is the value in having different domains? 
>> >> If your data has a natural domain, chances are it doesn't line up
>> >> exactly with one of these. I guess it's for languages whose types
>> >> have specific domains? (There's also compactness in representation,
>> >> encoded and in-memory, though I'm not sure that's high.)
>> >>
>> >> > *Columnar/Arrow*: making sure we unlock the ability to take this
>> >> > path is paramount. So tying it directly to a row-oriented coder
>> >> > seems counterproductive.
>> >>
>> >> I don't think Coders are necessarily row-oriented. They are,
>> >> however, bytes-oriented. (Perhaps they need not be.) There seems to
>> >> be a lot of overlap between what Coders express in terms of element
>> >> typing information and what Schemas express, and I'd rather have one
>> >> concept if possible. Or have a clear division of responsibilities.
>> >>
>> >> > *Multimap*: what does it add over an array-valued map or
>> >> > large-iterable-valued map? (honest question, not rhetorical)
>> >>
>> >> Multimap has a different notion of what it means to contain a
>> >> value, can handle (unordered) unions of non-disjoint keys, etc.
>> >> Maybe this isn't worth a new primitive type.
>> >>
>> >> > *URN/enum for type names*: I see the case for both.
>> >> > The core types are fundamental enough that they should never
>> >> > really change - after all, proto, thrift, avro, and arrow have
>> >> > addressed this (not to mention most programming languages). Maybe
>> >> > additions once every few years. I prefer the smallest intersection
>> >> > of these schema languages. A oneof is more clear, while a URN
>> >> > emphasizes the similarity of built-in and logical types.
>> >>
>> >> Hmm... Do we have any examples of the multi-level primitive/logical
>> >> type in any of these other systems? I have a bias towards all types
>> >> being on the same footing unless there is a compelling reason to
>> >> divide things into primitive/user-defined ones.
>> >>
>> >> Here it seems like the most essential value of the primitive type
>> >> set is to describe the underlying representation, for encoding
>> >> elements in a variety of ways (notably columnar, but also
>> >> interfacing with other external systems like IOs). Perhaps, rather
>> >> than the previous suggestion of making everything a logical type of
>> >> bytes, this could be made clear by still making everything a logical
>> >> type, but renaming "TypeName" to "Representation". There would be
>> >> URNs (typically with empty payloads) for the various primitive types
>> >> (whose mapping to their representations would be the identity).
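Robert's "everything is a logical type" suggestion might look something like the following. Again, an illustrative sketch only (message names, field numbers, and the example URN are assumptions, not actual Beam proto): every FieldType carries a URN plus a Representation, and the primitive types are simply well-known URNs with empty payloads whose mapping to their representation is the identity.

```proto
// Illustrative sketch; not the actual Beam proto.
syntax = "proto3";

enum AtomicType {
  UNSPECIFIED = 0;
  INT64 = 1;
  DOUBLE = 2;
  STRING = 3;
  BYTES = 4;
}

message Representation {  // what was "TypeName"
  oneof representation {
    AtomicType atomic_type = 1;
    ArrayType array_type = 2;
    RowType row_type = 3;
  }
}

message ArrayType {
  FieldType element_type = 1;
}

message RowType {
  repeated Field fields = 1;
}

message Field {
  string name = 1;
  FieldType type = 2;
}

message FieldType {
  // For a primitive, e.g. urn = "beam:type:int64:v1" with an empty
  // payload and representation { atomic_type: INT64 }: the mapping
  // from the type to its representation is the identity.
  string urn = 1;
  bytes payload = 2;
  Representation representation = 3;
}
```

This puts built-in and user-defined types on the same footing, with the Representation field carrying exactly enough information for a runner to encode values without understanding the URN.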
>> >>
>> >> - Robert
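As a worked example of Kenn's point earlier in the thread about representing nanosecond timestamps via logical types: the logical type can sit over a plain two-field row representation, much as google.protobuf.Timestamp and java.time.Instant split an instant into epoch seconds plus subsecond nanos. The URN and field names in this text-format sketch are hypothetical.

```textproto
# Hypothetical sketch (URN and field names are assumptions): a
# nanosecond instant as a logical type over a two-field row, so any
# runner can encode it without understanding the logical URN.
logical_type {
  logical_urn: "beam:logical_type:nanos_instant:v1"
  representation {
    row_type {
      fields { name: "seconds" type { atomic_type: INT64 } }  # epoch seconds
      fields { name: "nanos" type { atomic_type: INT32 } }    # 0..999999999
    }
  }
}
```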