Wouldn't SDK specific types always be under the "coders" component instead of the logical type listing?
Offhand, having a separate normalized listing of logical schema types in the pipeline components message seems about right. Then they're unambiguous, but can also either refer to other logical types or existing coders as needed. When SDKs don't understand a given coder, the field could be just represented by a blob of bytes. On Wed, Jun 5, 2019, 11:29 PM Brian Hulette <[email protected]> wrote: > If we want to have a Pipeline level registry, we could add it to > Components [1]. > > message Components { > ... > map<string, LogicalType> logical_types; > } > > And in FieldType reference the logical types by id: > oneof field_type { > AtomicType atomic_type; > ArrayType array_type; > ... > string logical_type_id; // was LogicalType logical_type; > } > > I'm not sure I like this idea though. The reason we started discussing a > "registry" was just to separate the SDK-specific bits from the > representation type, and this doesn't accomplish that, it just de-dupes > logical types used > across the pipeline. > > I think instead I'd rather just come back to the message we have now in > the doc, used directly in FieldType's oneof: > > message LogicalType { > FieldType representation = 1; > string logical_urn = 2; > bytes logical_payload = 3; > } > > We can have a URN for SDK-specific types (user type aliases), like > "beam:logical:javasdk", and the logical_payload could itself be a protobuf > with attributes of 1) a serialized class and 2/3) to/from functions. For > truly portable types it would instead have a well-known URN and optionally > a logical_payload with some agreed-upon representation of parameters. > > It seems like maybe SdkFunctionSpec/Environment should be used for this > somehow, but I can't find a good example of this in the Runner API to use > as a model. For example, what we're trying to accomplish is basically the > same as Java custom coders vs. standard coders. 
But that is accomplished > with a magic "javasdk" URN, as I suggested here, not with Environment > [2,3]. There is a "TODO: standardize such things" where that URN is > defined; is it possible that Environment is that standard and just hasn't > been utilized for custom coders yet? > > Brian > > [1] > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54 > [2] > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542 > [3] > https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121 > > On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette <[email protected]> wrote: > >> Yeah that's what I meant. It does seem reasonable to scope any >> registry by pipeline and not by PCollection. Then it seems we would want >> the entire LogicalType (including the `FieldType representation` field) as >> the value type, and not just LogicalTypeConversion. Otherwise we're >> separating the representations from the conversions, and duplicating the >> representations. You did say a "registry of logical types", so maybe that >> is what you meant. >> >> Brian >> >> On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax <[email protected]> wrote: >> >>> >>> >>> On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <[email protected]> >>> wrote: >>> >>>> >>>> >>>> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <[email protected]> wrote: >>>> >>>>> >>>>> >>>>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <[email protected]> >>>>> wrote: >>>>> >>>>>> > It has to go into the proto somewhere (since that's the only way >>>>>> the SDK can get it), but I'm not sure they should be considered integral >>>>>> parts of the type. >>>>>> Are you just advocating for an approach where any SDK-specific >>>>>> information is stored outside of the Schema message itself so that Schema >>>>>> really does just represent the type? 
That seems reasonable to me, and >>>>>> alleviates my concerns about how this applies to columnar encodings a bit >>>>>> as well. >>>>>> >>>>> >>>>> Yes, that's exactly what I'm advocating. >>>>> >>>>> >>>>>> >>>>>> We could lift all of the LogicalTypeConversion messages out of the >>>>>> Schema and the LogicalType like this: >>>>>> >>>>>> message SchemaCoder { >>>>>> Schema schema = 1; >>>>>> LogicalTypeConversion root_conversion = 2; >>>>>> map<string, LogicalTypeConversion> attribute_conversions = 3; // >>>>>> only necessary for user type aliases, portable logical types by >>>>>> definition >>>>>> have nothing SDK-specific >>>>>> } >>>>>> >>>>> >>>>> I'm not sure what the map is for? I think we have status quo without >>>>> it. >>>> >>>> My intention was that the SDK-specific information (to/from functions) >>>> for any nested fields that are themselves user type aliases would be stored >>>> in this map. That was the motivation for my next question: if we don't >>>> allow user types to be nested within other user types, we may not need it. >>> >>> Oh, is this meant to contain the ids of all the logical types in this >>> schema? If so I don't think SchemaCoder is the right place for this. Any >>> "registry" of logical types should be global to the pipeline, not scoped to >>> a single PCollection IMO. >>> >>>> I may be missing your meaning - but I think we currently only have >>>> status quo without this map in the Java SDK because Schema.LogicalType is >>>> just an interface that must be implemented. It's appropriate for just >>>> portable logical types, not user-type aliases. Note I've adopted Kenn's >>>> terminology where portable logical type is a type that can be identified by >>>> just a URN and maybe some parameters, while a user type alias needs some >>>> SDK specific information, like a class and to/from UDFs. 
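To make the pipeline-scoped idea above concrete, here is a minimal Python sketch of such a registry. All names here are hypothetical stand-ins (`Pipeline.logical_types`, `register_logical_type`, and the example URN are illustrative, not actual Beam API):

```python
# Hypothetical sketch only -- none of these names are actual Beam API.
class LogicalType:
    """A logical type: a URN, a representation, and to/from conversions."""
    def __init__(self, urn, representation, to_repr, from_repr):
        self.urn = urn
        self.representation = representation  # e.g. "int64", "row", "bytes"
        self.to_repr = to_repr                # language type -> representation
        self.from_repr = from_repr            # representation -> language type

class Pipeline:
    """Holds a registry global to the pipeline, not per-PCollection."""
    def __init__(self):
        self.logical_types = {}               # keyed by URN/id

    def register_logical_type(self, lt):
        self.logical_types[lt.urn] = lt

# A FieldType would then reference the registry by id rather than
# embedding the SDK-specific conversion in the schema itself.
p = Pipeline()
p.register_logical_type(LogicalType(
    "beam:logical:example:v1", "int64",
    to_repr=int, from_repr=str))
lt = p.logical_types["beam:logical:example:v1"]
assert lt.from_repr(lt.to_repr("42")) == "42"
```

The key property of this shape is that two PCollections using the same logical type share one registry entry, which is the de-duping the thread discusses.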
>>>> >>>> >>>>> >>>>>> I think a critical question (that has implications for the above >>>>>> proposal) is how/if the two different concepts Kenn mentioned are allowed >>>>>> to nest. For example, you could argue it's redundant to have a user type >>>>>> alias that has a Row representation with a field that is itself a user >>>>>> type >>>>>> alias, because instead you could just have a single top-level type alias >>>>>> with to/from functions that pack and unpack the entire hierarchy. On the >>>>>> other hand, I think it does make sense for a user type alias or a truly >>>>>> portable logical type to have a field that is itself a truly portable >>>>>> logical type (e.g. a user type alias or portable type with a DateTime). >>>>>> >>>>>> I've been assuming that user-type aliases could be nested, but should >>>>>> we disallow that? Or should we go the other way and require that logical >>>>>> types define at most one "level"? >>>>>> >>>>> >>>>> No I think it's useful to allow things to be nested (though of course >>>>> the nesting must terminate). >>>>> >>>> >>>>> >>>>>> >>>>>> Brian >>>>>> >>>>>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <[email protected]> wrote: >>>>>>> >>>>>>>> So I feel a bit leery about making the to/from functions a >>>>>>>> fundamental part of the portability representation. In my mind, that is >>>>>>>> very tied to a specific SDK/language. A SDK (say the Java SDK) wants to >>>>>>>> allow users to use a wide variety of native types with schemas, and >>>>>>>> under >>>>>>>> the covers uses the to/from functions to implement that. However from >>>>>>>> the >>>>>>>> portable Beam perspective, the schema itself should be the real "type" >>>>>>>> of >>>>>>>> the PCollection; the to/from methods are simply a way that a >>>>>>>> particular SDK >>>>>>>> makes schemas easier to use. 
It has to go into the proto somewhere >>>>>>>> (since >>>>>>>> that's the only way the SDK can get it), but I'm not sure they should >>>>>>>> be >>>>>>>> considered integral parts of the type. >>>>>>>> >>>>>>> >>>>>>> On the doc in a couple places this distinction was made: >>>>>>> >>>>>>> * For truly portable logical types, no instructions for the SDK are >>>>>>> needed. Instead, they require: >>>>>>> - URN: a standardized identifier any SDK can recognize >>>>>>> - A spec: what is the universe of values in this type? >>>>>>> - A representation: how is it represented in built-in types? This >>>>>>> is how SDKs who do not know/care about the URN will process it >>>>>>> - (optional): SDKs choose preferred SDK-specific types to embed >>>>>>> the values in. SDKs have to know about the URN and choose for >>>>>>> themselves. >>>>>>> >>>>>>> * For user-level type aliases, written as a convenience by the user in >>>>>>> their pipeline, what Java schemas have today: >>>>>>> - to/from UDFs: the code is SDK-specific >>>>>>> - some representation of the intended type (like java class): >>>>>>> also SDK specific >>>>>>> - a representation >>>>>>> - any "id" is just like other ids in the pipeline, just avoiding >>>>>>> duplicating the proto >>>>>>> - Luke points out that nesting these can give multiple SDKs a hint >>>>>>> >>>>>>> In my mind the remaining complexity is whether or not we need to be >>>>>>> able to move between the two. Composite PTransforms, for example, do >>>>>>> have >>>>>>> fluidity between being strictly user-defined versus portable >>>>>>> URN+payload. >>>>>>> But it requires lots of engineering, namely the current work on >>>>>>> expansion >>>>>>> service. >>>>>>> >>>>>>> Kenn >>>>>>> >>>>>>> >>>>>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Ah I see, I didn't realize that. 
Then I suppose we'll need to/from >>>>>>>>> functions somewhere in the logical type conversion to preserve the >>>>>>>>> current >>>>>>>>> behavior. >>>>>>>>> >>>>>>>>> I'm still a little hesitant to make these functions an explicit >>>>>>>>> part of LogicalTypeConversion for another reason. Down the road, >>>>>>>>> schemas >>>>>>>>> could give us an avenue to use a batched columnar format (presumably >>>>>>>>> arrow, >>>>>>>>> but of course others are possible). By making to/from an explicit >>>>>>>>> part of >>>>>>>>> logical types we add some element-wise logic to a schema >>>>>>>>> representation >>>>>>>>> that's otherwise agnostic to element-wise vs. batched encodings. >>>>>>>>> >>>>>>>>> I suppose you could make an argument that to/from are only for >>>>>>>>> custom types. There will also be some set of well-known types >>>>>>>>> identified >>>>>>>>> only by URN and some parameters, which could easily be translated to a >>>>>>>>> columnar format. We could just not support custom types fully if we >>>>>>>>> add a >>>>>>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions >>>>>>>>> when/if we get there. >>>>>>>>> >>>>>>>>> What about something like this that makes the two different types >>>>>>>>> of logical types explicit? >>>>>>>>> >>>>>>>>> // Describes a logical type and how to convert between it and its >>>>>>>>> representation (e.g. Row). >>>>>>>>> message LogicalTypeConversion { >>>>>>>>> oneof conversion { >>>>>>>>> Standard standard = 1; >>>>>>>>> Custom custom = 2; >>>>>>>>> } >>>>>>>>> >>>>>>>>> message Standard { >>>>>>>>> string urn = 1; >>>>>>>>> repeated string args = 2; // could also be a map >>>>>>>>> } >>>>>>>>> >>>>>>>>> message Custom { >>>>>>>>> FunctionSpec(?) toRepresentation = 1; >>>>>>>>> FunctionSpec(?) fromRepresentation = 2; >>>>>>>>> bytes type = 3; // e.g. 
serialized class for Java >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> And LogicalType and Schema become: >>>>>>>>> >>>>>>>>> message LogicalType { >>>>>>>>> FieldType representation = 1; >>>>>>>>> LogicalTypeConversion conversion = 2; >>>>>>>>> } >>>>>>>>> >>>>>>>>> message Schema { >>>>>>>>> ... >>>>>>>>> repeated Field fields = 1; >>>>>>>>> LogicalTypeConversion conversion = 2; // implied that >>>>>>>>> representation is Row >>>>>>>>> } >>>>>>>>> >>>>>>>>> Brian >>>>>>>>> >>>>>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Keep in mind that right now the SchemaRegistry is only assumed to >>>>>>>>>> exist at graph-construction time, not at execution time; all >>>>>>>>>> information in >>>>>>>>>> the schema registry is embedded in the SchemaCoder, which is the >>>>>>>>>> only thing >>>>>>>>>> we keep around when the pipeline is actually running. We could look >>>>>>>>>> into >>>>>>>>>> changing this, but it would potentially be a very big change, and I >>>>>>>>>> do >>>>>>>>>> think we should start getting users actively using schemas soon. >>>>>>>>>> >>>>>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> > Can you propose what the protos would look like in this case? >>>>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>>>> functions in >>>>>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>>>>> >>>>>>>>>>> Maybe. Right now the proposed LogicalType message is pretty >>>>>>>>>>> simple/generic: >>>>>>>>>>> message LogicalType { >>>>>>>>>>> FieldType representation = 1; >>>>>>>>>>> string logical_urn = 2; >>>>>>>>>>> bytes logical_payload = 3; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> If we keep just logical_urn and logical_payload, the >>>>>>>>>>> logical_payload could itself be a protobuf with attributes of 1) a >>>>>>>>>>> serialized class and 2/3) to/from functions. 
Or, alternatively, we >>>>>>>>>>> could >>>>>>>>>>> have a generalization of the SchemaRegistry for logical types. >>>>>>>>>>> Implementations for standard types and user-defined types would be >>>>>>>>>>> registered by URN, and the SDK could look them up given just a URN. >>>>>>>>>>> I put a >>>>>>>>>>> brief section about this alternative in the doc last week [1]. What >>>>>>>>>>> I >>>>>>>>>>> suggested there included removing the logical_payload field, which >>>>>>>>>>> is >>>>>>>>>>> probably overkill. The critical piece is just relying on a registry >>>>>>>>>>> in the >>>>>>>>>>> SDK to look up types and to/from functions rather than storing them >>>>>>>>>>> in the >>>>>>>>>>> portable schema itself. >>>>>>>>>>> >>>>>>>>>>> I kind of like keeping the LogicalType message generic for now, >>>>>>>>>>> since it gives us a way to try out these various approaches, but >>>>>>>>>>> maybe >>>>>>>>>>> that's just a cop out. >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy >>>>>>>>>>> >>>>>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <[email protected]> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base >>>>>>>>>>>>>>> type of Row and we should think about that. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm a little concerned that the current proposals for a >>>>>>>>>>>>>>> portable representation don't actually fully represent Schemas. 
>>>>>>>>>>>>>>> It seems to >>>>>>>>>>>>>>> me that the current java-only Schemas are made up of three >>>>>>>>>>>>>>> concepts that are >>>>>>>>>>>>>>> intertwined: >>>>>>>>>>>>>>> (a) The Java SDK specific code for schema inference, type >>>>>>>>>>>>>>> coercion, and "schema-aware" transforms. >>>>>>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a >>>>>>>>>>>>>>> particular Schema[3]. >>>>>>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a >>>>>>>>>>>>>>> particular schema, and functions for converting Rows with that >>>>>>>>>>>>>>> schema >>>>>>>>>>>>>>> to/from a Java type T. Those functions and the RowCoder are >>>>>>>>>>>>>>> then composed >>>>>>>>>>>>>>> to provide a Coder for the type T. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> RowCoder is currently just an internal implementation detail, >>>>>>>>>>>>>> it can be eliminated. SchemaCoder is the only thing that >>>>>>>>>>>>>> determines a >>>>>>>>>>>>>> schema today. >>>>>>>>>>>>>> >>>>>>>>>>>>> Why not keep it around? I think it would make sense to have a >>>>>>>>>>>>> RowCoder implementation in every SDK, as well as something like >>>>>>>>>>>>> SchemaCoder >>>>>>>>>>>>> that defines a conversion from that SDK's "Row" to the language >>>>>>>>>>>>> type. >>>>>>>>>>>> >>>>>>>>>>>> The point is that from a programmer's perspective, there is >>>>>>>>>>>> nothing much special about Row. Any type can have a schema, and >>>>>>>>>>>> the only >>>>>>>>>>>> special thing about Row is that it's always guaranteed to exist. >>>>>>>>>>>> From that >>>>>>>>>>>> standpoint, Row is nearly an implementation detail. Today RowCoder >>>>>>>>>>>> is never >>>>>>>>>>>> set on _any_ PCollection, it's literally just used as a helper >>>>>>>>>>>> library, so >>>>>>>>>>>> there's no real need for it to exist as a "Coder." 
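The (b)/(c) relationship described above can be sketched in a few lines of Python. These are illustrative stand-ins only; the real RowCoder and SchemaCoder are Java classes with considerably more machinery:

```python
# Minimal sketch of concepts (b) and (c). Not actual Beam code.
import json

class RowCoder:
    """(b): encodes Rows (here plain dicts) that have a particular schema."""
    def __init__(self, schema):
        self.schema = schema                  # e.g. ["name", "age"]
    def encode(self, row):
        return json.dumps([row[f] for f in self.schema]).encode()
    def decode(self, data):
        return dict(zip(self.schema, json.loads(data)))

class SchemaCoder:
    """(c): composes a RowCoder with to/from functions for a type T."""
    def __init__(self, schema, to_row, from_row):
        self.row_coder = RowCoder(schema)
        self.to_row, self.from_row = to_row, from_row
    def encode(self, value):
        return self.row_coder.encode(self.to_row(value))
    def decode(self, data):
        return self.from_row(self.row_coder.decode(data))

class Person:
    def __init__(self, name, age):
        self.name, self.age = name, age

coder = SchemaCoder(
    ["name", "age"],
    to_row=lambda p: {"name": p.name, "age": p.age},
    from_row=lambda r: Person(r["name"], r["age"]))
p = coder.decode(coder.encode(Person("Ada", 36)))
assert (p.name, p.age) == ("Ada", 36)
```

The sketch makes the composition explicit: the to/from functions are the only part that knows about the language type `Person`; everything below them operates on schema'd rows.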
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We're not concerned with (a) at this time since that's >>>>>>>>>>>>>>> specific to the SDK, not the interface between them. My >>>>>>>>>>>>>>> understanding is we >>>>>>>>>>>>>>> just want to define a portable representation for (b) and/or >>>>>>>>>>>>>>> (c). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> What has been discussed so far is really just a portable >>>>>>>>>>>>>>> representation for (b), the RowCoder, since the discussion is >>>>>>>>>>>>>>> only around >>>>>>>>>>>>>>> how to represent the schema itself and not the to/from >>>>>>>>>>>>>>> functions. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Correct. The to/from functions are actually related to (a). >>>>>>>>>>>>>> One of the big goals of schemas was that users should not be >>>>>>>>>>>>>> forced to >>>>>>>>>>>>>> operate on rows to get schemas. A user can create >>>>>>>>>>>>>> PCollection<MyRandomType> >>>>>>>>>>>>>> and as long as the SDK can infer a schema from MyRandomType, the >>>>>>>>>>>>>> user never >>>>>>>>>>>>>> needs to even see a Row object. The to/fromRow functions are >>>>>>>>>>>>>> what make this >>>>>>>>>>>>>> work today. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> One of the points I'd like to make is that this type coercion >>>>>>>>>>>>> is a useful concept on its own, separate from schemas. It's >>>>>>>>>>>>> especially >>>>>>>>>>>>> useful for a type that has a schema and is encoded by RowCoder >>>>>>>>>>>>> since that >>>>>>>>>>>>> can represent many more types, but the type coercion doesn't have >>>>>>>>>>>>> to be >>>>>>>>>>>>> tied to just schemas and RowCoder. We could also do type coercion >>>>>>>>>>>>> for types >>>>>>>>>>>>> that are effectively wrappers around an integer or a string. It >>>>>>>>>>>>> could just >>>>>>>>>>>>> be a general way to map language types to base types (i.e. types >>>>>>>>>>>>> that we >>>>>>>>>>>>> have a coder for). 
Then it just becomes a general framework for >>>>>>>>>>>>> extending >>>>>>>>>>>>> coders to represent more language types. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Let's not tie those conversations. Maybe a similar concept will >>>>>>>>>>>> hold true for general coders (or we might decide to get rid of >>>>>>>>>>>> coders in >>>>>>>>>>>> favor of schemas, in which case that becomes moot), but I don't >>>>>>>>>>>> think we >>>>>>>>>>>> should prematurely generalize. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> One of the outstanding questions for that schema >>>>>>>>>>>>>>> representation is how to represent logical types, which may or >>>>>>>>>>>>>>> may not have >>>>>>>>>>>>>>> some language type in each SDK (the canonical example being a >>>>>>>>>>>>>>> timestamp type with seconds and nanos and java.time.Instant). I >>>>>>>>>>>>>>> think this >>>>>>>>>>>>>>> question is critically important, because (c), the SchemaCoder, >>>>>>>>>>>>>>> is actually >>>>>>>>>>>>>>> *defining a logical type* with a language type T in the Java >>>>>>>>>>>>>>> SDK. This >>>>>>>>>>>>>>> becomes clear when you compare SchemaCoder[4] to the >>>>>>>>>>>>>>> Schema.LogicalType >>>>>>>>>>>>>>> interface[5] - both essentially have three attributes: a base >>>>>>>>>>>>>>> type, and two >>>>>>>>>>>>>>> functions for converting to/from that base type. The only >>>>>>>>>>>>>>> difference is for >>>>>>>>>>>>>>> SchemaCoder that base type must be a Row so it can be >>>>>>>>>>>>>>> represented by a >>>>>>>>>>>>>>> Schema alone, while LogicalType can have any base type that can >>>>>>>>>>>>>>> be >>>>>>>>>>>>>>> represented by FieldType, including a Row. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is not true actually. SchemaCoder can have any base >>>>>>>>>>>>>> type, that's why (in Java) it's SchemaCoder<T>. This is why >>>>>>>>>>>>>> PCollection<T> >>>>>>>>>>>>>> can have a schema, even if T is not Row. 
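As a concrete illustration of the timestamp example mentioned above, a portable logical type pairs a language type with a base representation and two conversion functions. A minimal Python sketch, with a hypothetical URN (not a real Beam standard type):

```python
# Sketch of a portable logical type: a timestamp whose base representation
# is an int64 (millis since epoch). The URN and class are hypothetical.
import datetime

class MillisInstantType:
    urn = "beam:logical:millis_instant:v1"   # hypothetical URN
    representation = "int64"                 # the base type we have a coder for

    @staticmethod
    def to_representation(dt):
        return int(dt.timestamp() * 1000)

    @staticmethod
    def from_representation(millis):
        return datetime.datetime.fromtimestamp(
            millis / 1000.0, tz=datetime.timezone.utc)

dt = datetime.datetime(2019, 6, 5, tzinfo=datetime.timezone.utc)
assert MillisInstantType.from_representation(
    MillisInstantType.to_representation(dt)) == dt
```

An SDK that recognizes the URN can surface `datetime` (or `java.time.Instant`) to users; one that does not simply sees an int64 field.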
>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> I'm not sure I effectively communicated what I meant - When I >>>>>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was >>>>>>>>>>>>> referring to >>>>>>>>>>>>> the base FieldType, whose coder we use for this type. I meant >>>>>>>>>>>>> "base type" >>>>>>>>>>>>> to be analogous to LogicalType's `getBaseType`, or what Kenn is >>>>>>>>>>>>> suggesting >>>>>>>>>>>>> we call "representation" in the portable beam schemas doc. To >>>>>>>>>>>>> define some >>>>>>>>>>>>> terms from my original message: >>>>>>>>>>>>> base type = an instance of FieldType, crucially this is >>>>>>>>>>>>> something that we have a coder for (be it VarIntCoder, Utf8Coder, >>>>>>>>>>>>> RowCoder, >>>>>>>>>>>>> ...) >>>>>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java >>>>>>>>>>>>> class (or something analogous in the other SDKs) that we may or >>>>>>>>>>>>> may not >>>>>>>>>>>>> have a coder for. It's possible to define functions for converting >>>>>>>>>>>>> instances of the language type to/from the base type. >>>>>>>>>>>>> >>>>>>>>>>>>> I was just trying to make the case that SchemaCoder is really >>>>>>>>>>>>> a special case of LogicalType, where `getBaseType` always returns >>>>>>>>>>>>> a Row >>>>>>>>>>>>> with the stored Schema. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Yeah, I think I got that point. >>>>>>>>>>>> >>>>>>>>>>>> Can you propose what the protos would look like in this case? >>>>>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>>>>> functions in >>>>>>>>>>>> the proto. Do you think we'll need to add these in? 
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to >>>>>>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations >>>>>>>>>>>>> of >>>>>>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say >>>>>>>>>>>>> we should >>>>>>>>>>>>> or shouldn't do this, just using it to illustrate my point): >>>>>>>>>>>>> >>>>>>>>>>>>> class SchemaCoder<T> extends CustomCoder<T> implements >>>>>>>>>>>>> Schema.LogicalType<T, Row> { >>>>>>>>>>>>> ... >>>>>>>>>>>>> >>>>>>>>>>>>> @Override >>>>>>>>>>>>> public FieldType getBaseType() { >>>>>>>>>>>>> return FieldType.row(getSchema()); >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> @Override >>>>>>>>>>>>> public Row toBaseType(T input) { >>>>>>>>>>>>> return this.toRowFunction.apply(input); >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> @Override >>>>>>>>>>>>> public T toInputType(Row base) { >>>>>>>>>>>>> return this.fromRowFunction.apply(base); >>>>>>>>>>>>> } >>>>>>>>>>>>> ... >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> I think it may make sense to fully embrace this duality, by >>>>>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and >>>>>>>>>>>>>>> renaming it to >>>>>>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK >>>>>>>>>>>>>>> schema-aware >>>>>>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a >>>>>>>>>>>>>>> Row base >>>>>>>>>>>>>>> type. Perhaps some of the current schema logic could also be >>>>>>>>>>>>>>> applied more >>>>>>>>>>>>>>> generally to any logical type - for example, to provide type >>>>>>>>>>>>>>> coercion for >>>>>>>>>>>>>>> logical types with a base type other than Row, like int64 and a >>>>>>>>>>>>>>> timestamp >>>>>>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. 
>>>>>>>>>>>>>>> And having a >>>>>>>>>>>>>>> portable representation that represents those (non Row backed) >>>>>>>>>>>>>>> logical >>>>>>>>>>>>>>> types with some URN would also allow us to pass them to other >>>>>>>>>>>>>>> languages >>>>>>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use >>>>>>>>>>>>>>> SchemaCoder. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think the actual overlap here is between the to/from >>>>>>>>>>>>>> functions in SchemaCoder (which is what allows SchemaCoder<T> >>>>>>>>>>>>>> where T != >>>>>>>>>>>>>> Row) and the equivalent functionality in LogicalType. However >>>>>>>>>>>>>> making all of >>>>>>>>>>>>>> schemas simply just a logical type feels a bit awkward and >>>>>>>>>>>>>> circular to me. >>>>>>>>>>>>>> Maybe we should refactor that part out into a >>>>>>>>>>>>>> LogicalTypeConversion proto, >>>>>>>>>>>>>> and reference that from both LogicalType and from SchemaCoder? >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> LogicalType is already potentially circular though. A schema >>>>>>>>>>>>> can have a field with a logical type, and that logical type can >>>>>>>>>>>>> have a base >>>>>>>>>>>>> type of Row with a field with a logical type (and on and on...). >>>>>>>>>>>>> To me it >>>>>>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is just >>>>>>>>>>>>> a special >>>>>>>>>>>>> case of this concept. >>>>>>>>>>>>> >>>>>>>>>>>>> Something like the LogicalTypeConversion proto would >>>>>>>>>>>>> definitely be an improvement, but I would still prefer just using >>>>>>>>>>>>> a >>>>>>>>>>>>> top-level logical type :) >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I've added a section to the doc [6] to propose this >>>>>>>>>>>>>>> alternative in the context of the portable representation but I >>>>>>>>>>>>>>> wanted to >>>>>>>>>>>>>>> bring it up here as well to solicit feedback. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41 >>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59 >>>>>>>>>>>>>>> [3] >>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48 >>>>>>>>>>>>>>> [4] >>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33 >>>>>>>>>>>>>>> [5] >>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489 >>>>>>>>>>>>>>> [6] >>>>>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Ah thanks! I added some language there. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> *From: *Kenneth Knowles <[email protected]> >>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM >>>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> *From: *Brian Hulette <[email protected]> >>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM >>>>>>>>>>>>>>>>> *To: * <[email protected]> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam >>>>>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason >>>>>>>>>>>>>>>>>> not to this was >>>>>>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam >>>>>>>>>>>>>>>>>> schemas. But >>>>>>>>>>>>>>>>>> given that large iterables aren't currently implemented, >>>>>>>>>>>>>>>>>> beam schemas look >>>>>>>>>>>>>>>>>> very similar to arrow schemas. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow >>>>>>>>>>>>>>>>>> schemas where possible, and maybe even copy them outright. >>>>>>>>>>>>>>>>>> Arrow already >>>>>>>>>>>>>>>>>> has a portable (flatbuffers) schema representation [2], and >>>>>>>>>>>>>>>>>> implementations >>>>>>>>>>>>>>>>>> for it in many languages that we may be able to re-use as we >>>>>>>>>>>>>>>>>> bring schemas >>>>>>>>>>>>>>>>>> to more SDKs (the project has Python and Go >>>>>>>>>>>>>>>>>> implementations). There are a >>>>>>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for >>>>>>>>>>>>>>>>>> the format and >>>>>>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or >>>>>>>>>>>>>>>>>> not they are >>>>>>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness field), >>>>>>>>>>>>>>>>>> but if you >>>>>>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to >>>>>>>>>>>>>>>>>> the beam proto >>>>>>>>>>>>>>>>>> spec. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out >>>>>>>>>>>>>>>>> what the differences are and why, and conversely what the >>>>>>>>>>>>>>>>> interop >>>>>>>>>>>>>>>>> opportunities may be. Such sections are some of my favorite >>>>>>>>>>>>>>>>> sections of >>>>>>>>>>>>>>>>> design docs. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Kenn >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Brian >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E >>>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *From: *Robert Bradshaw <[email protected]> >>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM >>>>>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> From: Reuven Lax <[email protected]> >>>>>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM >>>>>>>>>>>>>>>>>>> To: dev >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > Also in the future we might be able to do >>>>>>>>>>>>>>>>>>> optimizations at the runner level if at the portability >>>>>>>>>>>>>>>>>>> layer we understood >>>>>>>>>>>>>>>>>>> schemas instead of just raw coders. This could be things >>>>>>>>>>>>>>>>>>> like only parsing >>>>>>>>>>>>>>>>>>> a subset of a row (if we know only a few fields are >>>>>>>>>>>>>>>>>>> accessed) or using a >>>>>>>>>>>>>>>>>>> columnar data structure like Arrow to encode batches of >>>>>>>>>>>>>>>>>>> rows across >>>>>>>>>>>>>>>>>>> portability. This doesn't affect data semantics of course, >>>>>>>>>>>>>>>>>>> but having a >>>>>>>>>>>>>>>>>>> richer, more-expressive type system opens up other >>>>>>>>>>>>>>>>>>> opportunities. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we >>>>>>>>>>>>>>>>>>> understood to designate >>>>>>>>>>>>>>>>>>> the type(s), right? 
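The "only parsing a subset of a row" optimization mentioned above can be sketched as a simple projection over schema fields. This is illustrative Python only; a real runner would push the projection below decoding so the skipped fields are never materialized at all:

```python
# Sketch: if the runner understands the schema, it can keep only the
# fields a transform actually accesses. Names are illustrative.
schema = ["id", "name", "payload"]

def project(rows, wanted):
    """Return each row restricted to the accessed fields, in request order."""
    idx = [schema.index(f) for f in wanted]
    return [[row[i] for i in idx] for row in rows]

rows = [[1, "a", b"x" * 1000], [2, "b", b"y" * 1000]]
# A transform that only reads "id" never pays for the large payload field.
assert project(rows, ["id"]) == [[1], [2]]
```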
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space >>>>>>>>>>>>>>>>>>> of Coders with >>>>>>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to >>>>>>>>>>>>>>>>>>> materialize to something >>>>>>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing >>>>>>>>>>>>>>>>>>> something big here...) >>>>>>>>>>>>>>>>>>> >> This may make a backwards-compatible transition >>>>>>>>>>>>>>>>>>> easier. (SDK-side, the >>>>>>>>>>>>>>>>>>> >> ability to reason about and operate on such types is >>>>>>>>>>>>>>>>>>> of course much >>>>>>>>>>>>>>>>>>> >> richer than anything Coders offer right now.) >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> From: Reuven Lax <[email protected]> >>>>>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM >>>>>>>>>>>>>>>>>>> >> To: dev >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no >>>>>>>>>>>>>>>>>>> coders. We could define the entire model on top of schemas. >>>>>>>>>>>>>>>>>>> Today's "Coder" >>>>>>>>>>>>>>>>>>> is completely equivalent to a single-field schema with a >>>>>>>>>>>>>>>>>>> logical-type field >>>>>>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you >>>>>>>>>>>>>>>>>>> aren't forced to >>>>>>>>>>>>>>>>>>> serialize into bytes). >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort >>>>>>>>>>>>>>>>>>> that would be involved in such a change, I think the >>>>>>>>>>>>>>>>>>> practical decision >>>>>>>>>>>>>>>>>>> should be for schemas and coders to coexist for the time >>>>>>>>>>>>>>>>>>> being. However >>>>>>>>>>>>>>>>>>> when we start planning Beam 3.0, deprecating coders is >>>>>>>>>>>>>>>>>>> something I would >>>>>>>>>>>>>>>>>>> like to suggest. 
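The claimed equivalence above — any Coder is a single-field schema whose one field is a logical type — can be sketched as follows (illustrative Python; the wrapper class and field names are hypothetical):

```python
# Sketch: wrapping an opaque coder as a one-field row schema.
# The class and field names are made up for illustration.
class CoderAsSchema:
    """Views an arbitrary encode/decode pair as a single-field schema."""
    def __init__(self, coder_encode, coder_decode):
        # One field whose type is a logical type over bytes.
        self.fields = [("value", "logical:bytes")]
        self._enc, self._dec = coder_encode, coder_decode

    def to_row(self, element):
        return {"value": self._enc(element)}

    def from_row(self, row):
        return self._dec(row["value"])

# A UTF-8 string "coder" viewed as a single-field schema:
s = CoderAsSchema(lambda x: x.encode("utf-8"), lambda b: b.decode("utf-8"))
assert s.from_row(s.to_row("hello")) == "hello"
```

As the thread notes, the schema view is strictly more expressive: nothing forces the logical field to bottom out in bytes.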
> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <[email protected]> wrote:

From: Kenneth Knowles <[email protected]>
Date: Thu, May 9, 2019 at 10:05 AM
To: dev

> This is a huge development. Top posting because I can be more compact.
>
> I really think that after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.

Thanks! Added some comments there.

> *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.

As per the other discussion, I'm unsure the value in supporting multiple timestamp resolutions is high enough to outweigh the cost.

> *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains? (There's also compactness in representation, encoded and in-memory, though I'm not sure that's high.)

> *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.

I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.

> *Multimap*: what does it add over an array-valued map or a large-iterable-valued map? (honest question, not rhetorical)

Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
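Robert's multimap point - containment is a question about a (key, value) pair, and unions of non-disjoint keys merge their values - can be sketched on top of a plain map<K, list<V>>. A toy illustration, not a proposed Beam type:

```python
from collections import defaultdict

# A multimap differs from an array-valued map mainly in its notions of
# containment and union, even if it is stored as map<K, list<V>> inside.

class Multimap:
    def __init__(self):
        self._data = defaultdict(list)

    def put(self, k, v):
        self._data[k].append(v)

    def contains(self, k, v):
        # Containment asks about a (key, value) pair, not a whole list.
        return v in self._data[k]

    def union(self, other):
        # Keys need not be disjoint: overlapping keys merge their values.
        out = Multimap()
        for m in (self, other):
            for k, vs in m._data.items():
                out._data[k].extend(vs)
        return out

a, b = Multimap(), Multimap()
a.put("x", 1)
b.put("x", 2)
merged = a.union(b)
assert merged.contains("x", 1) and merged.contains("x", 2)
```

Since this can be layered on the existing map and array types, the open question in the thread is whether those semantics justify a new primitive.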
> *URN/enum for type names*: I see the case for both. The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is more clear, while a URN emphasizes the similarity of built-in and logical types.

Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.

Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to Representation.
There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).

- Robert
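One possible reading of this suggestion, as a sketch: every type is a logical type, and a primitive is simply a logical type whose mapping to its representation is the identity. The URNs and class shape below are illustrative only, not the actual proto definitions:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch: all types share one LogicalType shape. A primitive carries an
# empty payload and is its own representation (identity mapping), while
# a derived logical type points at another type as its representation.

@dataclass
class LogicalType:
    urn: str
    representation: Optional["LogicalType"] = None  # None => primitive
    payload: bytes = b""

    def is_primitive(self):
        # A primitive is its own representation (identity mapping).
        return self.representation is None

INT64 = LogicalType("beam:type:int64:v1")
# A nanosecond timestamp represented as int64, per the thread's
# discussion of multiple timestamp resolutions:
NANOS_TIMESTAMP = LogicalType("beam:type:nanos_instant:v1",
                              representation=INT64)

assert INT64.is_primitive()
assert not NANOS_TIMESTAMP.is_primitive()
```

Under this reading there is no separate TypeName enum at all: a oneof/enum becomes a closed set of well-known URNs, which is the "same footing" property Robert argues for.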
