On Fri, Jun 7, 2019 at 4:35 AM Robert Burke <rob...@frantil.com> wrote:
> Wouldn't SDK specific types always be under the "coders" component instead
> of the logical type listing?
>
> Offhand, having a separate normalized listing of logical schema types in
> the pipeline components message of the types seems about right. Then
> they're unambiguous, but can also either refer to other logical types or
> existing coders as needed. When SDKs don't understand a given coder, the
> field could be just represented by a blob of bytes.

A key difference between a not-understood coder and a not-understood
logical type is that a logical type has a representation in terms of
primitive types, so it can always be understood through those, even if an
SDK does not treat it specially.

Kenn

> On Wed, Jun 5, 2019, 11:29 PM Brian Hulette <bhule...@google.com> wrote:
>
>> If we want to have a Pipeline level registry, we could add it to
>> Components [1].
>>
>> message Components {
>>   ...
>>   map<string, LogicalType> logical_types;
>> }
>>
>> And in FieldType reference the logical types by id:
>>
>>   oneof field_type {
>>     AtomicType atomic_type;
>>     ArrayType array_type;
>>     ...
>>     string logical_type_id;    // was LogicalType logical_type;
>>   }
>>
>> I'm not sure I like this idea though. The reason we started discussing a
>> "registry" was just to separate the SDK-specific bits from the
>> representation type, and this doesn't accomplish that, it just de-dupes
>> logical types used across the pipeline.
>>
>> I think instead I'd rather just come back to the message we have now in
>> the doc, used directly in FieldType's oneof:
>>
>> message LogicalType {
>>   FieldType representation = 1;
>>   string logical_urn = 2;
>>   bytes logical_payload = 3;
>> }
>>
>> We can have a URN for SDK-specific types (user type aliases), like
>> "beam:logical:javasdk", and the logical_payload could itself be a protobuf
>> with attributes of 1) a serialized class and 2/3) to/from functions.
>> For truly portable types it would instead have a well-known URN and
>> optionally a logical_payload with some agreed-upon representation of
>> parameters.
>>
>> It seems like maybe SdkFunctionSpec/Environment should be used for this
>> somehow, but I can't find a good example of this in the Runner API to use
>> as a model. For example, what we're trying to accomplish is basically the
>> same as Java custom coders vs. standard coders. But that is accomplished
>> with a magic "javasdk" URN, as I suggested here, not with Environment
>> [2,3]. There is a "TODO: standardize such things" where that URN is
>> defined; is it possible that Environment is that standard and just hasn't
>> been utilized for custom coders yet?
>>
>> Brian
>>
>> [1] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
>> [2] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
>> [3] https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121
>>
>> On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette <bhule...@google.com> wrote:
>>
>>> Yeah that's what I meant. It does seem reasonable to scope any
>>> registry by pipeline and not by PCollection. Then it seems we would want
>>> the entire LogicalType (including the `FieldType representation` field)
>>> as the value type, and not just LogicalTypeConversion. Otherwise we're
>>> separating the representations from the conversions, and duplicating the
>>> representations. You did say a "registry of logical types", so maybe that
>>> is what you meant.
>>>
>>> Brian
>>>
>>> On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>>> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>
>>>>>>> > It has to go into the proto somewhere (since that's the only way
>>>>>>> > the SDK can get it), but I'm not sure they should be considered
>>>>>>> > integral parts of the type.
>>>>>>>
>>>>>>> Are you just advocating for an approach where any SDK-specific
>>>>>>> information is stored outside of the Schema message itself so that
>>>>>>> Schema really does just represent the type? That seems reasonable to
>>>>>>> me, and alleviates my concerns about how this applies to columnar
>>>>>>> encodings a bit as well.
>>>>>>
>>>>>> Yes, that's exactly what I'm advocating.
>>>>>>
>>>>>>> We could lift all of the LogicalTypeConversion messages out of the
>>>>>>> Schema and the LogicalType like this:
>>>>>>>
>>>>>>> message SchemaCoder {
>>>>>>>   Schema schema = 1;
>>>>>>>   LogicalTypeConversion root_conversion = 2;
>>>>>>>   // only necessary for user type aliases, portable logical types
>>>>>>>   // by definition have nothing SDK-specific
>>>>>>>   map<string, LogicalTypeConversion> attribute_conversions = 3;
>>>>>>> }
>>>>>>
>>>>>> I'm not sure what the map is for? I think we have status quo without
>>>>>> it.
>>>>>
>>>>> My intention was that the SDK-specific information (to/from functions)
>>>>> for any nested fields that are themselves user type aliases would be
>>>>> stored in this map. That was the motivation for my next question, if we
>>>>> don't allow user types to be nested within other user types we may not
>>>>> need it.
>>>>
>>>> Oh, is this meant to contain the ids of all the logical types in this
>>>> schema?
>>>> If so I don't think SchemaCoder is the right place for this. Any
>>>> "registry" of logical types should be global to the pipeline, not scoped
>>>> to a single PCollection IMO.
>>>>
>>>>> I may be missing your meaning - but I think we currently only have
>>>>> status quo without this map in the Java SDK because Schema.LogicalType
>>>>> is just an interface that must be implemented. It's appropriate for
>>>>> just portable logical types, not user-type aliases. Note I've adopted
>>>>> Kenn's terminology where portable logical type is a type that can be
>>>>> identified by just a URN and maybe some parameters, while a user type
>>>>> alias needs some SDK specific information, like a class and to/from
>>>>> UDFs.
>>>>>
>>>>>>> I think a critical question (that has implications for the above
>>>>>>> proposal) is how/if the two different concepts Kenn mentioned are
>>>>>>> allowed to nest. For example, you could argue it's redundant to have
>>>>>>> a user type alias that has a Row representation with a field that is
>>>>>>> itself a user type alias, because instead you could just have a
>>>>>>> single top-level type alias with to/from functions that pack and
>>>>>>> unpack the entire hierarchy. On the other hand, I think it does make
>>>>>>> sense for a user type alias or a truly portable logical type to have
>>>>>>> a field that is itself a truly portable logical type (e.g. a user
>>>>>>> type alias or portable type with a DateTime).
>>>>>>>
>>>>>>> I've been assuming that user-type aliases could be nested, but
>>>>>>> should we disallow that? Or should we go the other way and require
>>>>>>> that logical types define at most one "level"?
>>>>>>
>>>>>> No I think it's useful to allow things to be nested (though of course
>>>>>> the nesting must terminate).
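
A minimal Java sketch of the nesting discussed above, under the assumption that every logical type carries a representation which is either a primitive type or another logical type. All names here (`FieldType`, `Primitive`, `Logical`, `resolve`, the `example:` URNs) are hypothetical stand-ins for illustration, not the Beam API or the proposed protos, and Row representations with their own fields are left out for brevity. It shows how an SDK that recognizes none of the URNs could still fall back to the primitive representation, provided the nesting terminates.

```java
public class LogicalTypeResolution {
    /** Hypothetical stand-in for the proto FieldType: primitive or logical. */
    interface FieldType {}

    /** A built-in type such as INT64 or BYTES. */
    static final class Primitive implements FieldType {
        final String name;

        Primitive(String name) {
            this.name = name;
        }
    }

    /** A logical type: a URN plus a representation that may itself be logical. */
    static final class Logical implements FieldType {
        final String urn;
        final FieldType representation;

        Logical(String urn, FieldType representation) {
            this.urn = urn;
            this.representation = representation;
        }
    }

    /**
     * Unwraps nested logical types until a primitive is reached. The values are
     * immutable and built bottom-up, so the chain is finite and the loop ends.
     * In this sketch the only non-logical FieldType is Primitive, so the final
     * cast is safe.
     */
    static Primitive resolve(FieldType type) {
        FieldType current = type;
        while (current instanceof Logical) {
            current = ((Logical) current).representation;
        }
        return (Primitive) current;
    }

    public static void main(String[] args) {
        // A user type alias whose representation is itself a logical type.
        FieldType millis = new Logical("example:logical:millis", new Primitive("INT64"));
        FieldType alias = new Logical("example:logical:my_alias", millis);
        System.out.println(resolve(alias).name); // prints INT64
    }
}
```

The design point is only that resolution never needs the to/from functions: those stay SDK-specific, while the representation chain alone is enough to process the value.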
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>
>>>>>>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>>
>>>>>>>>> So I feel a bit leery about making the to/from functions a
>>>>>>>>> fundamental part of the portability representation. In my mind,
>>>>>>>>> that is very tied to a specific SDK/language. An SDK (say the Java
>>>>>>>>> SDK) wants to allow users to use a wide variety of native types
>>>>>>>>> with schemas, and under the covers uses the to/from functions to
>>>>>>>>> implement that. However from the portable Beam perspective, the
>>>>>>>>> schema itself should be the real "type" of the PCollection; the
>>>>>>>>> to/from methods are simply a way that a particular SDK makes
>>>>>>>>> schemas easier to use. It has to go into the proto somewhere (since
>>>>>>>>> that's the only way the SDK can get it), but I'm not sure they
>>>>>>>>> should be considered integral parts of the type.
>>>>>>>>
>>>>>>>> On the doc in a couple places this distinction was made:
>>>>>>>>
>>>>>>>> * For truly portable logical types, no instructions for the SDK are
>>>>>>>>   needed. Instead, they require:
>>>>>>>>    - URN: a standardized identifier any SDK can recognize
>>>>>>>>    - A spec: what is the universe of values in this type?
>>>>>>>>    - A representation: how is it represented in built-in types?
>>>>>>>>      This is how SDKs who do not know/care about the URN will
>>>>>>>>      process it
>>>>>>>>    - (optional): SDKs choose preferred SDK-specific types to embed
>>>>>>>>      the values in. SDKs have to know about the URN and choose for
>>>>>>>>      themselves.
>>>>>>>>
>>>>>>>> * For user-level type aliases, written as convenience by the user in
>>>>>>>>   their pipeline, what Java schemas have today:
>>>>>>>>    - to/from UDFs: the code is SDK-specific
>>>>>>>>    - some representation of the intended type (like java class):
>>>>>>>>      also SDK specific
>>>>>>>>    - a representation
>>>>>>>>    - any "id" is just like other ids in the pipeline, just avoiding
>>>>>>>>      duplicating the proto
>>>>>>>>    - Luke points out that nesting these can give multiple SDKs a
>>>>>>>>      hint
>>>>>>>>
>>>>>>>> In my mind the remaining complexity is whether or not we need to be
>>>>>>>> able to move between the two. Composite PTransforms, for example, do
>>>>>>>> have fluidity between being strictly user-defined versus portable
>>>>>>>> URN+payload. But it requires lots of engineering, namely the current
>>>>>>>> work on expansion service.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ah I see, I didn't realize that. Then I suppose we'll need
>>>>>>>>>> to/from functions somewhere in the logical type conversion to
>>>>>>>>>> preserve the current behavior.
>>>>>>>>>>
>>>>>>>>>> I'm still a little hesitant to make these functions an explicit
>>>>>>>>>> part of LogicalTypeConversion for another reason. Down the road,
>>>>>>>>>> schemas could give us an avenue to use a batched columnar format
>>>>>>>>>> (presumably arrow, but of course others are possible). By making
>>>>>>>>>> to/from an explicit part of logical types we add some element-wise
>>>>>>>>>> logic to a schema representation that's otherwise agnostic to
>>>>>>>>>> element-wise vs. batched encodings.
>>>>>>>>>>
>>>>>>>>>> I suppose you could make an argument that to/from are only for
>>>>>>>>>> custom types.
>>>>>>>>>> There will also be some set of well-known types identified only by
>>>>>>>>>> URN and some parameters, which could easily be translated to a
>>>>>>>>>> columnar format. We could just not support custom types fully if
>>>>>>>>>> we add a columnar encoding, or maybe add optional
>>>>>>>>>> toBatch/fromBatch functions when/if we get there.
>>>>>>>>>>
>>>>>>>>>> What about something like this that makes the two different types
>>>>>>>>>> of logical types explicit?
>>>>>>>>>>
>>>>>>>>>> // Describes a logical type and how to convert between it and its
>>>>>>>>>> // representation (e.g. Row).
>>>>>>>>>> message LogicalTypeConversion {
>>>>>>>>>>   oneof conversion {
>>>>>>>>>>     Standard standard = 1;
>>>>>>>>>>     Custom custom = 2;
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   message Standard {
>>>>>>>>>>     string urn = 1;
>>>>>>>>>>     repeated string args = 2; // could also be a map
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   message Custom {
>>>>>>>>>>     FunctionSpec toRepresentation = 1;   // FunctionSpec(?)
>>>>>>>>>>     FunctionSpec fromRepresentation = 2; // FunctionSpec(?)
>>>>>>>>>>     bytes type = 3; // e.g. serialized class for Java
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> And LogicalType and Schema become:
>>>>>>>>>>
>>>>>>>>>> message LogicalType {
>>>>>>>>>>   FieldType representation = 1;
>>>>>>>>>>   LogicalTypeConversion conversion = 2;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> message Schema {
>>>>>>>>>>   ...
>>>>>>>>>>   repeated Field fields = 1;
>>>>>>>>>>   // implied that representation is Row
>>>>>>>>>>   LogicalTypeConversion conversion = 2;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Keep in mind that right now the SchemaRegistry is only assumed
>>>>>>>>>>> to exist at graph-construction time, not at execution time; all
>>>>>>>>>>> information in the schema registry is embedded in the
>>>>>>>>>>> SchemaCoder, which is the only thing we keep around when the
>>>>>>>>>>> pipeline is actually running. We could look into changing this,
>>>>>>>>>>> but it would potentially be a very big change, and I do think we
>>>>>>>>>>> should start getting users actively using schemas soon.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Can you propose what the protos would look like in this case?
>>>>>>>>>>>> > Right now LogicalType does not contain the to/from conversion
>>>>>>>>>>>> > functions in the proto. Do you think we'll need to add these
>>>>>>>>>>>> > in?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe. Right now the proposed LogicalType message is pretty
>>>>>>>>>>>> simple/generic:
>>>>>>>>>>>>
>>>>>>>>>>>> message LogicalType {
>>>>>>>>>>>>   FieldType representation = 1;
>>>>>>>>>>>>   string logical_urn = 2;
>>>>>>>>>>>>   bytes logical_payload = 3;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> If we keep just logical_urn and logical_payload, the
>>>>>>>>>>>> logical_payload could itself be a protobuf with attributes of 1)
>>>>>>>>>>>> a serialized class and 2/3) to/from functions. Or,
>>>>>>>>>>>> alternatively, we could have a generalization of the
>>>>>>>>>>>> SchemaRegistry for logical types. Implementations for standard
>>>>>>>>>>>> types and user-defined types would be registered by URN, and the
>>>>>>>>>>>> SDK could look them up given just a URN.
>>>>>>>>>>>> I put a brief section about this alternative in the doc last
>>>>>>>>>>>> week [1]. What I suggested there included removing the
>>>>>>>>>>>> logical_payload field, which is probably overkill. The critical
>>>>>>>>>>>> piece is just relying on a registry in the SDK to look up types
>>>>>>>>>>>> and to/from functions rather than storing them in the portable
>>>>>>>>>>>> schema itself.
>>>>>>>>>>>>
>>>>>>>>>>>> I kind of like keeping the LogicalType message generic for now,
>>>>>>>>>>>> since it gives us a way to try out these various approaches, but
>>>>>>>>>>>> maybe that's just a cop out.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base
>>>>>>>>>>>>>>>> type of Row and we should think about that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm a little concerned that the current proposals for a
>>>>>>>>>>>>>>>> portable representation don't actually fully represent
>>>>>>>>>>>>>>>> Schemas.
>>>>>>>>>>>>>>>> It seems to me that the current java-only Schemas are made
>>>>>>>>>>>>>>>> up of three concepts that are intertwined:
>>>>>>>>>>>>>>>>  (a) The Java SDK specific code for schema inference, type
>>>>>>>>>>>>>>>>      coercion, and "schema-aware" transforms.
>>>>>>>>>>>>>>>>  (b) A RowCoder[1] that encodes Rows[2] which have a
>>>>>>>>>>>>>>>>      particular Schema[3].
>>>>>>>>>>>>>>>>  (c) A SchemaCoder[4] that has a RowCoder for a particular
>>>>>>>>>>>>>>>>      schema, and functions for converting Rows with that
>>>>>>>>>>>>>>>>      schema to/from a Java type T. Those functions and the
>>>>>>>>>>>>>>>>      RowCoder are then composed to provide a Coder for the
>>>>>>>>>>>>>>>>      type T.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RowCoder is currently just an internal implementation
>>>>>>>>>>>>>>> detail, it can be eliminated. SchemaCoder is the only thing
>>>>>>>>>>>>>>> that determines a schema today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why not keep it around? I think it would make sense to have a
>>>>>>>>>>>>>> RowCoder implementation in every SDK, as well as something
>>>>>>>>>>>>>> like SchemaCoder that defines a conversion from that SDK's
>>>>>>>>>>>>>> "Row" to the language type.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The point is that from a programmer's perspective, there is
>>>>>>>>>>>>> nothing much special about Row. Any type can have a schema, and
>>>>>>>>>>>>> the only special thing about Row is that it's always guaranteed
>>>>>>>>>>>>> to exist. From that standpoint, Row is nearly an implementation
>>>>>>>>>>>>> detail. Today RowCoder is never set on _any_ PCollection, it's
>>>>>>>>>>>>> literally just used as a helper library, so there's no real
>>>>>>>>>>>>> need for it to exist as a "Coder."
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're not concerned with (a) at this time since that's
>>>>>>>>>>>>>>>> specific to the SDK, not the interface between them. My
>>>>>>>>>>>>>>>> understanding is we just want to define a portable
>>>>>>>>>>>>>>>> representation for (b) and/or (c).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What has been discussed so far is really just a portable
>>>>>>>>>>>>>>>> representation for (b), the RowCoder, since the discussion
>>>>>>>>>>>>>>>> is only around how to represent the schema itself and not
>>>>>>>>>>>>>>>> the to/from functions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Correct. The to/from functions are actually related to (a).
>>>>>>>>>>>>>>> One of the big goals of schemas was that users should not be
>>>>>>>>>>>>>>> forced to operate on rows to get schemas. A user can create
>>>>>>>>>>>>>>> PCollection<MyRandomType> and as long as the SDK can infer a
>>>>>>>>>>>>>>> schema from MyRandomType, the user never needs to even see a
>>>>>>>>>>>>>>> Row object. The to/fromRow functions are what make this work
>>>>>>>>>>>>>>> today.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One of the points I'd like to make is that this type coercion
>>>>>>>>>>>>>> is a useful concept on its own, separate from schemas. It's
>>>>>>>>>>>>>> especially useful for a type that has a schema and is encoded
>>>>>>>>>>>>>> by RowCoder since that can represent many more types, but the
>>>>>>>>>>>>>> type coercion doesn't have to be tied to just schemas and
>>>>>>>>>>>>>> RowCoder. We could also do type coercion for types that are
>>>>>>>>>>>>>> effectively wrappers around an integer or a string. It could
>>>>>>>>>>>>>> just be a general way to map language types to base types
>>>>>>>>>>>>>> (i.e. types that we have a coder for).
>>>>>>>>>>>>>> Then it just becomes a general framework for extending coders
>>>>>>>>>>>>>> to represent more language types.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's not tie those conversations. Maybe a similar concept
>>>>>>>>>>>>> will hold true for general coders (or we might decide to get
>>>>>>>>>>>>> rid of coders in favor of schemas, in which case that becomes
>>>>>>>>>>>>> moot), but I don't think we should prematurely generalize.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One of the outstanding questions for that schema
>>>>>>>>>>>>>>>> representation is how to represent logical types, which may
>>>>>>>>>>>>>>>> or may not have some language type in each SDK (the
>>>>>>>>>>>>>>>> canonical example being a timestamp type with seconds and
>>>>>>>>>>>>>>>> nanos and java.time.Instant). I think this question is
>>>>>>>>>>>>>>>> critically important, because (c), the SchemaCoder, is
>>>>>>>>>>>>>>>> actually *defining a logical type* with a language type T in
>>>>>>>>>>>>>>>> the Java SDK. This becomes clear when you compare
>>>>>>>>>>>>>>>> SchemaCoder[4] to the Schema.LogicalType interface[5] - both
>>>>>>>>>>>>>>>> essentially have three attributes: a base type, and two
>>>>>>>>>>>>>>>> functions for converting to/from that base type. The only
>>>>>>>>>>>>>>>> difference is for SchemaCoder that base type must be a Row
>>>>>>>>>>>>>>>> so it can be represented by a Schema alone, while
>>>>>>>>>>>>>>>> LogicalType can have any base type that can be represented
>>>>>>>>>>>>>>>> by FieldType, including a Row.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is not true actually. SchemaCoder can have any base
>>>>>>>>>>>>>>> type, that's why (in Java) it's SchemaCoder<T>. This is why
>>>>>>>>>>>>>>> PCollection<T> can have a schema, even if T is not Row.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure I effectively communicated what I meant - When I
>>>>>>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was
>>>>>>>>>>>>>> referring to the base FieldType, whose coder we use for this
>>>>>>>>>>>>>> type. I meant "base type" to be analogous to LogicalType's
>>>>>>>>>>>>>> `getBaseType`, or what Kenn is suggesting we call
>>>>>>>>>>>>>> "representation" in the portable beam schemas doc. To define
>>>>>>>>>>>>>> some terms from my original message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> base type = an instance of FieldType, crucially this is
>>>>>>>>>>>>>> something that we have a coder for (be it VarIntCoder,
>>>>>>>>>>>>>> Utf8Coder, RowCoder, ...)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java
>>>>>>>>>>>>>> class (or something analogous in the other SDKs) that we may
>>>>>>>>>>>>>> or may not have a coder for. It's possible to define functions
>>>>>>>>>>>>>> for converting instances of the language type to/from the base
>>>>>>>>>>>>>> type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was just trying to make the case that SchemaCoder is really
>>>>>>>>>>>>>> a special case of LogicalType, where `getBaseType` always
>>>>>>>>>>>>>> returns a Row with the stored Schema.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, I think I got that point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you propose what the protos would look like in this case?
>>>>>>>>>>>>> Right now LogicalType does not contain the to/from conversion
>>>>>>>>>>>>> functions in the proto. Do you think we'll need to add these
>>>>>>>>>>>>> in?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to
>>>>>>>>>>>>>> implement Schema.LogicalType<T, Row> with trivial
>>>>>>>>>>>>>> implementations of getBaseType, toBaseType, and toInputType
>>>>>>>>>>>>>> (I'm not trying to say we should or shouldn't do this, just
>>>>>>>>>>>>>> using it to illustrate my point):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> class SchemaCoder<T> extends CustomCoder<T>
>>>>>>>>>>>>>>     implements Schema.LogicalType<T, Row> {
>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   // The base type is always a Row with the stored schema.
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public FieldType getBaseType() {
>>>>>>>>>>>>>>     return FieldType.row(getSchema());
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   // Language type T -> Row
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public Row toBaseType(T input) {
>>>>>>>>>>>>>>     return this.toRowFunction.apply(input);
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   // Row -> language type T
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public T toInputType(Row base) {
>>>>>>>>>>>>>>     return this.fromRowFunction.apply(base);
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think it may make sense to fully embrace this duality, by
>>>>>>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and
>>>>>>>>>>>>>>>> renaming it to LogicalTypeCoder/LanguageTypeCoder. The
>>>>>>>>>>>>>>>> current Java SDK schema-aware transforms (a) would operate
>>>>>>>>>>>>>>>> only on LogicalTypeCoders with a Row base type. Perhaps some
>>>>>>>>>>>>>>>> of the current schema logic could also be applied more
>>>>>>>>>>>>>>>> generally to any logical type - for example, to provide type
>>>>>>>>>>>>>>>> coercion for logical types with a base type other than Row,
>>>>>>>>>>>>>>>> like int64 and a timestamp class backed by millis, or fixed
>>>>>>>>>>>>>>>> size bytes and a UUID class.
>>>>>>>>>>>>>>>> And having a portable representation that represents those
>>>>>>>>>>>>>>>> (non Row backed) logical types with some URN would also
>>>>>>>>>>>>>>>> allow us to pass them to other languages without
>>>>>>>>>>>>>>>> unnecessarily wrapping them in a Row in order to use
>>>>>>>>>>>>>>>> SchemaCoder.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the actual overlap here is between the to/from
>>>>>>>>>>>>>>> functions in SchemaCoder (which is what allows SchemaCoder<T>
>>>>>>>>>>>>>>> where T != Row) and the equivalent functionality in
>>>>>>>>>>>>>>> LogicalType. However making all of schemas simply just a
>>>>>>>>>>>>>>> logical type feels a bit awkward and circular to me. Maybe we
>>>>>>>>>>>>>>> should refactor that part out into a LogicalTypeConversion
>>>>>>>>>>>>>>> proto, and reference that from both LogicalType and from
>>>>>>>>>>>>>>> SchemaCoder?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> LogicalType is already potentially circular though. A schema
>>>>>>>>>>>>>> can have a field with a logical type, and that logical type
>>>>>>>>>>>>>> can have a base type of Row with a field with a logical type
>>>>>>>>>>>>>> (and on and on...). To me it seems elegant, not awkward, to
>>>>>>>>>>>>>> recognize that SchemaCoder is just a special case of this
>>>>>>>>>>>>>> concept.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Something like the LogicalTypeConversion proto would
>>>>>>>>>>>>>> definitely be an improvement, but I would still prefer just
>>>>>>>>>>>>>> using a top-level logical type :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've added a section to the doc [6] to propose this
>>>>>>>>>>>>>>>> alternative in the context of the portable representation
>>>>>>>>>>>>>>>> but I wanted to bring it up here as well to solicit
>>>>>>>>>>>>>>>> feedback.
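
The non-Row-backed case raised in this thread (fixed-size bytes and a UUID class) might look roughly like the following in Java. This is a sketch under stated assumptions: the `Conversion` interface is a hypothetical stand-in that only mirrors the shape described above (a base type plus to/from functions) and is not the Beam `Schema.LogicalType` interface itself; the class names and the big-endian 16-byte layout are likewise assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidLogicalTypeSketch {
    /** Hypothetical analogue of a logical type: language type T, base type B. */
    interface Conversion<T, B> {
        B toBaseType(T input);

        T toInputType(B base);
    }

    /** Maps java.util.UUID to a fixed-size 16-byte representation and back. */
    static final class UuidAsBytes implements Conversion<UUID, byte[]> {
        @Override
        public byte[] toBaseType(UUID input) {
            // Assumed layout: most significant 8 bytes first, then least significant.
            return ByteBuffer.allocate(16)
                .putLong(input.getMostSignificantBits())
                .putLong(input.getLeastSignificantBits())
                .array();
        }

        @Override
        public UUID toInputType(byte[] base) {
            if (base.length != 16) {
                throw new IllegalArgumentException("expected 16 bytes, got " + base.length);
            }
            ByteBuffer buffer = ByteBuffer.wrap(base);
            long mostSignificant = buffer.getLong();
            long leastSignificant = buffer.getLong();
            return new UUID(mostSignificant, leastSignificant);
        }
    }

    public static void main(String[] args) {
        UuidAsBytes conversion = new UuidAsBytes();
        UUID original = UUID.randomUUID();
        byte[] base = conversion.toBaseType(original);
        UUID roundTripped = conversion.toInputType(base);
        System.out.println(base.length + " bytes, round trip ok: " + original.equals(roundTripped));
    }
}
```

An SDK that does not recognize the type's URN would simply see the 16-byte value; only an SDK that knows the URN would apply the conversion.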
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>>>>>>>>>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>>>>>>>>>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>>>>>>>>>>>>>> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>>>>>>>>>>>>>> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>>>>>>>>>>>>>> [6] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ah thanks! I added some language there.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *From: *Kenneth Knowles <k...@apache.org>
>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *From: *Brian Hulette <bhule...@google.com>
>>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>>>>>>>>>>>>>> *To: * <dev@beam.apache.org>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam
>>>>>>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest
>>>>>>>>>>>>>>>>>>> reason not to do this was that we wanted to have a type
>>>>>>>>>>>>>>>>>>> for large iterables in beam schemas.
>>>>>>>>>>>>>>>>>>> But given that large iterables aren't currently
>>>>>>>>>>>>>>>>>>> implemented, beam schemas look very similar to arrow
>>>>>>>>>>>>>>>>>>> schemas.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow
>>>>>>>>>>>>>>>>>>> schemas where possible, and maybe even copy them
>>>>>>>>>>>>>>>>>>> outright. Arrow already has a portable (flatbuffers)
>>>>>>>>>>>>>>>>>>> schema representation [2], and implementations for it in
>>>>>>>>>>>>>>>>>>> many languages that we may be able to re-use as we bring
>>>>>>>>>>>>>>>>>>> schemas to more SDKs (the project has Python and Go
>>>>>>>>>>>>>>>>>>> implementations). There are a couple of concepts in Arrow
>>>>>>>>>>>>>>>>>>> schemas that are specific for the format and wouldn't
>>>>>>>>>>>>>>>>>>> make sense for us, (fields can indicate whether or not
>>>>>>>>>>>>>>>>>>> they are dictionary encoded, and the schema has an
>>>>>>>>>>>>>>>>>>> endianness field), but if you drop those concepts the
>>>>>>>>>>>>>>>>>>> arrow spec looks pretty similar to the beam proto spec.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out
>>>>>>>>>>>>>>>>>> what the differences are and why, and conversely what the
>>>>>>>>>>>>>>>>>> interop opportunities may be. Such sections are some of my
>>>>>>>>>>>>>>>>>> favorite sections of design docs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *From: *Robert Bradshaw <rober...@google.com>
>>>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> From: Reuven Lax <re...@google.com>
>>>>>>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>>>>>>>>>>>>>> To: dev
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > Also in the future we might be able to do
>>>>>>>>>>>>>>>>>>>> > optimizations at the runner level if at the
>>>>>>>>>>>>>>>>>>>> > portability layer we understood schemas instead of
>>>>>>>>>>>>>>>>>>>> > just raw coders. This could be things like only
>>>>>>>>>>>>>>>>>>>> > parsing a subset of a row (if we know only a few
>>>>>>>>>>>>>>>>>>>> > fields are accessed) or using a columnar data
>>>>>>>>>>>>>>>>>>>> > structure like Arrow to encode batches of rows across
>>>>>>>>>>>>>>>>>>>> > portability. This doesn't affect data semantics of
>>>>>>>>>>>>>>>>>>>> > course, but having a richer, more-expressive type
>>>>>>>>>>>>>>>>>>>> > system opens up other opportunities.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we
>>>>>>>>>>>>>>>>>>>> understood to designate the type(s), right?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> From: Reuven Lax <re...@google.com>
>>>>>>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>>>>>>>>>>>>>> >> To: dev
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive as you aren't forced to serialize into bytes).
>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>>>>>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>>>>>>>>>>>>>> >> >> To: dev
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because I can be more compact.
>>>>>>>>>>>>>>>>>>>> >> >> >
>>>>>>>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value in supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains? (There's also compactness in representation, encoded and in-memory, though I'm not sure that's high.)
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued map or large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
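To make the multimap distinction concrete, here is a minimal Python sketch. It is an illustration only: the `Multimap` class is hypothetical and not a Beam API, and it assumes one reading of the reply above, namely that an array-valued map contains key-to-list entries (so merging two maps that share a key needs a policy), whereas a multimap contains (key, value) pairs (so a union is just a union of pairs).

```python
from collections import Counter


class Multimap:
    """Hypothetical multimap: an unordered bag of (key, value) pairs."""

    def __init__(self, pairs=()):
        self._pairs = Counter(pairs)  # multiset of (key, value) pairs

    def __contains__(self, pair):
        # Containment is asked about a (key, value) pair, not a key -> list entry.
        return self._pairs[pair] > 0

    def union(self, other):
        # Keys need not be disjoint: pairs from both sides are simply combined.
        merged = Multimap()
        merged._pairs = self._pairs + other._pairs
        return merged

    def values(self, key):
        # Order is not part of the model, so sort only to get a stable view.
        return sorted(
            v for (k, v), n in self._pairs.items() if k == key for _ in range(n)
        )


# Array-valued maps: a plain dict merge drops one side's list for a shared key.
left = {"a": [1, 2]}
right = {"a": [3], "b": [4]}
dict_merge = {**left, **right}

# Multimaps: the same data merges without any per-key policy.
mm = Multimap([("a", 1), ("a", 2)]).union(Multimap([("a", 3), ("b", 4)]))
```

Whether that difference justifies a new primitive type, rather than a convention on top of array-valued maps, is exactly the open question in the reply above.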
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both. The core types are fundamental enough they should never really change - after all, proto, thrift, avro, arrow, have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is more clear, while URN emphasizes the similarity of built-in and logical types.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to Representation. There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> - Robert
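The closing suggestion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Representation` enum, the `LogicalType` class, and the URN strings are hypothetical names, not the Beam proto or any SDK API. Every type is a logical type identified by a URN; primitive types map to their representation by the identity, while a richer type (here a millisecond timestamp stored as INT64) supplies its own conversion functions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Callable


class Representation(Enum):
    """Hypothetical set of underlying representations (the renamed "TypeName")."""
    INT64 = "int64"
    STRING = "string"
    BYTES = "bytes"


def identity(value: Any) -> Any:
    return value


@dataclass(frozen=True)
class LogicalType:
    urn: str                        # identifies the type; the URNs below are made up
    representation: Representation
    payload: bytes = b""            # typically empty for primitive types
    to_representation: Callable[[Any], Any] = identity
    from_representation: Callable[[Any], Any] = identity


# A primitive type is just a logical type whose mapping is the identity.
INT64 = LogicalType("example:logical:int64", Representation.INT64)

# A non-primitive type converts to and from the same underlying representation.
MILLIS_TIMESTAMP = LogicalType(
    urn="example:logical:millis_timestamp",
    representation=Representation.INT64,
    to_representation=lambda dt: int(dt.timestamp() * 1000),
    from_representation=lambda ms: datetime.fromtimestamp(ms / 1000, tz=timezone.utc),
)

moment = datetime(2019, 5, 9, 13, 38, tzinfo=timezone.utc)
encoded = MILLIS_TIMESTAMP.to_representation(moment)     # plain integer
decoded = MILLIS_TIMESTAMP.from_representation(encoded)  # back to a datetime
```

Under this reading, an SDK that does not recognize `example:logical:millis_timestamp` could still handle the field as a plain INT64, since the representation is always one of the known primitives.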