On Fri, Jun 7, 2019 at 4:35 AM Robert Burke <rob...@frantil.com> wrote:

> Wouldn't SDK specific types always be under the "coders" component instead
> of the logical type listing?
>
> Offhand, having a separate normalized listing of logical schema types in
> the pipeline's components message seems about right. Then
> they're unambiguous, but can also either refer to other logical types or
> existing coders as needed. When SDKs don't understand a given coder, the
> field could be just represented by a blob of bytes.
>

A key difference between a not-understood coder and a not-understood
logical type is that a logical type has a representation in terms of
primitive types, so it can always be understood through those, even if an
SDK does not treat it specially.
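For a concrete sketch (illustrative only: the URN and field names below are invented, and the message shape follows the LogicalType proto discussed later in this thread), a timestamp logical type carries its row representation with it, so an SDK that does not recognize the URN can still decode and process the field as that row rather than as an opaque blob of bytes:

```proto
# Illustrative only: URN, field names, and exact message shape are
# assumptions, not part of any agreed proposal.
logical_type {
  logical_urn: "beam:logical_type:timestamp:v1"
  representation {
    row_type {
      fields { name: "seconds" type { atomic_type: INT64 } }
      fields { name: "nanos" type { atomic_type: INT32 } }
    }
  }
}
```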

Kenn


>
>
>
> On Wed, Jun 5, 2019, 11:29 PM Brian Hulette <bhule...@google.com> wrote:
>
>> If we want to have a Pipeline level registry, we could add it to
>> Components [1].
>>
>> message Components {
>>   ...
>>   map<string, LogicalType> logical_types;
>> }
>>
>> And in FieldType reference the logical types by id:
>> oneof field_type {
>>   AtomicType atomic_type;
>>   ArrayType array_type;
>>   ...
>>   string logical_type_id;    // was LogicalType logical_type;
>> }
>>
>> I'm not sure I like this idea though. The reason we started discussing a
>> "registry" was just to separate the SDK-specific bits from the
>> representation type, and this doesn't accomplish that; it just de-dupes
>> logical types used
>> across the pipeline.
>>
>> I think instead I'd rather just come back to the message we have now in
>> the doc, used directly in FieldType's oneof:
>>
>> message LogicalType {
>>   FieldType representation = 1;
>>   string logical_urn = 2;
>>   bytes logical_payload = 3;
>> }
>>
>> We can have a URN for SDK-specific types (user type aliases), like
>> "beam:logical:javasdk", and the logical_payload could itself be a protobuf
>> with attributes of 1) a serialized class and 2/3) to/from functions. For
>> truly portable types it would instead have a well-known URN and optionally
>> a logical_payload with some agreed-upon representation of parameters.
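One hedged sketch of what that javasdk payload proto might look like (the message name, field names, and field numbers here are all illustrative, not a proposal):

```proto
// Illustrative only: a possible logical_payload for the
// "beam:logical:javasdk" URN. Names and field numbers are assumptions.
message JavaSdkLogicalTypePayload {
  bytes serialized_class = 1;     // the SDK type T being aliased
  bytes to_row_function = 2;      // serialized T -> representation UDF
  bytes from_row_function = 3;    // serialized representation -> T UDF
}
```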
>>
>> It seems like maybe SdkFunctionSpec/Environment should be used for this
>> somehow, but I can't find a good example of this in the Runner API to use
>> as a model. For example, what we're trying to accomplish is basically the
>> same as Java custom coders vs. standard coders. But that is accomplished
>> with a magic "javasdk" URN, as I suggested here, not with Environment
>> [2,3]. There is a "TODO: standardize such things" where that URN is
>> defined, is it possible that Environment is that standard and just hasn't
>> been utilized for custom coders yet?
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
>> [2]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
>> [3]
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121
>>
>> On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette <bhule...@google.com> wrote:
>>
>>> Yeah that's what I meant. It does seem reasonable to scope any
>>> registry by pipeline and not by PCollection. Then it seems we would want
>>> the entire LogicalType (including the `FieldType representation` field) as
>>> the value type, and not just LogicalTypeConversion. Otherwise we're
>>> separating the representations from the conversions, and duplicating the
>>> representations. You did say a "registry of logical types", so maybe that
>>> is what you meant.
>>>
>>> Brian
>>>
>>> On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <bhule...@google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <bhule...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > It has to go into the proto somewhere (since that's the only way
>>>>>>> the SDK can get it), but I'm not sure they should be considered integral
>>>>>>> parts of the type.
>>>>>>> Are you just advocating for an approach where any SDK-specific
>>>>>>> information is stored outside of the Schema message itself so that 
>>>>>>> Schema
>>>>>>> really does just represent the type? That seems reasonable to me, and
>>>>>>> alleviates my concerns about how this applies to columnar encodings a 
>>>>>>> bit
>>>>>>> as well.
>>>>>>>
>>>>>>
>>>>>> Yes, that's exactly what I'm advocating.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> We could lift all of the LogicalTypeConversion messages out of the
>>>>>>> Schema and the LogicalType like this:
>>>>>>>
>>>>>>> message SchemaCoder {
>>>>>>>   Schema schema = 1;
>>>>>>>   LogicalTypeConversion root_conversion = 2;
>>>>>>>   map<string, LogicalTypeConversion> attribute_conversions = 3; //
>>>>>>> only necessary for user type aliases, portable logical types by 
>>>>>>> definition
>>>>>>> have nothing SDK-specific
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> I'm not sure what the map is for? I think we have the status quo without
>>>>>> it.
>>>>>>
>>>>>
>>>>> My intention was that the SDK-specific information (to/from functions)
>>>>> for any nested fields that are themselves user type aliases would be 
>>>>> stored
>>>>> in this map. That was the motivation for my next question, if we don't
>>>>> allow user types to be nested within other user types we may not need it.
>>>>>
>>>>
>>>> Oh, is this meant to contain the ids of all the logical types in this
>>>> schema? If so I don't think SchemaCoder is the right place for this. Any
>>>> "registry" of logical types should be global to the pipeline, not scoped to
>>>> a single PCollection IMO.
>>>>
>>>>
>>>>> I may be missing your meaning - but I think we currently only have
>>>>> status quo without this map in the Java SDK because Schema.LogicalType is
>>>>> just an interface that must be implemented. It's appropriate for just
>>>>> portable logical types, not user-type aliases. Note I've adopted Kenn's
>>>>> terminology where portable logical type is a type that can be identified 
>>>>> by
>>>>> just a URN and maybe some parameters, while a user type alias needs some
>>>>> SDK specific information, like a class and to/from UDFs.
>>>>>
>>>>>
>>>>>>
>>>>>>> I think a critical question (that has implications for the above
>>>>>>> proposal) is how/if the two different concepts Kenn mentioned are 
>>>>>>> allowed
>>>>>>> to nest. For example, you could argue it's redundant to have a user type
>>>>>>> alias that has a Row representation with a field that is itself a user 
>>>>>>> type
>>>>>>> alias, because instead you could just have a single top-level type alias
>>>>>>> with to/from functions that pack and unpack the entire hierarchy. On the
>>>>>>> other hand, I think it does make sense for a user type alias or a truly
>>>>>>> portable logical type to have a field that is itself a truly portable
>>>>>>> logical type (e.g. a user type alias or portable type with a DateTime).
>>>>>>>
>>>>>>> I've been assuming that user-type aliases could be nested, but
>>>>>>> should we disallow that? Or should we go the other way and require that
>>>>>>> logical types define at most one "level"?
>>>>>>>
>>>>>>
>>>>>> No I think it's useful to allow things to be nested (though of course
>>>>>> the nesting must terminate).
>>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <k...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <re...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> So I feel a bit leery about making the to/from functions a
>>>>>>>>> fundamental part of the portability representation. In my mind, that 
>>>>>>>>> is
>>>>>>>>> very tied to a specific SDK/language. An SDK (say the Java SDK) wants
>>>>>>>>> to
>>>>>>>>> allow users to use a wide variety of native types with schemas, and 
>>>>>>>>> under
>>>>>>>>> the covers uses the to/from functions to implement that. However from 
>>>>>>>>> the
>>>>>>>>> portable Beam perspective, the schema itself should be the real 
>>>>>>>>> "type" of
>>>>>>>>> the PCollection; the to/from methods are simply a way that a 
>>>>>>>>> particular SDK
>>>>>>>>> makes schemas easier to use. It has to go into the proto somewhere 
>>>>>>>>> (since
>>>>>>>>> that's the only way the SDK can get it), but I'm not sure they should 
>>>>>>>>> be
>>>>>>>>> considered integral parts of the type.
>>>>>>>>>
>>>>>>>>
>>>>>>>> On the doc in a couple places this distinction was made:
>>>>>>>>
>>>>>>>> * For truly portable logical types, no instructions for the SDK are
>>>>>>>> needed. Instead, they require:
>>>>>>>>    - URN: a standardized identifier any SDK can recognize
>>>>>>>>    - A spec: what is the universe of values in this type?
>>>>>>>>    - A representation: how is it represented in built-in types?
>>>>>>>> This is how SDKs who do not know/care about the URN will process it
>>>>>>>>    - (optional): SDKs choose preferred SDK-specific types to embed
>>>>>>>> the values in. SDKs have to know about the URN and choose for 
>>>>>>>> themselves.
>>>>>>>>
>>>>>>>> * For user-level type aliases, written as a convenience by the user in
>>>>>>>> their pipeline, what Java schemas have today:
>>>>>>>>    - to/from UDFs: the code is SDK-specific
>>>>>>>>    - some representation of the intended type (like java class):
>>>>>>>> also SDK specific
>>>>>>>>    - a representation
>>>>>>>>    - any "id" is just like other ids in the pipeline, just avoiding
>>>>>>>> duplicating the proto
>>>>>>>>    - Luke points out that nesting these can give multiple SDKs a
>>>>>>>> hint
>>>>>>>>
>>>>>>>> In my mind the remaining complexity is whether or not we need to be
>>>>>>>> able to move between the two. Composite PTransforms, for example, do 
>>>>>>>> have
>>>>>>>> fluidity between being strictly user-defined versus portable 
>>>>>>>> URN+payload.
>>>>>>>> But it requires lots of engineering, namely the current work on 
>>>>>>>> expansion
>>>>>>>> service.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <bhule...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Ah I see, I didn't realize that. Then I suppose we'll need
>>>>>>>>>> to/from functions somewhere in the logical type conversion to 
>>>>>>>>>> preserve the
>>>>>>>>>> current behavior.
>>>>>>>>>>
>>>>>>>>>> I'm still a little hesitant to make these functions an explicit
>>>>>>>>>> part of LogicalTypeConversion for another reason. Down the road, 
>>>>>>>>>> schemas
>>>>>>>>>> could give us an avenue to use a batched columnar format (presumably 
>>>>>>>>>> arrow,
>>>>>>>>>> but of course others are possible). By making to/from an explicit 
>>>>>>>>>> part of
>>>>>>>>>> logical types we add some element-wise logic to a schema 
>>>>>>>>>> representation
>>>>>>>>>> that's otherwise agnostic to element-wise vs. batched encodings.
>>>>>>>>>>
>>>>>>>>>> I suppose you could make an argument that to/from are only for
>>>>>>>>>> custom types. There will also be some set of well-known types 
>>>>>>>>>> identified
>>>>>>>>>> only by URN and some parameters, which could easily be translated to 
>>>>>>>>>> a
>>>>>>>>>> columnar format. We could just not support custom types fully if we 
>>>>>>>>>> add a
>>>>>>>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions
>>>>>>>>>> when/if we get there.
>>>>>>>>>>
>>>>>>>>>> What about something like this that makes the two different types
>>>>>>>>>> of logical types explicit?
>>>>>>>>>>
>>>>>>>>>> // Describes a logical type and how to convert between it and its
>>>>>>>>>> representation (e.g. Row).
>>>>>>>>>> message LogicalTypeConversion {
>>>>>>>>>>   oneof conversion {
>>>>>>>>>>     Standard standard = 1;
>>>>>>>>>>     Custom custom = 2;
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   message Standard {
>>>>>>>>>>     string urn = 1;
>>>>>>>>>>     repeated string args = 2; // could also be a map
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   message Custom {
>>>>>>>>>>     FunctionSpec(?) toRepresentation = 1;
>>>>>>>>>>     FunctionSpec(?) fromRepresentation = 2;
>>>>>>>>>>     bytes type = 3; // e.g. serialized class for Java
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> And LogicalType and Schema become:
>>>>>>>>>>
>>>>>>>>>> message LogicalType {
>>>>>>>>>>   FieldType representation = 1;
>>>>>>>>>>   LogicalTypeConversion conversion = 2;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> message Schema {
>>>>>>>>>>   ...
>>>>>>>>>>   repeated Field fields = 1;
>>>>>>>>>>   LogicalTypeConversion conversion = 2; // implied that
>>>>>>>>>> representation is Row
>>>>>>>>>> }
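As a hedged illustration of the Standard case, a parameterized decimal type might look like this on the wire (the URN and args are invented for this sketch):

```proto
# Illustrative only: the portable case needs no SDK-specific code, just
# a URN plus parameters any SDK can interpret.
conversion {
  standard {
    urn: "beam:logical_type:fixed_decimal:v1"
    args: "38"  # e.g. precision
    args: "9"   # e.g. scale
  }
}
```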
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <re...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Keep in mind that right now the SchemaRegistry is only assumed
>>>>>>>>>>> to exist at graph-construction time, not at execution time; all 
>>>>>>>>>>> information
>>>>>>>>>>> in the schema registry is embedded in the SchemaCoder, which is the 
>>>>>>>>>>> only
>>>>>>>>>>> thing we keep around when the pipeline is actually running. We 
>>>>>>>>>>> could look
>>>>>>>>>>> into changing this, but it would potentially be a very big change, 
>>>>>>>>>>> and I do
>>>>>>>>>>> think we should start getting users actively using schemas soon.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <
>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Can you propose what the protos would look like in this case?
>>>>>>>>>>>> Right now LogicalType does not contain the to/from conversion 
>>>>>>>>>>>> functions in
>>>>>>>>>>>> the proto. Do you think we'll need to add these in?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe. Right now the proposed LogicalType message is pretty
>>>>>>>>>>>> simple/generic:
>>>>>>>>>>>> message LogicalType {
>>>>>>>>>>>>   FieldType representation = 1;
>>>>>>>>>>>>   string logical_urn = 2;
>>>>>>>>>>>>   bytes logical_payload = 3;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> If we keep just logical_urn and logical_payload, the
>>>>>>>>>>>> logical_payload could itself be a protobuf with attributes of 1) a
>>>>>>>>>>>> serialized class and 2/3) to/from functions. Or, alternatively, we 
>>>>>>>>>>>> could
>>>>>>>>>>>> have a generalization of the SchemaRegistry for logical types.
>>>>>>>>>>>> Implementations for standard types and user-defined types would be
>>>>>>>>>>>> registered by URN, and the SDK could look them up given just a 
>>>>>>>>>>>> URN. I put a
>>>>>>>>>>>> brief section about this alternative in the doc last week [1]. 
>>>>>>>>>>>> What I
>>>>>>>>>>>> suggested there included removing the logical_payload field, which 
>>>>>>>>>>>> is
>>>>>>>>>>>> probably overkill. The critical piece is just relying on a 
>>>>>>>>>>>> registry in the
>>>>>>>>>>>> SDK to look up types and to/from functions rather than storing 
>>>>>>>>>>>> them in the
>>>>>>>>>>>> portable schema itself.
>>>>>>>>>>>>
>>>>>>>>>>>> I kind of like keeping the LogicalType message generic for now,
>>>>>>>>>>>> since it gives us a way to try out these various approaches, but 
>>>>>>>>>>>> maybe
>>>>>>>>>>>> that's just a cop out.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <
>>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <
>>>>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base
>>>>>>>>>>>>>>>> type of Row and we should think about that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm a little concerned that the current proposals for a
>>>>>>>>>>>>>>>> portable representation don't actually fully represent 
>>>>>>>>>>>>>>>> Schemas. It seems to
>>>>>>>>>>>>>>>> me that the current java-only Schemas are made up three 
>>>>>>>>>>>>>>>> concepts that are
>>>>>>>>>>>>>>>> intertwined:
>>>>>>>>>>>>>>>> (a) The Java SDK specific code for schema inference, type
>>>>>>>>>>>>>>>> coercion, and "schema-aware" transforms.
>>>>>>>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a
>>>>>>>>>>>>>>>> particular Schema[3].
>>>>>>>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a
>>>>>>>>>>>>>>>> particular schema, and functions for converting Rows with that 
>>>>>>>>>>>>>>>> schema
>>>>>>>>>>>>>>>> to/from a Java type T. Those functions and the RowCoder are 
>>>>>>>>>>>>>>>> then composed
>>>>>>>>>>>>>>>> to provide a Coder for the type T.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RowCoder is currently just an internal implementation
>>>>>>>>>>>>>>> detail, it can be eliminated. SchemaCoder is the only thing 
>>>>>>>>>>>>>>> that determines
>>>>>>>>>>>>>>> a schema today.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why not keep it around? I think it would make sense to have a
>>>>>>>>>>>>>> RowCoder implementation in every SDK, as well as something like 
>>>>>>>>>>>>>> SchemaCoder
>>>>>>>>>>>>>> that defines a conversion from that SDK's "Row" to the language 
>>>>>>>>>>>>>> type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The point is that from a programmer's perspective, there is
>>>>>>>>>>>>> nothing much special about Row. Any type can have a schema, and 
>>>>>>>>>>>>> the only
>>>>>>>>>>>>> special thing about Row is that it's always guaranteed to exist. 
>>>>>>>>>>>>> From that
>>>>>>>>>>>>> standpoint, Row is nearly an implementation detail. Today 
>>>>>>>>>>>>> RowCoder is never
>>>>>>>>>>>>> set on _any_ PCollection, it's literally just used as a helper 
>>>>>>>>>>>>> library, so
>>>>>>>>>>>>> there's no real need for it to exist as a "Coder."
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're not concerned with (a) at this time since that's
>>>>>>>>>>>>>>>> specific to the SDK, not the interface between them. My 
>>>>>>>>>>>>>>>> understanding is we
>>>>>>>>>>>>>>>> just want to define a portable representation for (b) and/or 
>>>>>>>>>>>>>>>> (c).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What has been discussed so far is really just a portable
>>>>>>>>>>>>>>>> representation for (b), the RowCoder, since the discussion is 
>>>>>>>>>>>>>>>> only around
>>>>>>>>>>>>>>>> how to represent the schema itself and not the to/from 
>>>>>>>>>>>>>>>> functions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Correct. The to/from functions are actually related to a).
>>>>>>>>>>>>>>> One of the big goals of schemas was that users should not be 
>>>>>>>>>>>>>>> forced to
>>>>>>>>>>>>>>> operate on rows to get schemas. A user can create 
>>>>>>>>>>>>>>> PCollection<MyRandomType>
>>>>>>>>>>>>>>> and as long as the SDK can infer a schema from MyRandomType, 
>>>>>>>>>>>>>>> the user never
>>>>>>>>>>>>>>> needs to even see a Row object. The to/fromRow functions are 
>>>>>>>>>>>>>>> what make this
>>>>>>>>>>>>>>> work today.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One of the points I'd like to make is that this type coercion
>>>>>>>>>>>>>> is a useful concept on its own, separate from schemas. It's
>>>>>>>>>>>>>> especially
>>>>>>>>>>>>>> useful for a type that has a schema and is encoded by RowCoder 
>>>>>>>>>>>>>> since that
>>>>>>>>>>>>>> can represent many more types, but the type coercion doesn't 
>>>>>>>>>>>>>> have to be
>>>>>>>>>>>>>> tied to just schemas and RowCoder. We could also do type 
>>>>>>>>>>>>>> coercion for types
>>>>>>>>>>>>>> that are effectively wrappers around an integer or a string. It 
>>>>>>>>>>>>>> could just
>>>>>>>>>>>>>> be a general way to map language types to base types (i.e. types 
>>>>>>>>>>>>>> that we
>>>>>>>>>>>>>> have a coder for). Then it just becomes a general framework for 
>>>>>>>>>>>>>> extending
>>>>>>>>>>>>>> coders to represent more language types.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's not tie those conversations. Maybe a similar concept
>>>>>>>>>>>>> will hold true for general coders (or we might decide to get rid 
>>>>>>>>>>>>> of coders
>>>>>>>>>>>>> in favor of schemas, in which case that becomes moot), but I 
>>>>>>>>>>>>> don't think we
>>>>>>>>>>>>> should prematurely generalize.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One of the outstanding questions for that schema
>>>>>>>>>>>>>>>> representation is how to represent logical types, which may or 
>>>>>>>>>>>>>>>> may not have
>>>>>>>>>>>>>>>> some language type in each SDK (the canonical example being a
>>>>>>>>>>>>>>>> timestamp type with seconds and nanos and java.time.Instant).
>>>>>>>>>>>>>>>> I think this
>>>>>>>>>>>>>>>> question is critically important, because (c), the 
>>>>>>>>>>>>>>>> SchemaCoder, is actually
>>>>>>>>>>>>>>>> *defining a logical type* with a language type T in the Java 
>>>>>>>>>>>>>>>> SDK. This
>>>>>>>>>>>>>>>> becomes clear when you compare SchemaCoder[4] to the 
>>>>>>>>>>>>>>>> Schema.LogicalType
>>>>>>>>>>>>>>>> interface[5] - both essentially have three attributes: a base 
>>>>>>>>>>>>>>>> type, and two
>>>>>>>>>>>>>>>> functions for converting to/from that base type. The only 
>>>>>>>>>>>>>>>> difference is for
>>>>>>>>>>>>>>>> SchemaCoder that base type must be a Row so it can be 
>>>>>>>>>>>>>>>> represented by a
>>>>>>>>>>>>>>>> Schema alone, while LogicalType can have any base type that 
>>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>> represented by FieldType, including a Row.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is not true actually. SchemaCoder can have any base
>>>>>>>>>>>>>>> type, that's why (in Java) it's SchemaCoder<T>. This is why 
>>>>>>>>>>>>>>> PCollection<T>
>>>>>>>>>>>>>>> can have a schema, even if T is not Row.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure I effectively communicated what I meant - When I
>>>>>>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was 
>>>>>>>>>>>>>> referring to
>>>>>>>>>>>>>> the base FieldType, whose coder we use for this type. I meant 
>>>>>>>>>>>>>> "base type"
>>>>>>>>>>>>>> to be analogous to LogicalType's `getBaseType`, or what Kenn is 
>>>>>>>>>>>>>> suggesting
>>>>>>>>>>>>>> we call "representation" in the portable beam schemas doc. To 
>>>>>>>>>>>>>> define some
>>>>>>>>>>>>>> terms from my original message:
>>>>>>>>>>>>>> base type = an instance of FieldType, crucially this is
>>>>>>>>>>>>>> something that we have a coder for (be it VarIntCoder, 
>>>>>>>>>>>>>> Utf8Coder, RowCoder,
>>>>>>>>>>>>>> ...)
>>>>>>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java
>>>>>>>>>>>>>> class (or something analogous in the other SDKs) that we may or 
>>>>>>>>>>>>>> may not
>>>>>>>>>>>>>> have a coder for. It's possible to define functions for 
>>>>>>>>>>>>>> converting
>>>>>>>>>>>>>> instances of the language type to/from the base type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was just trying to make the case that SchemaCoder is really
>>>>>>>>>>>>>> a special case of LogicalType, where `getBaseType` always 
>>>>>>>>>>>>>> returns a Row
>>>>>>>>>>>>>> with the stored Schema.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, I think  I got that point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you propose what the protos would look like in this case?
>>>>>>>>>>>>> Right now LogicalType does not contain the to/from conversion 
>>>>>>>>>>>>> functions in
>>>>>>>>>>>>> the proto. Do you think we'll need to add these in?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to
>>>>>>>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations 
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say 
>>>>>>>>>>>>>> we should
>>>>>>>>>>>>>> or shouldn't do this, just using it to illustrate my point):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> class SchemaCoder<T> extends CustomCoder<T> implements
>>>>>>>>>>>>>> Schema.LogicalType<T, Row> {
>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public FieldType getBaseType() {
>>>>>>>>>>>>>>     return FieldType.row(getSchema());
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public Row toBaseType(T input) {
>>>>>>>>>>>>>>     return this.toRowFunction.apply(input);
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public T toInputType(Row base) {
>>>>>>>>>>>>>>     return this.fromRowFunction.apply(base);
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think it may make sense to fully embrace this duality, by
>>>>>>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and 
>>>>>>>>>>>>>>>> renaming it to
>>>>>>>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK 
>>>>>>>>>>>>>>>> schema-aware
>>>>>>>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a 
>>>>>>>>>>>>>>>> Row base
>>>>>>>>>>>>>>>> type. Perhaps some of the current schema logic could also be
>>>>>>>>>>>>>>>> applied more
>>>>>>>>>>>>>>>> generally to any logical type  - for example, to provide type 
>>>>>>>>>>>>>>>> coercion for
>>>>>>>>>>>>>>>> logical types with a base type other than Row, like int64 and 
>>>>>>>>>>>>>>>> a timestamp
>>>>>>>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. 
>>>>>>>>>>>>>>>> And having a
>>>>>>>>>>>>>>>> portable representation that represents those (non Row backed) 
>>>>>>>>>>>>>>>> logical
>>>>>>>>>>>>>>>> types with some URN would also allow us to pass them to other 
>>>>>>>>>>>>>>>> languages
>>>>>>>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use 
>>>>>>>>>>>>>>>> SchemaCoder.
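A hedged sketch of such a non-Row-backed logical type, reusing the LogicalType message proposed earlier in this thread (the URN is invented):

```proto
# Illustrative only: a millis-backed timestamp whose representation is a
# plain INT64, so no Row wrapper is needed to pass it between SDKs.
logical_type {
  logical_urn: "beam:logical_type:millis_instant:v1"
  representation { atomic_type: INT64 }
}
```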
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the actual overlap here is between the to/from
>>>>>>>>>>>>>>> functions in SchemaCoder (which is what allows SchemaCoder<T> 
>>>>>>>>>>>>>>> where T !=
>>>>>>>>>>>>>>> Row) and the equivalent functionality in LogicalType. However 
>>>>>>>>>>>>>>> making all of
>>>>>>>>>>>>>>> schemas simply just a logical type feels a bit awkward and 
>>>>>>>>>>>>>>> circular to me.
>>>>>>>>>>>>>>> Maybe we should refactor that part out into a 
>>>>>>>>>>>>>>> LogicalTypeConversion proto,
>>>>>>>>>>>>>>> and reference that from both LogicalType and from SchemaCoder?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> LogicalType is already potentially circular though. A schema
>>>>>>>>>>>>>> can have a field with a logical type, and that logical type can 
>>>>>>>>>>>>>> have a base
>>>>>>>>>>>>>> type of Row with a field with a logical type (and on and on...). 
>>>>>>>>>>>>>> To me it
>>>>>>>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is 
>>>>>>>>>>>>>> just a special
>>>>>>>>>>>>>> case of this concept.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Something like the LogicalTypeConversion proto would
>>>>>>>>>>>>>> definitely be an improvement, but I would still prefer just 
>>>>>>>>>>>>>> using a
>>>>>>>>>>>>>> top-level logical type :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've added a section to the doc [6] to propose this
>>>>>>>>>>>>>>>> alternative in the context of the portable representation but 
>>>>>>>>>>>>>>>> I wanted to
>>>>>>>>>>>>>>>> bring it up here as well to solicit feedback.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>>>>>>>>>>>>>> [4]
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>>>>>>>>>>>>>> [5]
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>>>>>>>>>>>>>> [6]
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <
>>>>>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ah thanks! I added some language there.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *From: *Kenneth Knowles <k...@apache.org>
>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *From: *Brian Hulette <bhule...@google.com>
>>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>>>>>>>>>>>>>> *To: * <dev@beam.apache.org>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam
>>>>>>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason 
>>>>>>>>>>>>>>>>>>> not to do this was
>>>>>>>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam 
>>>>>>>>>>>>>>>>>>> schemas. But
>>>>>>>>>>>>>>>>>>> given that large iterables aren't currently implemented, 
>>>>>>>>>>>>>>>>>>> beam schemas look
>>>>>>>>>>>>>>>>>>> very similar to arrow schemas.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow
>>>>>>>>>>>>>>>>>>> schemas where possible, and maybe even copy them outright. 
>>>>>>>>>>>>>>>>>>> Arrow already
>>>>>>>>>>>>>>>>>>> has a portable (flatbuffers) schema representation [2], and 
>>>>>>>>>>>>>>>>>>> implementations
>>>>>>>>>>>>>>>>>>> for it in many languages that we may be able to re-use as 
>>>>>>>>>>>>>>>>>>> we bring schemas
>>>>>>>>>>>>>>>>>>> to more SDKs (the project has Python and Go 
>>>>>>>>>>>>>>>>>>> implementations). There are a
>>>>>>>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for 
>>>>>>>>>>>>>>>>>>> the format and
>>>>>>>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or 
>>>>>>>>>>>>>>>>>>> not they are
>>>>>>>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness 
>>>>>>>>>>>>>>>>>>> field), but if you
>>>>>>>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to 
>>>>>>>>>>>>>>>>>>> the beam proto
>>>>>>>>>>>>>>>>>>> spec.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out
>>>>>>>>>>>>>>>>>> what the differences are and why, and conversely what the 
>>>>>>>>>>>>>>>>>> interop
>>>>>>>>>>>>>>>>>> opportunities may be. Such sections are some of my favorite 
>>>>>>>>>>>>>>>>>> sections of
>>>>>>>>>>>>>>>>>> design docs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *From: *Robert Bradshaw <rober...@google.com>
>>>>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> From: Reuven Lax <re...@google.com>
>>>>>>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>>>>>>>>>>>>>> To: dev
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > Also in the future we might be able to do
>>>>>>>>>>>>>>>>>>>> optimizations at the runner level if at the portability 
>>>>>>>>>>>>>>>>>>>> layer we understood
>>>>>>>>>>>>>>>>>>>> schemas instead of just raw coders. This could be things 
>>>>>>>>>>>>>>>>>>>> like only parsing
>>>>>>>>>>>>>>>>>>>> a subset of a row (if we know only a few fields are 
>>>>>>>>>>>>>>>>>>>> accessed) or using a
>>>>>>>>>>>>>>>>>>>> columnar data structure like Arrow to encode batches of 
>>>>>>>>>>>>>>>>>>>> rows across
>>>>>>>>>>>>>>>>>>>> portability. This doesn't affect data semantics of course, 
>>>>>>>>>>>>>>>>>>>> but having a
>>>>>>>>>>>>>>>>>>>> richer, more-expressive type system opens up other 
>>>>>>>>>>>>>>>>>>>> opportunities.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we
>>>>>>>>>>>>>>>>>>>> understood to designate
>>>>>>>>>>>>>>>>>>>> the type(s), right?
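[Editor's note: the subset-parsing idea above can be sketched in a few lines of Python. This is illustrative only: the fixed-width layout below is hypothetical and is not Beam's actual RowCoder encoding (which uses var-ints and a nullability bitmap). The point is just that once a runner understands the schema, it can skip fields it does not need instead of decoding the whole row.]

```python
import struct

# Hypothetical fixed-width row layout: fields encoded in schema order.
# "<" prefixes force standard sizes with no alignment padding.
SCHEMA = [("id", "<q"), ("score", "<d"), ("flags", "<i")]  # struct codes

def decode_fields(buf, wanted):
    """Decode only the requested fields, skipping over the rest."""
    out, offset = {}, 0
    for name, fmt in SCHEMA:
        size = struct.calcsize(fmt)
        if name in wanted:
            (out[name],) = struct.unpack_from(fmt, buf, offset)
        offset += size  # skip without parsing
    return out

row = struct.pack("<qdi", 7, 0.5, 3)
print(decode_fields(row, {"score"}))  # {'score': 0.5}
```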
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <
>>>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the
>>>>>>>>>>>>>>>>>>>> space of Coders with
>>>>>>>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to
>>>>>>>>>>>>>>>>>>>> materialize to something
>>>>>>>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing
>>>>>>>>>>>>>>>>>>>> something big here...)
>>>>>>>>>>>>>>>>>>>> >> This may make a backwards-compatible transition
>>>>>>>>>>>>>>>>>>>> easier. (SDK-side, the
>>>>>>>>>>>>>>>>>>>> >> ability to reason about and operate on such types is
>>>>>>>>>>>>>>>>>>>> of course much
>>>>>>>>>>>>>>>>>>>> >> richer than anything Coders offer right now.)
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> From: Reuven Lax <re...@google.com>
>>>>>>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>>>>>>>>>>>>>> >> To: dev
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no
>>>>>>>>>>>>>>>>>>>> coders. We could define the entire model on top of 
>>>>>>>>>>>>>>>>>>>> schemas. Today's "Coder"
>>>>>>>>>>>>>>>>>>>> is completely equivalent to a single-field schema with a 
>>>>>>>>>>>>>>>>>>>> logical-type field
>>>>>>>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you 
>>>>>>>>>>>>>>>>>>>> aren't forced to
>>>>>>>>>>>>>>>>>>>> serialize into bytes).
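[Editor's note: the equivalence claimed above, that any coder is a single-field schema with a logical-type field, can be sketched as follows. The class and helper names, and the URN, are illustrative, not Beam's actual API.]

```python
# A logical type pairs a representation type with conversions to/from it.
class LogicalType:
    def __init__(self, urn, representation, to_repr, from_repr):
        self.urn = urn
        self.representation = representation  # e.g. "bytes"
        self.to_repr = to_repr       # language type -> representation
        self.from_repr = from_repr   # representation -> language type

def coder_as_logical_type(urn, encode, decode):
    """Lift a byte-oriented coder into the schema world: its field's
    representation is bytes, and the coder supplies the conversions."""
    return LogicalType(urn, "bytes", encode, decode)

# Example: a UTF-8 string coder viewed as a logical type over bytes.
utf8 = coder_as_logical_type(
    "beam:coder:string_utf8:v1",
    lambda s: s.encode("utf-8"),
    lambda b: b.decode("utf-8"))

print(utf8.from_repr(utf8.to_repr("schemas")))  # schemas
```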
>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort
>>>>>>>>>>>>>>>>>>>> that would be involved in such a change, I think the 
>>>>>>>>>>>>>>>>>>>> practical decision
>>>>>>>>>>>>>>>>>>>> should be for schemas and coders to coexist for the time 
>>>>>>>>>>>>>>>>>>>> being. However
>>>>>>>>>>>>>>>>>>>> when we start planning Beam 3.0, deprecating coders is 
>>>>>>>>>>>>>>>>>>>> something I would
>>>>>>>>>>>>>>>>>>>> like to suggest.
>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <
>>>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>>>>>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>>>>>>>>>>>>>> >> >> To: dev
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because
>>>>>>>>>>>>>>>>>>>> I can be more compact.
>>>>>>>>>>>>>>>>>>>> >> >> >
>>>>>>>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges
>>>>>>>>>>>>>>>>>>>> this needs a design doc with goals and alternatives. It is 
>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>> extraordinarily consequential model change. So in the 
>>>>>>>>>>>>>>>>>>>> spirit of doing the
>>>>>>>>>>>>>>>>>>>> work / bias towards action, I created a quick draft at
>>>>>>>>>>>>>>>>>>>> https://s.apache.org/beam-schemas and added everyone
>>>>>>>>>>>>>>>>>>>> on this thread as editors. I am still in the process of 
>>>>>>>>>>>>>>>>>>>> writing this to
>>>>>>>>>>>>>>>>>>>> match the thread.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use
>>>>>>>>>>>>>>>>>>>> logical types to represent nanos the same way Java and 
>>>>>>>>>>>>>>>>>>>> proto do.
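[Editor's note: a minimal sketch of such a nanosecond-precision timestamp logical type, whose representation is two integer fields (seconds, nanos) mirroring google.protobuf.Timestamp. The names here are illustrative, not Beam's actual types.]

```python
from typing import NamedTuple, Tuple

class TimestampNanos(NamedTuple):
    seconds: int  # epoch seconds (int64 in the schema representation)
    nanos: int    # 0..999_999_999 (int32 in the schema representation)

def to_representation(ts: TimestampNanos) -> Tuple[int, int]:
    """Language type -> schema representation (a two-field row)."""
    return (ts.seconds, ts.nanos)

def from_representation(rep: Tuple[int, int]) -> TimestampNanos:
    """Schema representation -> language type."""
    seconds, nanos = rep
    return TimestampNanos(seconds, nanos)

ts = TimestampNanos(1559900000, 123456789)
print(from_representation(to_representation(ts)) == ts)  # True
```

An SDK that does not know this logical type can still read and write the (seconds, nanos) representation, which is the key difference from an opaque coder.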
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value
>>>>>>>>>>>>>>>>>>>> in supporting
>>>>>>>>>>>>>>>>>>>> >> >> multiple timestamp resolutions is high enough to
>>>>>>>>>>>>>>>>>>>> outweigh the cost.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domain of values
>>>>>>>>>>>>>>>>>>>> for these types are different. For a language with one 
>>>>>>>>>>>>>>>>>>>> "int" or "number"
>>>>>>>>>>>>>>>>>>>> type, that's another domain of values.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> What is the value in having different domains? If
>>>>>>>>>>>>>>>>>>>> your data has a
>>>>>>>>>>>>>>>>>>>> >> >> natural domain, chances are it doesn't line up
>>>>>>>>>>>>>>>>>>>> exactly with one of
>>>>>>>>>>>>>>>>>>>> >> >> these. I guess it's for languages whose types
>>>>>>>>>>>>>>>>>>>> have specific domains?
>>>>>>>>>>>>>>>>>>>> >> >> (There's also compactness in representation,
>>>>>>>>>>>>>>>>>>>> encoded and in-memory,
>>>>>>>>>>>>>>>>>>>> >> >> though I'm not sure that's high.)
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the
>>>>>>>>>>>>>>>>>>>> ability to take this path is paramount. So tying it 
>>>>>>>>>>>>>>>>>>>> directly to a
>>>>>>>>>>>>>>>>>>>> row-oriented coder seems counterproductive.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> I don't think Coders are necessarily
>>>>>>>>>>>>>>>>>>>> row-oriented. They are, however,
>>>>>>>>>>>>>>>>>>>> >> >> bytes-oriented. (Perhaps they need not be.) There
>>>>>>>>>>>>>>>>>>>> seems to be a lot of
>>>>>>>>>>>>>>>>>>>> >> >> overlap between what Coders express in terms of
>>>>>>>>>>>>>>>>>>>> element typing
>>>>>>>>>>>>>>>>>>>> >> >> information and what Schemas express, and I'd
>>>>>>>>>>>>>>>>>>>> rather have one concept
>>>>>>>>>>>>>>>>>>>> >> >> if possible. Or have a clear division of
>>>>>>>>>>>>>>>>>>>> responsibilities.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *Multimap*: what does it add over an
>>>>>>>>>>>>>>>>>>>> array-valued map or large-iterable-valued map? (honest 
>>>>>>>>>>>>>>>>>>>> question, not
>>>>>>>>>>>>>>>>>>>> rhetorical)
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Multimap has a different notion of what it means
>>>>>>>>>>>>>>>>>>>> to contain a value,
>>>>>>>>>>>>>>>>>>>> >> >> can handle (unordered) unions of non-disjoint
>>>>>>>>>>>>>>>>>>>> keys, etc. Maybe this
>>>>>>>>>>>>>>>>>>>> >> >> isn't worth a new primitive type.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> > *URN/enum for type names*: I see the case for
>>>>>>>>>>>>>>>>>>>> both. The core types are fundamental enough they should 
>>>>>>>>>>>>>>>>>>>> never really change
>>>>>>>>>>>>>>>>>>>> - after all, proto, thrift, avro, arrow, have addressed 
>>>>>>>>>>>>>>>>>>>> this (not to
>>>>>>>>>>>>>>>>>>>> mention most programming languages). Maybe additions once 
>>>>>>>>>>>>>>>>>>>> every few years.
>>>>>>>>>>>>>>>>>>>> I prefer the smallest intersection of these schema 
>>>>>>>>>>>>>>>>>>>> languages. A oneof is
>>>>>>>>>>>>>>>>>>>> more clear, while URN emphasizes the similarity of 
>>>>>>>>>>>>>>>>>>>> built-in and logical
>>>>>>>>>>>>>>>>>>>> types.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level
>>>>>>>>>>>>>>>>>>>> primitive/logical
>>>>>>>>>>>>>>>>>>>> >> >> type in any of these other systems? I have a bias
>>>>>>>>>>>>>>>>>>>> towards all types
>>>>>>>>>>>>>>>>>>>> >> >> being on the same footing unless there is
>>>>>>>>>>>>>>>>>>>> compelling reason to divide
>>>>>>>>>>>>>>>>>>>> things into primitive/user-defined ones.
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> Here it seems like the most essential value of
>>>>>>>>>>>>>>>>>>>> the primitive type set
>>>>>>>>>>>>>>>>>>>> >> >> is to describe the underlying representation, for
>>>>>>>>>>>>>>>>>>>> encoding elements in
>>>>>>>>>>>>>>>>>>>> >> >> a variety of ways (notably columnar, but also
>>>>>>>>>>>>>>>>>>>> interfacing with other
>>>>>>>>>>>>>>>>>>>> >> >> external systems like IOs). Perhaps, rather than
>>>>>>>>>>>>>>>>>>>> the previous
>>>>>>>>>>>>>>>>>>>> suggestion of making everything a logical type over
>>>>>>>>>>>>>>>>>>>> bytes, this could be made
>>>>>>>>>>>>>>>>>>>> >> >> clear by still making everything a logical type,
>>>>>>>>>>>>>>>>>>>> but renaming
>>>>>>>>>>>>>>>>>>>> >> >> "TypeName" to Representation. There would be URNs
>>>>>>>>>>>>>>>>>>>> (typically with
>>>>>>>>>>>>>>>>>>>> >> >> empty payloads) for the various primitive types
>>>>>>>>>>>>>>>>>>>> (whose mapping to
>>>>>>>>>>>>>>>>>>>> >> >> their representations would be the identity).
>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>> >> >> - Robert
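[Editor's note: the "everything is a logical type" framing above could look roughly like the following Python sketch. All names and URNs here are made up for illustration; this is not an actual Beam proto.]

```python
class FieldType:
    def __init__(self, urn, representation=None, payload=b""):
        self.urn = urn
        # Primitive types represent themselves (identity mapping);
        # logical types point at another FieldType that describes how
        # they are encoded.
        self.representation = representation if representation else self
        self.payload = payload

INT64 = FieldType("beam:type:int64:v1")  # identity representation
BYTES = FieldType("beam:type:bytes:v1")

# An SDK-specific alias is just another URN over an existing representation;
# the payload sketched here stands in for serialized to/from functions.
java_alias = FieldType("beam:logical:javasdk:v1",
                       representation=BYTES,
                       payload=b"<serialized class + to/from fns>")

def base_representation(t):
    """Chase representations down to a self-representing primitive, which
    is what a non-understanding SDK would fall back to."""
    while t.representation is not t:
        t = t.representation
    return t

print(base_representation(java_alias).urn)  # beam:type:bytes:v1
```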
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>