If we want to have a pipeline-level registry, we could add it to Components [1]:

message Components {
  ...
  map<string, LogicalType> logical_types;
}

And in FieldType, reference the logical types by id:

oneof field_type {
  AtomicType atomic_type;
  ArrayType array_type;
  ...
  string logical_type_id; // was LogicalType logical_type;
}

I'm not sure I like this idea though. The reason we started discussing a "registry" was to separate the SDK-specific bits from the representation type, and this doesn't accomplish that; it just de-dupes logical types used across the pipeline. I think instead I'd rather just come back to the message we have now in the doc, used directly in FieldType's oneof:

message LogicalType {
  FieldType representation = 1;
  string logical_urn = 2;
  bytes logical_payload = 3;
}

We can have a URN for SDK-specific types (user type aliases), like "beam:logical:javasdk", and the logical_payload could itself be a protobuf with attributes of 1) a serialized class and 2/3) to/from functions. For truly portable types it would instead have a well-known URN and optionally a logical_payload with some agreed-upon representation of parameters.

It seems like maybe SdkFunctionSpec/Environment should be used for this somehow, but I can't find a good example of this in the Runner API to use as a model. What we're trying to accomplish is basically the same as Java custom coders vs. standard coders, but that is accomplished with a magic "javasdk" URN, as I suggested here, not with Environment [2,3]. There is a "TODO: standardize such things" where that URN is defined; is it possible that Environment is that standard and just hasn't been utilized for custom coders yet?
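For concreteness, a payload for that "beam:logical:javasdk" URN might look something like the sketch below. This is purely illustrative; the message and field names are made up, not part of any agreed proto, and the function encoding is left as opaque bytes:

```protobuf
syntax = "proto3";

// Hypothetical payload for logical_urn = "beam:logical:javasdk".
// All names here are illustrative placeholders, not a settled design.
message JavaSdkLogicalTypePayload {
  // 1) The serialized Java class of the user type.
  bytes serialized_class = 1;
  // 2/3) UDFs converting between the user type and its representation,
  // encoded however the Java SDK serializes functions today.
  bytes to_representation_fn = 2;
  bytes from_representation_fn = 3;
}
```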
Brian

[1] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
[2] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
[3] https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121

On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette <bhule...@google.com> wrote: > Yeah that's what I meant. It does seem reasonable to scope any > registry by pipeline and not by PCollection. Then it seems we would want > the entire LogicalType (including the `FieldType representation` field) as > the value type, and not just LogicalTypeConversion. Otherwise we're > separating the representations from the conversions, and duplicating the > representations. You did say a "registry of logical types", so maybe that > is what you meant. > > Brian > > On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax <re...@google.com> wrote: > >> >> >> On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <bhule...@google.com> wrote: >> >>> >>> >>> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <re...@google.com> wrote: >>> >>>> >>>> >>>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <bhule...@google.com> >>>> wrote: >>>> >>>>> > It has to go into the proto somewhere (since that's the only way >>>>> the SDK can get it), but I'm not sure they should be considered integral >>>>> parts of the type. >>>>> Are you just advocating for an approach where any SDK-specific >>>>> information is stored outside of the Schema message itself so that Schema >>>>> really does just represent the type? That seems reasonable to me, and >>>>> alleviates my concerns about how this applies to columnar encodings a bit >>>>> as well. >>>>> >>>> >>>> Yes, that's exactly what I'm advocating. 
>>>> >>>> >>>>> >>>>> We could lift all of the LogicalTypeConversion messages out of the >>>>> Schema and the LogicalType like this: >>>>> >>>>> message SchemaCoder { >>>>> Schema schema = 1; >>>>> LogicalTypeConversion root_conversion = 2; >>>>> map<string, LogicalTypeConversion> attribute_conversions = 3; // >>>>> only necessary for user type aliases, portable logical types by definition >>>>> have nothing SDK-specific >>>>> } >>>>> >>>> >>>> I'm not sure what the map is for? I think we have status quo without it. >>>> >>> >>> My intention was that the SDK-specific information (to/from functions) >>> for any nested fields that are themselves user type aliases would be stored >>> in this map. That was the motivation for my next question: if we don't >>> allow user types to be nested within other user types we may not need it. >>> >> >> Oh, is this meant to contain the ids of all the logical types in this >> schema? If so I don't think SchemaCoder is the right place for this. Any >> "registry" of logical types should be global to the pipeline, not scoped to >> a single PCollection IMO. >> >> >>> I may be missing your meaning - but I think we currently only have >>> status quo without this map in the Java SDK because Schema.LogicalType is >>> just an interface that must be implemented. It's appropriate for just >>> portable logical types, not user-type aliases. Note I've adopted Kenn's >>> terminology where portable logical type is a type that can be identified by >>> just a URN and maybe some parameters, while a user type alias needs some >>> SDK specific information, like a class and to/from UDFs. >>> >>> >>>> >>>>> I think a critical question (that has implications for the above >>>>> proposal) is how/if the two different concepts Kenn mentioned are allowed >>>>> to nest. 
For example, you could argue it's redundant to have a user type >>>>> alias that has a Row representation with a field that is itself a user >>>>> type >>>>> alias, because instead you could just have a single top-level type alias >>>>> with to/from functions that pack and unpack the entire hierarchy. On the >>>>> other hand, I think it does make sense for a user type alias or a truly >>>>> portable logical type to have a field that is itself a truly portable >>>>> logical type (e.g. a user type alias or portable type with a DateTime). >>>>> >>>>> I've been assuming that user-type aliases could be nested, but should >>>>> we disallow that? Or should we go the other way and require that logical >>>>> types define at most one "level"? >>>>> >>>> >>>> No I think it's useful to allow things to be nested (though of course >>>> the nesting must terminate). >>>> >>> >>>> >>>>> >>>>> Brian >>>>> >>>>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <k...@apache.org> >>>>> wrote: >>>>> >>>>>> >>>>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <re...@google.com> wrote: >>>>>> >>>>>>> So I feel a bit leery about making the to/from functions a >>>>>>> fundamental part of the portability representation. In my mind, that is >>>>>>> very tied to a specific SDK/language. A SDK (say the Java SDK) wants to >>>>>>> allow users to use a wide variety of native types with schemas, and >>>>>>> under >>>>>>> the covers uses the to/from functions to implement that. However from >>>>>>> the >>>>>>> portable Beam perspective, the schema itself should be the real "type" >>>>>>> of >>>>>>> the PCollection; the to/from methods are simply a way that a particular >>>>>>> SDK >>>>>>> makes schemas easier to use. It has to go into the proto somewhere >>>>>>> (since >>>>>>> that's the only way the SDK can get it), but I'm not sure they should be >>>>>>> considered integral parts of the type. 
>>>>>>> >>>>>> >>>>>> On the doc in a couple places this distinction was made: >>>>>> >>>>>> * For truly portable logical types, no instructions for the SDK are >>>>>> needed. Instead, they require: >>>>>> - URN: a standardized identifier any SDK can recognize >>>>>> - A spec: what is the universe of values in this type? >>>>>> - A representation: how is it represented in built-in types? This >>>>>> is how SDKs who do not know/care about the URN will process it >>>>>> - (optional): SDKs choose preferred SDK-specific types to embed >>>>>> the values in. SDKs have to know about the URN and choose for themselves. >>>>>> >>>>>> * For user-level type aliases, written as convenience by the user in >>>>>> their pipeline, what Java schemas have today: >>>>>> - to/from UDFs: the code is SDK-specific >>>>>> - some representation of the intended type (like java class): also >>>>>> SDK specific >>>>>> - a representation >>>>>> - any "id" is just like other ids in the pipeline, just avoiding >>>>>> duplicating the proto >>>>>> - Luke points out that nesting these can give multiple SDKs a hint >>>>>> >>>>>> In my mind the remaining complexity is whether or not we need to be >>>>>> able to move between the two. Composite PTransforms, for example, do have >>>>>> fluidity between being strictly user-defined versus portable URN+payload. >>>>>> But it requires lots of engineering, namely the current work on expansion >>>>>> service. >>>>>> >>>>>> Kenn >>>>>> >>>>>> >>>>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <bhule...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Ah I see, I didn't realize that. Then I suppose we'll need to/from >>>>>>>> functions somewhere in the logical type conversion to preserve the >>>>>>>> current >>>>>>>> behavior. >>>>>>>> >>>>>>>> I'm still a little hesitant to make these functions an explicit >>>>>>>> part of LogicalTypeConversion for another reason. 
Down the road, >>>>>>>> schemas >>>>>>>> could give us an avenue to use a batched columnar format (presumably >>>>>>>> arrow, >>>>>>>> but of course others are possible). By making to/from an explicit part >>>>>>>> of >>>>>>>> logical types we add some element-wise logic to a schema representation >>>>>>>> that's otherwise ambivalent to element-wise vs. batched encodings. >>>>>>>> >>>>>>>> I suppose you could make an argument that to/from are only for >>>>>>>> custom types. There will also be some set of well-known types >>>>>>>> identified >>>>>>>> only by URN and some parameters, which could easily be translated to a >>>>>>>> columnar format. We could just not support custom types fully if we >>>>>>>> add a >>>>>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions >>>>>>>> when/if we get there. >>>>>>>> >>>>>>>> What about something like this that makes the two different types >>>>>>>> of logical types explicit? >>>>>>>> >>>>>>>> // Describes a logical type and how to convert between it and its >>>>>>>> representation (e.g. Row). >>>>>>>> message LogicalTypeConversion { >>>>>>>> oneof conversion { >>>>>>>> Standard standard = 1; >>>>>>>> Custom custom = 2; >>>>>>>> } >>>>>>>> >>>>>>>> message Standard { >>>>>>>> string urn = 1; >>>>>>>> repeated string args = 2; // could also be a map >>>>>>>> } >>>>>>>> >>>>>>>> message Custom { >>>>>>>> FunctionSpec(?) toRepresentation = 1; >>>>>>>> FunctionSpec(?) fromRepresentation = 2; >>>>>>>> bytes type = 3; // e.g. serialized class for Java >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> And LogicalType and Schema become: >>>>>>>> >>>>>>>> message LogicalType { >>>>>>>> FieldType representation = 1; >>>>>>>> LogicalTypeConversion conversion = 2; >>>>>>>> } >>>>>>>> >>>>>>>> message Schema { >>>>>>>> ... 
>>>>>>>> repeated Field fields = 1; >>>>>>>> LogicalTypeConversion conversion = 2; // implied that >>>>>>>> representation is Row >>>>>>>> } >>>>>>>> >>>>>>>> Brian >>>>>>>> >>>>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <re...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Keep in mind that right now the SchemaRegistry is only assumed to >>>>>>>>> exist at graph-construction time, not at execution time; all >>>>>>>>> information in >>>>>>>>> the schema registry is embedded in the SchemaCoder, which is the only >>>>>>>>> thing >>>>>>>>> we keep around when the pipeline is actually running. We could look >>>>>>>>> into >>>>>>>>> changing this, but it would potentially be a very big change, and I do >>>>>>>>> think we should start getting users actively using schemas soon. >>>>>>>>> >>>>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <bhule...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> > Can you propose what the protos would look like in this case? >>>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>>> functions in >>>>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>>>> >>>>>>>>>> Maybe. Right now the proposed LogicalType message is pretty >>>>>>>>>> simple/generic: >>>>>>>>>> message LogicalType { >>>>>>>>>> FieldType representation = 1; >>>>>>>>>> string logical_urn = 2; >>>>>>>>>> bytes logical_payload = 3; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> If we keep just logical_urn and logical_payload, the >>>>>>>>>> logical_payload could itself be a protobuf with attributes of 1) a >>>>>>>>>> serialized class and 2/3) to/from functions. Or, alternatively, we >>>>>>>>>> could >>>>>>>>>> have a generalization of the SchemaRegistry for logical types. >>>>>>>>>> Implementations for standard types and user-defined types would be >>>>>>>>>> registered by URN, and the SDK could look them up given just a URN. >>>>>>>>>> I put a >>>>>>>>>> brief section about this alternative in the doc last week [1]. 
What I >>>>>>>>>> suggested there included removing the logical_payload field, which is >>>>>>>>>> probably overkill. The critical piece is just relying on a registry >>>>>>>>>> in the >>>>>>>>>> SDK to look up types and to/from functions rather than storing them >>>>>>>>>> in the >>>>>>>>>> portable schema itself. >>>>>>>>>> >>>>>>>>>> I kind of like keeping the LogicalType message generic for now, >>>>>>>>>> since it gives us a way to try out these various approaches, but >>>>>>>>>> maybe >>>>>>>>>> that's just a cop out. >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy >>>>>>>>>> >>>>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette < >>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette < >>>>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base >>>>>>>>>>>>>> type of Row and we should think about that. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm a little concerned that the current proposals for a >>>>>>>>>>>>>> portable representation don't actually fully represent Schemas. >>>>>>>>>>>>>> It seems to >>>>>>>>>>>>>> me that the current java-only Schemas are made up three concepts >>>>>>>>>>>>>> that are >>>>>>>>>>>>>> intertwined: >>>>>>>>>>>>>> (a) The Java SDK specific code for schema inference, type >>>>>>>>>>>>>> coercion, and "schema-aware" transforms. >>>>>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a >>>>>>>>>>>>>> particular Schema[3]. 
>>>>>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a >>>>>>>>>>>>>> particular schema, and functions for converting Rows with that >>>>>>>>>>>>>> schema >>>>>>>>>>>>>> to/from a Java type T. Those functions and the RowCoder are then >>>>>>>>>>>>>> composed >>>>>>>>>>>>>> to provide a Coder for the type T. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> RowCoder is currently just an internal implementation detail; >>>>>>>>>>>>> it can be eliminated. SchemaCoder is the only thing that >>>>>>>>>>>>> determines a >>>>>>>>>>>>> schema today. >>>>>>>>>>>>> >>>>>>>>>>>> Why not keep it around? I think it would make sense to have a >>>>>>>>>>>> RowCoder implementation in every SDK, as well as something like >>>>>>>>>>>> SchemaCoder >>>>>>>>>>>> that defines a conversion from that SDK's "Row" to the language >>>>>>>>>>>> type. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> The point is that from a programmer's perspective, there is >>>>>>>>>>> nothing much special about Row. Any type can have a schema, and the >>>>>>>>>>> only >>>>>>>>>>> special thing about Row is that it's always guaranteed to exist. >>>>>>>>>>> From that >>>>>>>>>>> standpoint, Row is nearly an implementation detail. Today RowCoder >>>>>>>>>>> is never >>>>>>>>>>> set on _any_ PCollection, it's literally just used as a helper >>>>>>>>>>> library, so >>>>>>>>>>> there's no real need for it to exist as a "Coder." >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> We're not concerned with (a) at this time since that's >>>>>>>>>>>>>> specific to the SDK, not the interface between them. My >>>>>>>>>>>>>> understanding is we >>>>>>>>>>>>>> just want to define a portable representation for (b) and/or (c). >>>>>>>>>>>>>> >>>>>>>>>>>>>> What has been discussed so far is really just a portable >>>>>>>>>>>>>> representation for (b), the RowCoder, since the discussion is >>>>>>>>>>>>>> only around >>>>>>>>>>>>>> how to represent the schema itself and not the to/from functions. 
>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Correct. The to/from functions are actually related to a). One >>>>>>>>>>>>> of the big goals of schemas was that users should not be forced >>>>>>>>>>>>> to operate >>>>>>>>>>>>> on rows to get schemas. A user can create >>>>>>>>>>>>> PCollection<MyRandomType> and as >>>>>>>>>>>>> long as the SDK can infer a schema from MyRandomType, the user >>>>>>>>>>>>> never needs >>>>>>>>>>>>> to even see a Row object. The to/fromRow functions are what make >>>>>>>>>>>>> this work >>>>>>>>>>>>> today. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> One of the points I'd like to make is that this type coercion >>>>>>>>>>>> is a useful concept on its own, separate from schemas. It's >>>>>>>>>>>> especially >>>>>>>>>>>> useful for a type that has a schema and is encoded by RowCoder >>>>>>>>>>>> since that >>>>>>>>>>>> can represent many more types, but the type coercion doesn't have >>>>>>>>>>>> to be >>>>>>>>>>>> tied to just schemas and RowCoder. We could also do type coercion >>>>>>>>>>>> for types >>>>>>>>>>>> that are effectively wrappers around an integer or a string. It >>>>>>>>>>>> could just >>>>>>>>>>>> be a general way to map language types to base types (i.e. types >>>>>>>>>>>> that we >>>>>>>>>>>> have a coder for). Then it just becomes a general framework for >>>>>>>>>>>> extending >>>>>>>>>>>> coders to represent more language types. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Let's not tie those conversations. Maybe a similar concept will >>>>>>>>>>> hold true for general coders (or we might decide to get rid of >>>>>>>>>>> coders in >>>>>>>>>>> favor of schemas, in which case that becomes moot), but I don't >>>>>>>>>>> think we >>>>>>>>>>> should prematurely generalize. 
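The general coercion idea discussed just above (mapping a wrapper language type to a base type we already have a coder for, without involving Row at all) can be sketched in plain Java. This is an illustration only, not Beam's API; every name here is made up:

```java
import java.util.function.Function;

public class CoercionSketch {
    // A minimal stand-in for a logical-type conversion: a pair of to/from
    // functions between a language type T and a base type B we can encode.
    static final class Conversion<T, B> {
        final Function<T, B> toBase;
        final Function<B, T> fromBase;
        Conversion(Function<T, B> toBase, Function<B, T> fromBase) {
            this.toBase = toBase;
            this.fromBase = fromBase;
        }
    }

    // A hypothetical user type that is effectively a wrapper around a long.
    static final class UserId {
        final long value;
        UserId(long value) { this.value = value; }
    }

    public static void main(String[] args) {
        // Encode UserId as a plain long -- a base type we already have a
        // coder for -- with no schema or Row involved.
        Conversion<UserId, Long> conv =
            new Conversion<>(id -> id.value, UserId::new);

        long encoded = conv.toBase.apply(new UserId(42L));
        UserId decoded = conv.fromBase.apply(encoded);
        System.out.println(encoded + " " + decoded.value); // prints "42 42"
    }
}
```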
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> One of the outstanding questions for that schema >>>>>>>>>>>>>> representation is how to represent logical types, which may or >>>>>>>>>>>>>> may not have >>>>>>>>>>>>>> some language type in each SDK (the canonical example being a >>>>>>>>>>>>>> timestamp type with seconds and nanos and java.time.Instant). I >>>>>>>>>>>>>> think this >>>>>>>>>>>>>> question is critically important, because (c), the SchemaCoder, >>>>>>>>>>>>>> is actually >>>>>>>>>>>>>> *defining a logical type* with a language type T in the Java >>>>>>>>>>>>>> SDK. This >>>>>>>>>>>>>> becomes clear when you compare SchemaCoder[4] to the >>>>>>>>>>>>>> Schema.LogicalType >>>>>>>>>>>>>> interface[5] - both essentially have three attributes: a base >>>>>>>>>>>>>> type, and two >>>>>>>>>>>>>> functions for converting to/from that base type. The only >>>>>>>>>>>>>> difference is for >>>>>>>>>>>>>> SchemaCoder that base type must be a Row so it can be >>>>>>>>>>>>>> represented by a >>>>>>>>>>>>>> Schema alone, while LogicalType can have any base type that can >>>>>>>>>>>>>> be >>>>>>>>>>>>>> represented by FieldType, including a Row. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> This is not true actually. SchemaCoder can have any base type, >>>>>>>>>>>>> that's why (in Java) it's SchemaCoder<T>. This is why >>>>>>>>>>>>> PCollection<T> can >>>>>>>>>>>>> have a schema, even if T is not Row. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> I'm not sure I effectively communicated what I meant - When I >>>>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was >>>>>>>>>>>> referring to >>>>>>>>>>>> the base FieldType, whose coder we use for this type. I meant >>>>>>>>>>>> "base type" >>>>>>>>>>>> to be analogous to LogicalType's `getBaseType`, or what Kenn is >>>>>>>>>>>> suggesting >>>>>>>>>>>> we call "representation" in the portable beam schemas doc. 
To >>>>>>>>>>>> define some >>>>>>>>>>>> terms from my original message: >>>>>>>>>>>> base type = an instance of FieldType, crucially this is >>>>>>>>>>>> something that we have a coder for (be it VarIntCoder, Utf8Coder, >>>>>>>>>>>> RowCoder, >>>>>>>>>>>> ...) >>>>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java >>>>>>>>>>>> class (or something analogous in the other SDKs) that we may or >>>>>>>>>>>> may not >>>>>>>>>>>> have a coder for. It's possible to define functions for converting >>>>>>>>>>>> instances of the language type to/from the base type. >>>>>>>>>>>> >>>>>>>>>>>> I was just trying to make the case that SchemaCoder is really a >>>>>>>>>>>> special case of LogicalType, where `getBaseType` always returns a >>>>>>>>>>>> Row with >>>>>>>>>>>> the stored Schema. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Yeah, I think I got that point. >>>>>>>>>>> >>>>>>>>>>> Can you propose what the protos would look like in this case? >>>>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>>>> functions in >>>>>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to >>>>>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations of >>>>>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say we >>>>>>>>>>>> should >>>>>>>>>>>> or shouldn't do this, just using it to illustrate my point): >>>>>>>>>>>> >>>>>>>>>>>> class SchemaCoder<T> extends CustomCoder<T> implements >>>>>>>>>>>> Schema.LogicalType<T, Row> { >>>>>>>>>>>> ... 
>>>>>>>>>>>> >>>>>>>>>>>> @Override >>>>>>>>>>>> FieldType getBaseType() { >>>>>>>>>>>> return FieldType.row(getSchema()); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> @Override >>>>>>>>>>>> public Row toBaseType(T input) { >>>>>>>>>>>> return this.toRowFunction.apply(input); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> @Override >>>>>>>>>>>> public T toInputType(Row base) { >>>>>>>>>>>> return this.fromRowFunction.apply(base); >>>>>>>>>>>> } >>>>>>>>>>>> ... >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> I think it may make sense to fully embrace this duality, by >>>>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and >>>>>>>>>>>>>> renaming it to >>>>>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK >>>>>>>>>>>>>> schema-aware >>>>>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a >>>>>>>>>>>>>> Row base >>>>>>>>>>>>>> type. Perhaps some of the current schema logic could also be >>>>>>>>>>>>>> applied more >>>>>>>>>>>>>> generally to any logical type - for example, to provide type >>>>>>>>>>>>>> coercion for >>>>>>>>>>>>>> logical types with a base type other than Row, like int64 and a >>>>>>>>>>>>>> timestamp >>>>>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. >>>>>>>>>>>>>> And having a >>>>>>>>>>>>>> portable representation that represents those (non Row backed) >>>>>>>>>>>>>> logical >>>>>>>>>>>>>> types with some URN would also allow us to pass them to other >>>>>>>>>>>>>> languages >>>>>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use >>>>>>>>>>>>>> SchemaCoder. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I think the actual overlap here is between the to/from >>>>>>>>>>>>> functions in SchemaCoder (which is what allows SchemaCoder<T> >>>>>>>>>>>>> where T != >>>>>>>>>>>>> Row) and the equivalent functionality in LogicalType. However >>>>>>>>>>>>> making all of >>>>>>>>>>>>> schemas simply just a logical type feels a bit awkward and >>>>>>>>>>>>> circular to me. 
>>>>>>>>>>>>> Maybe we should refactor that part out into a >>>>>>>>>>>>> LogicalTypeConversion proto, >>>>>>>>>>>>> and reference that from both LogicalType and from SchemaCoder? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> LogicalType is already potentially circular though. A schema >>>>>>>>>>>> can have a field with a logical type, and that logical type can >>>>>>>>>>>> have a base >>>>>>>>>>>> type of Row with a field with a logical type (and on and on...). >>>>>>>>>>>> To me it >>>>>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is just >>>>>>>>>>>> a special >>>>>>>>>>>> case of this concept. >>>>>>>>>>>> >>>>>>>>>>>> Something like the LogicalTypeConversion proto would definitely >>>>>>>>>>>> be an improvement, but I would still prefer just using a top-level >>>>>>>>>>>> logical >>>>>>>>>>>> type :) >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I've added a section to the doc [6] to propose this >>>>>>>>>>>>>> alternative in the context of the portable representation but I >>>>>>>>>>>>>> wanted to >>>>>>>>>>>>>> bring it up here as well to solicit feedback. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41 >>>>>>>>>>>>>> [2] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59 >>>>>>>>>>>>>> [3] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48 >>>>>>>>>>>>>> [4] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33 >>>>>>>>>>>>>> [5] >>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489 >>>>>>>>>>>>>> [6] >>>>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette < >>>>>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Ah thanks! I added some language there. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *From: *Kenneth Knowles <k...@apache.org> >>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM >>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> *From: *Brian Hulette <bhule...@google.com> >>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM >>>>>>>>>>>>>>>> *To: * <dev@beam.apache.org> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam >>>>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason >>>>>>>>>>>>>>>>> not to this was >>>>>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam >>>>>>>>>>>>>>>>> schemas. But >>>>>>>>>>>>>>>>> given that large iterables aren't currently implemented, beam >>>>>>>>>>>>>>>>> schemas look >>>>>>>>>>>>>>>>> very similar to arrow schemas. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow >>>>>>>>>>>>>>>>> schemas where possible, and maybe even copy them outright. >>>>>>>>>>>>>>>>> Arrow already >>>>>>>>>>>>>>>>> has a portable (flatbuffers) schema representation [2], and >>>>>>>>>>>>>>>>> implementations >>>>>>>>>>>>>>>>> for it in many languages that we may be able to re-use as we >>>>>>>>>>>>>>>>> bring schemas >>>>>>>>>>>>>>>>> to more SDKs (the project has Python and Go implementations). >>>>>>>>>>>>>>>>> There are a >>>>>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for the >>>>>>>>>>>>>>>>> format and >>>>>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or >>>>>>>>>>>>>>>>> not they are >>>>>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness field), >>>>>>>>>>>>>>>>> but if you >>>>>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to >>>>>>>>>>>>>>>>> the beam proto >>>>>>>>>>>>>>>>> spec. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out what >>>>>>>>>>>>>>>> the differences are and why, and conversely what the interop >>>>>>>>>>>>>>>> opportunities >>>>>>>>>>>>>>>> may be. Such sections are some of my favorite sections of >>>>>>>>>>>>>>>> design docs. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Kenn >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Brian >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E >>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> *From: *Robert Bradshaw <rober...@google.com> >>>>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM >>>>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> From: Reuven Lax <re...@google.com> >>>>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM >>>>>>>>>>>>>>>>>> To: dev >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > Also in the future we might be able to do optimizations >>>>>>>>>>>>>>>>>> at the runner level if at the portability layer we >>>>>>>>>>>>>>>>>> understood schemes >>>>>>>>>>>>>>>>>> instead of just raw coders. This could be things like only >>>>>>>>>>>>>>>>>> parsing a subset >>>>>>>>>>>>>>>>>> of a row (if we know only a few fields are accessed) or >>>>>>>>>>>>>>>>>> using a columnar >>>>>>>>>>>>>>>>>> data structure like Arrow to encode batches of rows across >>>>>>>>>>>>>>>>>> portability. >>>>>>>>>>>>>>>>>> This doesn't affect data semantics of course, but having a >>>>>>>>>>>>>>>>>> richer, >>>>>>>>>>>>>>>>>> more-expressive type system opens up other opportunities. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we understood >>>>>>>>>>>>>>>>>> to designate >>>>>>>>>>>>>>>>>> the type(s), right? 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw < >>>>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space >>>>>>>>>>>>>>>>>> of Coders with >>>>>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to >>>>>>>>>>>>>>>>>> materialize to something >>>>>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing >>>>>>>>>>>>>>>>>> something big here...) >>>>>>>>>>>>>>>>>> >> This may make a backwards-compatible transition >>>>>>>>>>>>>>>>>> easier. (SDK-side, the >>>>>>>>>>>>>>>>>> >> ability to reason about and operate on such types is >>>>>>>>>>>>>>>>>> of course much >>>>>>>>>>>>>>>>>> >> richer than anything Coders offer right now.) >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> From: Reuven Lax <re...@google.com> >>>>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM >>>>>>>>>>>>>>>>>> >> To: dev >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no >>>>>>>>>>>>>>>>>> coders. We could define the entire model on top of schemas. >>>>>>>>>>>>>>>>>> Today's "Coder" >>>>>>>>>>>>>>>>>> is completely equivalent to a single-field schema with a >>>>>>>>>>>>>>>>>> logical-type field >>>>>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you >>>>>>>>>>>>>>>>>> aren't forced to >>>>>>>>>>>>>>>>>> serialize into bytes). >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort that >>>>>>>>>>>>>>>>>> would be involved in such a change, I think the practical >>>>>>>>>>>>>>>>>> decision should >>>>>>>>>>>>>>>>>> be for schemas and coders to coexist for the time being. >>>>>>>>>>>>>>>>>> However when we >>>>>>>>>>>>>>>>>> start planning Beam 3.0, deprecating coders is something I >>>>>>>>>>>>>>>>>> would like to >>>>>>>>>>>>>>>>>> suggest. 
>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw < >>>>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org> >>>>>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM >>>>>>>>>>>>>>>>>> >> >> To: dev >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because I >>>>>>>>>>>>>>>>>> can be more compact. >>>>>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges >>>>>>>>>>>>>>>>>> this needs a design doc with goals and alternatives. It is an >>>>>>>>>>>>>>>>>> extraordinarily consequential model change. So in the spirit >>>>>>>>>>>>>>>>>> of doing the >>>>>>>>>>>>>>>>>> work / bias towards action, I created a quick draft at >>>>>>>>>>>>>>>>>> https://s.apache.org/beam-schemas and added everyone on >>>>>>>>>>>>>>>>>> this thread as editors. I am still in the process of writing >>>>>>>>>>>>>>>>>> this to match >>>>>>>>>>>>>>>>>> the thread. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> Thanks! Added some comments there. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use >>>>>>>>>>>>>>>>>> logcial types to represent nanos the same way Java and proto >>>>>>>>>>>>>>>>>> do. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value >>>>>>>>>>>>>>>>>> in supporting >>>>>>>>>>>>>>>>>> >> >> multiple timestamp resolutions is high enough to >>>>>>>>>>>>>>>>>> outweigh the cost. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domain of values >>>>>>>>>>>>>>>>>> for these types are different. For a language with one "int" >>>>>>>>>>>>>>>>>> or "number" >>>>>>>>>>>>>>>>>> type, that's another domain of values. >>>>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>>>> >> >> What is the value in having different domains? 
>> >> If your data has a natural domain, chances are it doesn't line up
>> >> exactly with one of these. I guess it's for languages whose types
>> >> have specific domains? (There's also compactness in representation,
>> >> encoded and in-memory, though I'm not sure that's high.)
>> >>
>> >> > *Columnar/Arrow*: making sure we unlock the ability to take this
>> >> > path is paramount. So tying it directly to a row-oriented coder
>> >> > seems counterproductive.
>> >>
>> >> I don't think Coders are necessarily row-oriented. They are,
>> >> however, bytes-oriented. (Perhaps they need not be.) There seems to
>> >> be a lot of overlap between what Coders express in terms of element
>> >> typing information and what Schemas express, and I'd rather have one
>> >> concept if possible. Or have a clear division of responsibilities.
>> >>
>> >> > *Multimap*: what does it add over an array-valued map or
>> >> > large-iterable-valued map? (honest question, not rhetorical)
>> >>
>> >> Multimap has a different notion of what it means to contain a
>> >> value, can handle (unordered) unions of non-disjoint keys, etc.
>> >> Maybe this isn't worth a new primitive type.
>> >>
>> >> > *URN/enum for type names*: I see the case for both.
>> >> > The core types are fundamental enough that they should never
>> >> > really change - after all, proto, thrift, avro, and arrow have
>> >> > addressed this (not to mention most programming languages). Maybe
>> >> > additions once every few years. I prefer the smallest intersection
>> >> > of these schema languages. A oneof is more clear, while a URN
>> >> > emphasizes the similarity of built-in and logical types.
>> >>
>> >> Hmm... Do we have any examples of the multi-level primitive/logical
>> >> type in any of these other systems? I have a bias towards all types
>> >> being on the same footing unless there is a compelling reason to
>> >> divide things into primitive/user-defined ones.
>> >>
>> >> Here it seems like the most essential value of the primitive type
>> >> set is to describe the underlying representation, for encoding
>> >> elements in a variety of ways (notably columnar, but also
>> >> interfacing with other external systems like IOs). Perhaps, rather
>> >> than the previous suggestion of making everything a logical type of
>> >> bytes, this could be made clear by still making everything a logical
>> >> type, but renaming "TypeName" to "Representation". There would be
>> >> URNs (typically with empty payloads) for the various primitive types
>> >> (whose mapping to their representations would be the identity).
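Robert's "everything is a logical type" suggestion might look something like the following. Again, an illustrative sketch only (message names, field numbers, and the example URN are assumptions, not actual Beam proto): every FieldType carries a URN plus a Representation, and the primitive types are simply well-known URNs with empty payloads whose mapping to their representation is the identity.

```proto
// Illustrative sketch; not the actual Beam proto.
syntax = "proto3";

enum AtomicType {
  UNSPECIFIED = 0;
  INT64 = 1;
  DOUBLE = 2;
  STRING = 3;
  BYTES = 4;
}

message Representation {  // what was "TypeName"
  oneof representation {
    AtomicType atomic_type = 1;
    ArrayType array_type = 2;
    RowType row_type = 3;
  }
}

message ArrayType {
  FieldType element_type = 1;
}

message RowType {
  repeated Field fields = 1;
}

message Field {
  string name = 1;
  FieldType type = 2;
}

message FieldType {
  // For a primitive, e.g. urn = "beam:type:int64:v1" with an empty
  // payload and representation { atomic_type: INT64 }: the mapping
  // from the type to its representation is the identity.
  string urn = 1;
  bytes payload = 2;
  Representation representation = 3;
}
```

This puts built-in and user-defined types on the same footing, with the Representation field carrying exactly enough information for a runner to encode values without understanding the URN.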
>> >>
>> >> - Robert
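As a worked example of Kenn's point earlier in the thread about representing nanosecond timestamps via logical types: the logical type can sit over a plain two-field row representation, much as google.protobuf.Timestamp and java.time.Instant split an instant into epoch seconds plus subsecond nanos. The URN and field names in this text-format sketch are hypothetical.

```textproto
# Hypothetical sketch (URN and field names are assumptions): a
# nanosecond instant as a logical type over a two-field row, so any
# runner can encode it without understanding the logical URN.
logical_type {
  logical_urn: "beam:logical_type:nanos_instant:v1"
  representation {
    row_type {
      fields { name: "seconds" type { atomic_type: INT64 } }  # epoch seconds
      fields { name: "nanos" type { atomic_type: INT32 } }    # 0..999999999
    }
  }
}
```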