Re: [DISCUSS] Portability representation of schemas

Reuven Lax Mon, 03 Jun 2019 22:04:31 -0700

On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <[email protected]> wrote:


> > It has to go into the proto somewhere (since that's the only way the
> SDK can get it), but I'm not sure they should be considered integral parts
> of the type.
> Are you just advocating for an approach where any SDK-specific information
> is stored outside of the Schema message itself so that Schema really does
> just represent the type? That seems reasonable to me, and alleviates my
> concerns about how this applies to columnar encodings a bit as well.
>

Yes, that's exactly what I'm advocating.


>
> We could lift all of the LogicalTypeConversion messages out of the Schema
> and the LogicalType like this:
>
> message SchemaCoder {
>   Schema schema = 1;
>   LogicalTypeConversion root_conversion = 2;
>   map<string, LogicalTypeConversion> attribute_conversions = 3; // only
> necessary for user type aliases, portable logical types by definition have
> nothing SDK-specific
> }
>

I'm not sure what the map is for? I think we have status quo wihtout it.


>
> I think a critical question (that has implications for the above proposal)
> is how/if the two different concepts Kenn mentioned are allowed to nest.
> For example, you could argue it's redundant to have a user type alias that
> has a Row representation with a field that is itself a user type alias,
> because instead you could just have a single top-level type alias
> with to/from functions that pack and unpack the entire hierarchy. On the
> other hand, I think it does make sense for a user type alias or a truly
> portable logical type to have a field that is itself a truly portable
> logical type (e.g. a user type alias or portable type with a DateTime).
>
> I've been assuming that user-type aliases could be nested, but should we
> disallow that? Or should we go the other way and require that logical types
> define at most one "level"?
>

No I think it's useful to allow things to be nested (though of course the
nesting must terminate).


>
> Brian
>
> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <[email protected]> wrote:
>
>>
>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <[email protected]> wrote:
>>
>>> So I feel a bit leery about making the to/from functions a fundamental
>>> part of the portability representation. In my mind, that is very tied to a
>>> specific SDK/language. A SDK (say the Java SDK) wants to allow users to use
>>> a wide variety of native types with schemas, and under the covers uses the
>>> to/from functions to implement that. However from the portable Beam
>>> perspective, the schema itself should be the real "type" of the
>>> PCollection; the to/from methods are simply a way that a particular SDK
>>> makes schemas easier to use. It has to go into the proto somewhere (since
>>> that's the only way the SDK can get it), but I'm not sure they should be
>>> considered integral parts of the type.
>>>
>>
>> On the doc in a couple places this distinction was made:
>>
>> * For truly portable logical types, no instructions for the SDK are
>> needed. Instead, they require:
>>    - URN: a standardized identifier any SDK can recognize
>>    - A spec: what is the universe of values in this type?
>>    - A representation: how is it represented in built-in types? This is
>> how SDKs who do not know/care about the URN will process it
>>    - (optional): SDKs choose preferred SDK-specific types to embed the
>> values in. SDKs have to know about the URN and choose for themselves.
>>
>> *For user-level type aliases, written as convenience by the user in their
>> pipeline, what Java schemas have today:
>>    - to/from UDFs: the code is SDK-specific
>>    - some representation of the intended type (like java class): also SDK
>> specific
>>    - a representation
>>    - any "id" is just like other ids in the pipeline, just avoiding
>> duplicating the proto
>>    - Luke points out that nesting these can give multiple SDKs a hint
>>
>> In my mind the remaining complexity is whether or not we need to be able
>> to move between the two. Composite PTransforms, for example, do have
>> fluidity between being strictly user-defined versus portable URN+payload.
>> But it requires lots of engineering, namely the current work on expansion
>> service.
>>
>> Kenn
>>
>>
>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <[email protected]>
>>> wrote:
>>>
>>>> Ah I see, I didn't realize that. Then I suppose we'll need to/from
>>>> functions somewhere in the logical type conversion to preserve the current
>>>> behavior.
>>>>
>>>> I'm still a little hesitant to make these functions an explicit part of
>>>> LogicalTypeConversion for another reason. Down the road, schemas could give
>>>> us an avenue to use a batched columnar format (presumably arrow, but of
>>>> course others are possible). By making to/from an explicit part of logical
>>>> types we add some element-wise logic to a schema representation that's
>>>> otherwise ambivalent to element-wise vs. batched encodings.
>>>>
>>>> I suppose you could make an argument that to/from are only for
>>>> custom types. There will also be some set of well-known types identified
>>>> only by URN and some parameters, which could easily be translated to a
>>>> columnar format. We could just not support custom types fully if we add a
>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions
>>>> when/if we get there.
>>>>
>>>> What about something like this that makes the two different types of
>>>> logical types explicit?
>>>>
>>>> // Describes a logical type and how to convert between it and its
>>>> representation (e.g. Row).
>>>> message LogicalTypeConversion {
>>>>   oneof conversion {
>>>>     message Standard standard = 1;
>>>>     message Custom custom = 2;
>>>>   }
>>>>
>>>>   message Standard {
>>>>     String urn = 1;
>>>>     repeated string args = 2; // could also be a map
>>>>   }
>>>>
>>>>   message Custom {
>>>>     FunctionSpec(?) toRepresentation = 1;
>>>>     FunctionSpec(?) fromRepresentation = 2;
>>>>     bytes type = 3; // e.g. serialized class for Java
>>>>   }
>>>> }
>>>>
>>>> And LogicalType and Schema become:
>>>>
>>>> message LogicalType {
>>>>   FieldType representation = 1;
>>>>   LogicalTypeConversion conversion = 2;
>>>> }
>>>>
>>>> message Schema {
>>>>   ...
>>>>   repeated Field fields = 1;
>>>>   LogicalTypeConversion conversion = 2; // implied that representation
>>>> is Row
>>>> }
>>>>
>>>> Brian
>>>>
>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> Keep in mind that right now the SchemaRegistry is only assumed to
>>>>> exist at graph-construction time, not at execution time; all information 
>>>>> in
>>>>> the schema registry is embedded in the SchemaCoder, which is the only 
>>>>> thing
>>>>> we keep around when the pipeline is actually running. We could look into
>>>>> changing this, but it would potentially be a very big change, and I do
>>>>> think we should start getting users actively using schemas soon.
>>>>>
>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > Can you propose what the protos would look like in this case? Right
>>>>>> now LogicalType does not contain the to/from conversion functions in the
>>>>>> proto. Do you think we'll need to add these in?
>>>>>>
>>>>>> Maybe. Right now the proposed LogicalType message is pretty
>>>>>> simple/generic:
>>>>>> message LogicalType {
>>>>>>   FieldType representation = 1;
>>>>>>   string logical_urn = 2;
>>>>>>   bytes logical_payload = 3;
>>>>>> }
>>>>>>
>>>>>> If we keep just logical_urn and logical_payload, the logical_payload
>>>>>> could itself be a protobuf with attributes of 1) a serialized class and
>>>>>> 2/3) to/from functions. Or, alternatively, we could have a generalization
>>>>>> of the SchemaRegistry for logical types. Implementations for standard 
>>>>>> types
>>>>>> and user-defined types would be registered by URN, and the SDK could look
>>>>>> them up given just a URN. I put a brief section about this alternative in
>>>>>> the doc last week [1]. What I suggested there included removing the
>>>>>> logical_payload field, which is probably overkill. The critical piece is
>>>>>> just relying on a registry in the SDK to look up types and to/from
>>>>>> functions rather than storing them in the portable schema itself.
>>>>>>
>>>>>> I kind of like keeping the LogicalType message generic for now, since
>>>>>> it gives us a way to try out these various approaches, but maybe that's
>>>>>> just a cop out.
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>>>>>
>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base type
>>>>>>>>>> of Row and we should think about that.
>>>>>>>>>>
>>>>>>>>>> I'm a little concerned that the current proposals for a portable
>>>>>>>>>> representation don't actually fully represent Schemas. It seems to 
>>>>>>>>>> me that
>>>>>>>>>> the current java-only Schemas are made up three concepts that are
>>>>>>>>>> intertwined:
>>>>>>>>>> (a) The Java SDK specific code for schema inference, type
>>>>>>>>>> coercion, and "schema-aware" transforms.
>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a particular
>>>>>>>>>> Schema[3].
>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a particular schema,
>>>>>>>>>> and functions for converting Rows with that schema to/from a Java 
>>>>>>>>>> type T.
>>>>>>>>>> Those functions and the RowCoder are then composed to provider a 
>>>>>>>>>> Coder for
>>>>>>>>>> the type T.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> RowCoder is currently just an internal implementation detail, it
>>>>>>>>> can be eliminated. SchemaCoder is the only thing that determines a 
>>>>>>>>> schema
>>>>>>>>> today.
>>>>>>>>>
>>>>>>>> Why not keep it around? I think it would make sense to have a
>>>>>>>> RowCoder implementation in every SDK, as well as something like 
>>>>>>>> SchemaCoder
>>>>>>>> that defines a conversion from that SDK's "Row" to the language type.
>>>>>>>>
>>>>>>>
>>>>>>> The point is that from a programmer's perspective, there is nothing
>>>>>>> much special about Row. Any type can have a schema, and the only special
>>>>>>> thing about Row is that it's always guaranteed to exist. From that
>>>>>>> standpoint, Row is nearly an implementation detail. Today RowCoder is 
>>>>>>> never
>>>>>>> set on _any_ PCollection, it's literally just used as a helper library, 
>>>>>>> so
>>>>>>> there's no real need for it to exist as a "Coder."
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We're not concerned with (a) at this time since that's specific
>>>>>>>>>> to the SDK, not the interface between them. My understanding is we 
>>>>>>>>>> just
>>>>>>>>>> want to define a portable representation for (b) and/or (c).
>>>>>>>>>>
>>>>>>>>>> What has been discussed so far is really just a portable
>>>>>>>>>> representation for (b), the RowCoder, since the discussion is only 
>>>>>>>>>> around
>>>>>>>>>> how to represent the schema itself and not the to/from functions.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Correct. The to/from functions are actually related to a). One of
>>>>>>>>> the big goals of schemas was that users should not be forced to 
>>>>>>>>> operate on
>>>>>>>>> rows to get schemas. A user can create PCollection<MyRandomType> and 
>>>>>>>>> as
>>>>>>>>> long as the SDK can infer a schema from MyRandomType, the user never 
>>>>>>>>> needs
>>>>>>>>> to even see a Row object. The to/fromRow functions are what make this 
>>>>>>>>> work
>>>>>>>>> today.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> One of the points I'd like to make is that this type coercion is a
>>>>>>>> useful concept on it's own, separate from schemas. It's especially 
>>>>>>>> useful
>>>>>>>> for a type that has a schema and is encoded by RowCoder since that can
>>>>>>>> represent many more types, but the type coercion doesn't have to be 
>>>>>>>> tied to
>>>>>>>> just schemas and RowCoder. We could also do type coercion for types 
>>>>>>>> that
>>>>>>>> are effectively wrappers around an integer or a string. It could just 
>>>>>>>> be a
>>>>>>>> general way to map language types to base types (i.e. types that we 
>>>>>>>> have a
>>>>>>>> coder for). Then it just becomes a general framework for extending 
>>>>>>>> coders
>>>>>>>> to represent more language types.
>>>>>>>>
>>>>>>>
>>>>>>> Let's not tie those conversations. Maybe a similar concept will hold
>>>>>>> true for general coders (or we might decide to get rid of coders in 
>>>>>>> favor
>>>>>>> of schemas, in which case that becomes moot), but I don't think we 
>>>>>>> should
>>>>>>> prematurely generalize.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> One of the outstanding questions for that schema representation is
>>>>>>>>>> how to represent logical types, which may or may not have some 
>>>>>>>>>> language
>>>>>>>>>> type in each SDK (the canonical example being a timsetamp type with 
>>>>>>>>>> seconds
>>>>>>>>>> and nanos and java.time.Instant). I think this question is critically
>>>>>>>>>> important, because (c), the SchemaCoder, is actually *defining a 
>>>>>>>>>> logical
>>>>>>>>>> type* with a language type T in the Java SDK. This becomes clear 
>>>>>>>>>> when you
>>>>>>>>>> compare SchemaCoder[4] to the Schema.LogicalType interface[5] - both
>>>>>>>>>> essentially have three attributes: a base type, and two functions for
>>>>>>>>>> converting to/from that base type. The only difference is for 
>>>>>>>>>> SchemaCoder
>>>>>>>>>> that base type must be a Row so it can be represented by a Schema 
>>>>>>>>>> alone,
>>>>>>>>>> while LogicalType can have any base type that can be represented by
>>>>>>>>>> FieldType, including a Row.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is not true actually. SchemaCoder can have any base type,
>>>>>>>>> that's why (in Java) it's SchemaCoder<T>. This is why PCollection<T> 
>>>>>>>>> can
>>>>>>>>> have a schema, even if T is not Row.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I'm not sure I effectively communicated what I meant - When I said
>>>>>>>> SchemaCoder's "base type" I wasn't referring to T, I was referring to 
>>>>>>>> the
>>>>>>>> base FieldType, whose coder we use for this type. I meant "base type" 
>>>>>>>> to be
>>>>>>>> analogous to LogicalType's `getBaseType`, or what Kenn is suggesting we
>>>>>>>> call "representation" in the portable beam schemas doc. To define some
>>>>>>>> terms from my original message:
>>>>>>>> base type = an instance of FieldType, crucially this is something
>>>>>>>> that we have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, ...)
>>>>>>>> language type (or "T", "type T", "logical type") = Some Java class
>>>>>>>> (or something analogous in the other SDKs) that we may or may not have 
>>>>>>>> a
>>>>>>>> coder for. It's possible to define functions for converting instances 
>>>>>>>> of
>>>>>>>> the language type to/from the base type.
>>>>>>>>
>>>>>>>> I was just trying to make the case that SchemaCoder is really a
>>>>>>>> special case of LogicalType, where `getBaseType` always returns a Row 
>>>>>>>> with
>>>>>>>> the stored Schema.
>>>>>>>>
>>>>>>>
>>>>>>> Yeah, I think  I got that point.
>>>>>>>
>>>>>>> Can you propose what the protos would look like in this case? Right
>>>>>>> now LogicalType does not contain the to/from conversion functions in the
>>>>>>> proto. Do you think we'll need to add these in?
>>>>>>>
>>>>>>>
>>>>>>>> To make the point with code: SchemaCoder<T> can be made to
>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations of
>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say we 
>>>>>>>> should
>>>>>>>> or shouldn't do this, just using it illustrate my point):
>>>>>>>>
>>>>>>>> class SchemaCoder extends CustomCoder<T> implements
>>>>>>>> Schema.LogicalType<T, Row> {
>>>>>>>>   ...
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   FieldType getBaseType() {
>>>>>>>>     return FieldType.row(getSchema());
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public Row toBaseType() {
>>>>>>>>     return this.toRowFunction.apply(input);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public T toInputType(Row base) {
>>>>>>>>     return this.fromRowFunction.apply(base);
>>>>>>>>   }
>>>>>>>>   ...
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>>> I think it may make sense to fully embrace this duality, by
>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and renaming 
>>>>>>>>>> it to
>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware
>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a Row 
>>>>>>>>>> base
>>>>>>>>>> type. Perhaps some of the current schema logic could  alsobe applied 
>>>>>>>>>> more
>>>>>>>>>> generally to any logical type  - for example, to provide type 
>>>>>>>>>> coercion for
>>>>>>>>>> logical types with a base type other than Row, like int64 and a 
>>>>>>>>>> timestamp
>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. And 
>>>>>>>>>> having a
>>>>>>>>>> portable representation that represents those (non Row backed) 
>>>>>>>>>> logical
>>>>>>>>>> types with some URN would also allow us to pass them to other 
>>>>>>>>>> languages
>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use 
>>>>>>>>>> SchemaCoder.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think the actual overlap here is between the to/from functions
>>>>>>>>> in SchemaCoder (which is what allows SchemaCoder<T> where T != Row) 
>>>>>>>>> and the
>>>>>>>>> equivalent functionality in LogicalType. However making all of schemas
>>>>>>>>> simply just a logical type feels a bit awkward and circular to me. 
>>>>>>>>> Maybe we
>>>>>>>>> should refactor that part out into a LogicalTypeConversion proto, and
>>>>>>>>> reference that from both LogicalType and from SchemaCoder?
>>>>>>>>>
>>>>>>>>
>>>>>>>> LogicalType is already potentially circular though. A schema can
>>>>>>>> have a field with a logical type, and that logical type can have a base
>>>>>>>> type of Row with a field with a logical type (and on and on...). To me 
>>>>>>>> it
>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is just a 
>>>>>>>> special
>>>>>>>> case of this concept.
>>>>>>>>
>>>>>>>> Something like the LogicalTypeConversion proto would definitely be
>>>>>>>> an improvement, but I would still prefer just using a top-level logical
>>>>>>>> type :)
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've added a section to the doc [6] to propose this alternative in
>>>>>>>>>> the context of the portable representation but I wanted to bring it 
>>>>>>>>>> up here
>>>>>>>>>> as well to solicit feedback.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>>>>>>>> [3]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>>>>>>>> [4]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>>>>>>>> [5]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>>>>>>>> [6]
>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>>>>>>>
>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ah thanks! I added some language there.
>>>>>>>>>>>
>>>>>>>>>>> *From: *Kenneth Knowles <[email protected]>
>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>>>>>>>> *To: *dev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> *From: *Brian Hulette <[email protected]>
>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>>>>>>>> *To: * <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam
>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason not 
>>>>>>>>>>>>> to this was
>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam 
>>>>>>>>>>>>> schemas. But
>>>>>>>>>>>>> given that large iterables aren't currently implemented, beam 
>>>>>>>>>>>>> schemas look
>>>>>>>>>>>>> very similar to arrow schemas.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow schemas
>>>>>>>>>>>>> where possible, and maybe even copy them outright. Arrow already 
>>>>>>>>>>>>> has a
>>>>>>>>>>>>> portable (flatbuffers) schema representation [2], and 
>>>>>>>>>>>>> implementations for
>>>>>>>>>>>>> it in many languages that we may be able to re-use as we bring 
>>>>>>>>>>>>> schemas to
>>>>>>>>>>>>> more SDKs (the project has Python and Go implementations). There 
>>>>>>>>>>>>> are a
>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for the 
>>>>>>>>>>>>> format and
>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or not 
>>>>>>>>>>>>> they are
>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness field), but 
>>>>>>>>>>>>> if you
>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to the 
>>>>>>>>>>>>> beam proto
>>>>>>>>>>>>> spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out what the
>>>>>>>>>>>> differences are and why, and conversely what the interop 
>>>>>>>>>>>> opportunities may
>>>>>>>>>>>> be. Such sections are some of my favorite sections of design docs.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>>>>>>>
>>>>>>>>>>>>> *From: *Robert Bradshaw <[email protected]>
>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>>>>>>>> *To: *dev
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Reuven Lax <[email protected]>
>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>>>>>>>> To: dev
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Also in the future we might be able to do optimizations at
>>>>>>>>>>>>>> the runner level if at the portability layer we understood 
>>>>>>>>>>>>>> schemes instead
>>>>>>>>>>>>>> of just raw coders. This could be things like only parsing a 
>>>>>>>>>>>>>> subset of a
>>>>>>>>>>>>>> row (if we know only a few fields are accessed) or using a 
>>>>>>>>>>>>>> columnar data
>>>>>>>>>>>>>> structure like Arrow to encode batches of rows across 
>>>>>>>>>>>>>> portability. This
>>>>>>>>>>>>>> doesn't affect data semantics of course, but having a richer,
>>>>>>>>>>>>>> more-expressive type system opens up other opportunities.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But we could do all of that with a RowCoder we understood to
>>>>>>>>>>>>>> designate
>>>>>>>>>>>>>> the type(s), right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space of
>>>>>>>>>>>>>> Coders with
>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to materialize
>>>>>>>>>>>>>> to something
>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing something
>>>>>>>>>>>>>> big here...)
>>>>>>>>>>>>>> >> This may make a backwards-compatible transition easier.
>>>>>>>>>>>>>> (SDK-side, the
>>>>>>>>>>>>>> >> ability to reason about and operate on such types is of
>>>>>>>>>>>>>> course much
>>>>>>>>>>>>>> >> richer than anything Coders offer right now.)
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> From: Reuven Lax <[email protected]>
>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>>>>>>>> >> To: dev
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no coders. We
>>>>>>>>>>>>>> could define the entire model on top of schemas. Today's "Coder" 
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> completely equivalent to a single-field schema with a 
>>>>>>>>>>>>>> logical-type field
>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you aren't 
>>>>>>>>>>>>>> forced to
>>>>>>>>>>>>>> serialize into bytes).
>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort that
>>>>>>>>>>>>>> would be  involved in such a change, I think the practical 
>>>>>>>>>>>>>> decision should
>>>>>>>>>>>>>> be for schemas and coders to coexist for the time being. However 
>>>>>>>>>>>>>> when we
>>>>>>>>>>>>>> start planning Beam 3.0, deprecating coders is something I would 
>>>>>>>>>>>>>> like to
>>>>>>>>>>>>>> suggest.
>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <[email protected]>
>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>>>>>>>> >> >> To: dev
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because I can
>>>>>>>>>>>>>> be more compact.
>>>>>>>>>>>>>> >> >> >
>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges this
>>>>>>>>>>>>>> needs a design doc with goals and alternatives. It is an 
>>>>>>>>>>>>>> extraordinarily
>>>>>>>>>>>>>> consequential model change. So in the spirit of doing the work / 
>>>>>>>>>>>>>> bias
>>>>>>>>>>>>>> towards action, I created a quick draft at
>>>>>>>>>>>>>> https://s.apache.org/beam-schemas and added everyone on this
>>>>>>>>>>>>>> thread as editors. I am still in the process of writing this to 
>>>>>>>>>>>>>> match the
>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use logcial
>>>>>>>>>>>>>> types to represent nanos the same way Java and proto do.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value in
>>>>>>>>>>>>>> supporting
>>>>>>>>>>>>>> >> >> multiple timestamp resolutions is high enough to
>>>>>>>>>>>>>> outweigh the cost.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domain of values for
>>>>>>>>>>>>>> these types are different. For a language with one "int" or 
>>>>>>>>>>>>>> "number" type,
>>>>>>>>>>>>>> that's another domain of values.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> What is the value in having different domains? If your
>>>>>>>>>>>>>> data has a
>>>>>>>>>>>>>> >> >> natural domain, chances are it doesn't line up exactly
>>>>>>>>>>>>>> with one of
>>>>>>>>>>>>>> >> >> these. I guess it's for languages whose types have
>>>>>>>>>>>>>> specific domains?
>>>>>>>>>>>>>> >> >> (There's also compactness in representation, encoded
>>>>>>>>>>>>>> and in-memory,
>>>>>>>>>>>>>> >> >> though I'm not sure that's high.)
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability
>>>>>>>>>>>>>> to take this path is Paramount. So tying it directly to a 
>>>>>>>>>>>>>> row-oriented
>>>>>>>>>>>>>> coder seems counterproductive.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> I don't think Coders are necessarily row-oriented. They
>>>>>>>>>>>>>> are, however,
>>>>>>>>>>>>>> >> >> bytes-oriented. (Perhaps they need not be.) There seems
>>>>>>>>>>>>>> to be a lot of
>>>>>>>>>>>>>> >> >> overlap between what Coders express in terms of element
>>>>>>>>>>>>>> typing
>>>>>>>>>>>>>> >> >> information and what Schemas express, and I'd rather
>>>>>>>>>>>>>> have one concept
>>>>>>>>>>>>>> >> >> if possible. Or have a clear division of
>>>>>>>>>>>>>> responsibilities.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued map
>>>>>>>>>>>>>> or large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> Multimap has a different notion of what it means to
>>>>>>>>>>>>>> contain a value,
>>>>>>>>>>>>>> >> >> can handle (unordered) unions of non-disjoint keys,
>>>>>>>>>>>>>> etc. Maybe this
>>>>>>>>>>>>>> >> >> isn't worth a new primitive type.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both.
>>>>>>>>>>>>>> The core types are fundamental enough they should never really 
>>>>>>>>>>>>>> change -
>>>>>>>>>>>>>> after all, proto, thrift, avro, arrow, have addressed this (not 
>>>>>>>>>>>>>> to mention
>>>>>>>>>>>>>> most programming languages). Maybe additions once every few 
>>>>>>>>>>>>>> years. I prefer
>>>>>>>>>>>>>> the smallest intersection of these schema languages. A oneof is 
>>>>>>>>>>>>>> more clear,
>>>>>>>>>>>>>> while URN emphasizes the similarity of built-in and logical 
>>>>>>>>>>>>>> types.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level
>>>>>>>>>>>>>> primitive/logical
>>>>>>>>>>>>>> >> >> type in any of these other systems? I have a bias
>>>>>>>>>>>>>> towards all types
>>>>>>>>>>>>>> >> >> being on the same footing unless there is compelling
>>>>>>>>>>>>>> reason to divide
>>>>>>>>>>>>>> >> >> things into primitive/use-defined ones.
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> Here it seems like the most essential value of the
>>>>>>>>>>>>>> primitive type set
>>>>>>>>>>>>>> >> >> is to describe the underlying representation, for
>>>>>>>>>>>>>> encoding elements in
>>>>>>>>>>>>>> >> >> a variety of ways (notably columnar, but also
>>>>>>>>>>>>>> interfacing with other
>>>>>>>>>>>>>> >> >> external systems like IOs). Perhaps, rather than the
>>>>>>>>>>>>>> previous
>>>>>>>>>>>>>> >> >> suggestion of making everything a logical of bytes,
>>>>>>>>>>>>>> this could be made
>>>>>>>>>>>>>> >> >> clear by still making everything a logical type, but
>>>>>>>>>>>>>> renaming
>>>>>>>>>>>>>> >> >> "TypeName" to Representation. There would be URNs
>>>>>>>>>>>>>> (typically with
>>>>>>>>>>>>>> >> >> empty payloads) for the various primitive types (whose
>>>>>>>>>>>>>> mapping to
>>>>>>>>>>>>>> >> >> their representations would be the identity).
>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>> >> >> - Robert
>>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: [DISCUSS] Portability representation of schemas

Reply via email to