On Fri, May 24, 2019 at 11:42 AM Brian Hulette <[email protected]> wrote:

> *tl;dr:* SchemaCoder represents a logical type with a base type of Row
> and we should think about that.
>
> I'm a little concerned that the current proposals for a portable
> representation don't actually fully represent Schemas. It seems to me that
> the current Java-only Schemas are made up of three intertwined concepts:
> (a) The Java SDK specific code for schema inference, type coercion, and
> "schema-aware" transforms.
> (b) A RowCoder[1] that encodes Rows[2] which have a particular Schema[3].
> (c) A SchemaCoder[4] that has a RowCoder for a particular schema, and
> functions for converting Rows with that schema to/from a Java type T. Those
> functions and the RowCoder are then composed to provide a Coder for the
> type T (sketched below).
>
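
Roughly, the composition described in (c) looks like the following. This is a
simplified, illustrative sketch (the class and constructor are made up), not
the actual SchemaCoder source:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.beam.sdk.coders.Coder;
  import org.apache.beam.sdk.coders.CustomCoder;
  import org.apache.beam.sdk.transforms.SerializableFunction;
  import org.apache.beam.sdk.values.Row;

  // Sketch: a Coder<T> built by composing a coder for Rows of one particular
  // Schema (b) with the two conversion functions held by SchemaCoder (c).
  class RowBackedCoder<T> extends CustomCoder<T> {
    private final Coder<Row> rowCoder;                   // encodes Rows with that schema
    private final SerializableFunction<T, Row> toRow;    // T -> Row
    private final SerializableFunction<Row, T> fromRow;  // Row -> T

    RowBackedCoder(Coder<Row> rowCoder,
        SerializableFunction<T, Row> toRow, SerializableFunction<Row, T> fromRow) {
      this.rowCoder = rowCoder;
      this.toRow = toRow;
      this.fromRow = fromRow;
    }

    @Override
    public void encode(T value, OutputStream outStream) throws IOException {
      rowCoder.encode(toRow.apply(value), outStream);    // convert to Row, delegate
    }

    @Override
    public T decode(InputStream inStream) throws IOException {
      return fromRow.apply(rowCoder.decode(inStream));   // decode a Row, convert back
    }
  }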

RowCoder is currently just an internal implementation detail; it can be
eliminated. SchemaCoder is the only thing that determines a schema today.


>
> We're not concerned with (a) at this time since that's specific to the
> SDK, not the interface between them. My understanding is we just want to
> define a portable representation for (b) and/or (c).
>
> What has been discussed so far is really just a portable representation
> for (b), the RowCoder, since the discussion is only around how to represent
> the schema itself and not the to/from functions.
>

Correct. The to/from functions are actually related to (a). One of the big
goals of schemas was that users should not be forced to operate on rows to
get schemas. A user can create PCollection<MyRandomType> and as long as the
SDK can infer a schema from MyRandomType, the user never needs to even see
a Row object. The to/fromRow functions are what make this work today.
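
For example (a sketch; the Purchase type and its fields are made up for
illustration, but @DefaultSchema/JavaFieldSchema are the existing
annotation-based inference path in the Java SDK):

  import org.apache.beam.sdk.schemas.JavaFieldSchema;
  import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

  // A plain user type. The SDK infers a Schema from its public fields, so a
  // PCollection<Purchase> gets a SchemaCoder<Purchase> without the user ever
  // constructing or seeing a Row.
  @DefaultSchema(JavaFieldSchema.class)
  public class Purchase {
    public String userId;
    public double amount;
  }

Schema-aware transforms can then address fields of Purchase by name, and the
to/from functions inside SchemaCoder<Purchase> do the Row conversion behind
the scenes.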


> One of the outstanding questions for that schema representation is how to
> represent logical types, which may or may not have some language type in
> each SDK (the canonical example being a timestamp type with seconds and
> nanos and java.time.Instant). I think this question is critically
> important, because (c), the SchemaCoder, is actually *defining a logical
> type* with a language type T in the Java SDK. This becomes clear when you
> compare SchemaCoder[4] to the Schema.LogicalType interface[5] - both
> essentially have three attributes: a base type, and two functions for
> converting to/from that base type. The only difference is for SchemaCoder
> that base type must be a Row so it can be represented by a Schema alone,
> while LogicalType can have any base type that can be represented by
> FieldType, including a Row.
>
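
To make the parallel concrete, the two shapes being compared are roughly the
following (paraphrased for illustration, not the exact Beam declarations):

  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.transforms.SerializableFunction;
  import org.apache.beam.sdk.values.Row;

  // SchemaCoder<T>: the base representation is always a Row with a particular
  // Schema, plus two functions converting T to/from that base.
  class SchemaCoderShape<T> {
    Schema schema;                              // base representation: Row with this Schema
    SerializableFunction<T, Row> toRow;         // T -> base
    SerializableFunction<Row, T> fromRow;       // base -> T
  }

  // Schema.LogicalType<InputT, BaseT>: the base representation is any FieldType
  // (which may itself be a Row), plus the same two conversion functions.
  interface LogicalTypeShape<InputT, BaseT> {
    Schema.FieldType getBaseType();             // base representation: any FieldType
    BaseT toBaseType(InputT input);             // InputT -> base
    InputT toInputType(BaseT base);             // base -> InputT
  }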

This is not actually true. SchemaCoder can have any base type; that's why
(in Java) it's SchemaCoder<T>. This is why PCollection<T> can have a
schema, even if T is not Row.


> I think it may make sense to fully embrace this duality, by letting
> SchemaCoder have a baseType other than just Row and renaming it to
> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware
> transforms (a) would operate only on LogicalTypeCoders with a Row base
> type. Perhaps some of the current schema logic could also be applied more
> generally to any logical type - for example, to provide type coercion for
> logical types with a base type other than Row, like int64 and a timestamp
> class backed by millis, or fixed size bytes and a UUID class. And having a
> portable representation that represents those (non-Row-backed) logical
> types with some URN would also allow us to pass them to other languages
> without unnecessarily wrapping them in a Row in order to use SchemaCoder.
>
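
For instance, the UUID case might look something like this (hypothetical -
this is not an existing Beam logical type and the URN is made up; the method
names follow the LogicalType interface linked above):

  import java.nio.ByteBuffer;
  import java.util.UUID;
  import org.apache.beam.sdk.schemas.Schema;

  // Hypothetical logical type: a UUID represented as bytes rather than being
  // wrapped in a Row just to reuse SchemaCoder.
  class UuidLogicalType implements Schema.LogicalType<UUID, byte[]> {
    @Override
    public String getIdentifier() { return "beam:logical_type:uuid:v1"; }  // made-up URN

    @Override
    public Schema.FieldType getBaseType() { return Schema.FieldType.BYTES; }

    @Override
    public byte[] toBaseType(UUID uuid) {
      ByteBuffer buf = ByteBuffer.allocate(16);       // 16 bytes: two longs
      buf.putLong(uuid.getMostSignificantBits());
      buf.putLong(uuid.getLeastSignificantBits());
      return buf.array();
    }

    @Override
    public UUID toInputType(byte[] bytes) {
      ByteBuffer buf = ByteBuffer.wrap(bytes);
      return new UUID(buf.getLong(), buf.getLong());
    }
  }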

I think the actual overlap here is between the to/from functions in
SchemaCoder (which is what allows SchemaCoder<T> where T != Row) and the
equivalent functionality in LogicalType. However, making all of schemas
simply a logical type feels a bit awkward and circular to me. Maybe we
should refactor that part out into a LogicalTypeConversion proto, and
reference that from both LogicalType and from SchemaCoder?
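
Roughly, the factored-out piece would just be the conversion pair (hypothetical
names, sketched in Java rather than proto, purely to illustrate the idea):

  // Hypothetical: the to/from pair pulled out into its own object, so that both
  // SchemaCoder (base = Row) and LogicalType (base = any FieldType) could
  // reference one of these instead of each carrying its own copy of the concept.
  interface LogicalTypeConversion<InputT, BaseT> extends java.io.Serializable {
    BaseT toBase(InputT value);     // language type -> base representation
    InputT fromBase(BaseT value);   // base representation -> language type
  }

A portable equivalent would presumably be a proto message carrying a URN plus a
serialized payload for the conversion, referenced from both places.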


> I've added a section to the doc [6] to propose this alternative in the
> context of the portable representation but I wanted to bring it up here as
> well to solicit feedback.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
> [3]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
> [5]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
> [6]
> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>
> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <[email protected]> wrote:
>
>> Ah thanks! I added some language there.
>>
>> *From: *Kenneth Knowles <[email protected]>
>> *Date: *Thu, May 9, 2019 at 5:31 PM
>> *To: *dev
>>
>>
>>> *From: *Brian Hulette <[email protected]>
>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>> *To: * <[email protected]>
>>>
>>>> We briefly discussed using arrow schemas in place of beam schemas
>>>> entirely in an arrow thread [1]. The biggest reason not to do this was that we
>>>> wanted to have a type for large iterables in beam schemas. But given that
>>>> large iterables aren't currently implemented, beam schemas look very
>>>> similar to arrow schemas.
>>>>
>>>
>>>
>>>> I think it makes sense to take inspiration from arrow schemas where
>>>> possible, and maybe even copy them outright. Arrow already has a portable
>>>> (flatbuffers) schema representation [2], and implementations for it in many
>>>> languages that we may be able to re-use as we bring schemas to more SDKs
>>>> (the project has Python and Go implementations). There are a couple of
>>>> concepts in Arrow schemas that are specific for the format and wouldn't
>>>> make sense for us, (fields can indicate whether or not they are dictionary
>>>> encoded, and the schema has an endianness field), but if you drop those
>>>> concepts the arrow spec looks pretty similar to the beam proto spec.
>>>>
>>>
>>> FWIW I left a blank section in the doc for filling out what the
>>> differences are and why, and conversely what the interop opportunities may
>>> be. Such sections are some of my favorite sections of design docs.
>>>
>>> Kenn
>>>
>>>
>>>> Brian
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>
>>>> *From: *Robert Bradshaw <[email protected]>
>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>> *To: *dev
>>>>
>>>>> From: Reuven Lax <[email protected]>
>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>> To: dev
>>>>>
>>>>> > Also in the future we might be able to do optimizations at the
>>>>> runner level if at the portability layer we understood schemas instead of
>>>>> just raw coders. This could be things like only parsing a subset of a row
>>>>> (if we know only a few fields are accessed) or using a columnar data
>>>>> structure like Arrow to encode batches of rows across portability. This
>>>>> doesn't affect data semantics of course, but having a richer,
>>>>> more-expressive type system opens up other opportunities.
>>>>>
>>>>> But we could do all of that with a RowCoder we understood to designate
>>>>> the type(s), right?
>>>>>
>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> On the flip side, Schemas are equivalent to the space of Coders with
>>>>> >> the addition of a RowCoder and the ability to materialize to
>>>>> something
>>>>> >> other than bytes, right? (Perhaps I'm missing something big here...)
>>>>> >> This may make a backwards-compatible transition easier. (SDK-side,
>>>>> the
>>>>> >> ability to reason about and operate on such types is of course much
>>>>> >> richer than anything Coders offer right now.)
>>>>> >>
>>>>> >> From: Reuven Lax <[email protected]>
>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>> >> To: dev
>>>>> >>
>>>>> >> > FYI I can imagine a world in which we have no coders. We could
>>>>> define the entire model on top of schemas. Today's "Coder" is completely
>>>>> equivalent to a single-field schema with a logical-type field (actually 
>>>>> the
>>>>> latter is slightly more expressive as you aren't forced to serialize into
>>>>> bytes).
>>>>> >> >
>>>>> >> > Due to compatibility constraints and the effort that would be
>>>>> involved in such a change, I think the practical decision should be for
>>>>> schemas and coders to coexist for the time being. However when we start
>>>>> planning Beam 3.0, deprecating coders is something I would like to 
>>>>> suggest.
>>>>> >> >
>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <
>>>>> [email protected]> wrote:
>>>>> >> >>
>>>>> >> >> From: Kenneth Knowles <[email protected]>
>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>> >> >> To: dev
>>>>> >> >>
>>>>> >> >> > This is a huge development. Top posting because I can be more
>>>>> compact.
>>>>> >> >> >
>>>>> >> >> > I really think after the initial idea converges this needs a
>>>>> design doc with goals and alternatives. It is an extraordinarily
>>>>> consequential model change. So in the spirit of doing the work / bias
>>>>> towards action, I created a quick draft at
>>>>> https://s.apache.org/beam-schemas and added everyone on this thread
>>>>> as editors. I am still in the process of writing this to match the thread.
>>>>> >> >>
>>>>> >> >> Thanks! Added some comments there.
>>>>> >> >>
>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to
>>>>> represent nanos the same way Java and proto do.
>>>>> >> >>
>>>>> >> >> As per the other discussion, I'm unsure the value in supporting
>>>>> >> >> multiple timestamp resolutions is high enough to outweigh the
>>>>> cost.
>>>>> >> >>
>>>>> >> >> > *Why multiple int types?* The domains of values for these types
>>>>> are different. For a language with one "int" or "number" type, that's
>>>>> another domain of values.
>>>>> >> >>
>>>>> >> >> What is the value in having different domains? If your data has a
>>>>> >> >> natural domain, chances are it doesn't line up exactly with one
>>>>> of
>>>>> >> >> these. I guess it's for languages whose types have specific
>>>>> domains?
>>>>> >> >> (There's also compactness in representation, encoded and
>>>>> in-memory,
>>>>> >> >> though I'm not sure that's high.)
>>>>> >> >>
>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take
>>>>> this path is paramount. So tying it directly to a row-oriented coder seems
>>>>> counterproductive.
>>>>> >> >>
>>>>> >> >> I don't think Coders are necessarily row-oriented. They are,
>>>>> however,
>>>>> >> >> bytes-oriented. (Perhaps they need not be.) There seems to be a
>>>>> lot of
>>>>> >> >> overlap between what Coders express in terms of element typing
>>>>> >> >> information and what Schemas express, and I'd rather have one
>>>>> concept
>>>>> >> >> if possible. Or have a clear division of responsibilities.
>>>>> >> >>
>>>>> >> >> > *Multimap*: what does it add over an array-valued map or
>>>>> large-iterable-valued map? (honest question, not rhetorical)
>>>>> >> >>
>>>>> >> >> Multimap has a different notion of what it means to contain a
>>>>> value,
>>>>> >> >> can handle (unordered) unions of non-disjoint keys, etc. Maybe
>>>>> this
>>>>> >> >> isn't worth a new primitive type.
>>>>> >> >>
>>>>> >> >> > *URN/enum for type names*: I see the case for both. The core
>>>>> types are fundamental enough they should never really change - after all,
>>>>> proto, thrift, avro, and arrow have addressed this (not to mention most
>>>>> programming languages). Maybe additions once every few years. I prefer the
>>>>> smallest intersection of these schema languages. A oneof is more clear,
>>>>> while URN emphasizes the similarity of built-in and logical types.
>>>>> >> >>
>>>>> >> >> Hmm... Do we have any examples of the multi-level
>>>>> primitive/logical
>>>>> >> >> type in any of these other systems? I have a bias towards all
>>>>> types
>>>>> >> >> being on the same footing unless there is compelling reason to
>>>>> divide
>>>>> >> >> things into primitive/user-defined ones.
>>>>> >> >>
>>>>> >> >> Here it seems like the most essential value of the primitive
>>>>> type set
>>>>> >> >> is to describe the underlying representation, for encoding
>>>>> elements in
>>>>> >> >> a variety of ways (notably columnar, but also interfacing with
>>>>> other
>>>>> >> >> external systems like IOs). Perhaps, rather than the previous
>>>>> >> >> suggestion of making everything a logical type of bytes, this could
>>>>> be made
>>>>> >> >> clear by still making everything a logical type, but renaming
>>>>> >> >> "TypeName" to Representation. There would be URNs (typically with
>>>>> >> >> empty payloads) for the various primitive types (whose mapping to
>>>>> >> >> their representations would be the identity).
>>>>> >> >>
>>>>> >> >> - Robert
>>>>>
>>>>
