I like the concept of expressing type coercion as a wrapper coder that
says "this language treats this type as Foo." This seems useful in general
for cross-language pipelines, since it is much more likely that two
languages will agree on an encoding while wanting to express the type
differently within each language. This would allow for coders like:

LogicalTypeCoder<
  type = org.joda.time.DateTime,
  component = LogicalTypeCoder<
                type = datetime.datetime,
                component = <beam:coder:varlong:v1>>>
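
To make that concrete, here is a rough Java sketch of what such a wrapper
coder could look like. This is purely illustrative: LogicalTypeCoder is not
an existing class, and the SerializableFunction pair is a stand-in for
whatever conversion mechanism we would actually settle on.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.transforms.SerializableFunction;

// Hypothetical wrapper coder: encodes a language-specific type by first
// converting it to a base type that the component coder understands.
class LogicalTypeCoder<LanguageT, BaseT> extends CustomCoder<LanguageT> {
  private final Coder<BaseT> componentCoder;
  private final SerializableFunction<LanguageT, BaseT> toBase;
  private final SerializableFunction<BaseT, LanguageT> fromBase;

  LogicalTypeCoder(
      Coder<BaseT> componentCoder,
      SerializableFunction<LanguageT, BaseT> toBase,
      SerializableFunction<BaseT, LanguageT> fromBase) {
    this.componentCoder = componentCoder;
    this.toBase = toBase;
    this.fromBase = fromBase;
  }

  @Override
  public void encode(LanguageT value, OutputStream outStream) throws IOException {
    // The wire format is determined entirely by the component coder.
    componentCoder.encode(toBase.apply(value), outStream);
  }

  @Override
  public LanguageT decode(InputStream inStream) throws IOException {
    // Other SDKs only need to understand the component encoding; converting
    // back to the language type stays local to this SDK.
    return fromBase.apply(componentCoder.decode(inStream));
  }
}

In the joda example above, the Java SDK would construct this with the varlong
coder and joda conversions, while the Python SDK would see only the varlong
component plus its own datetime conversion.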



On Tue, May 28, 2019 at 10:11 AM Brian Hulette <bhule...@google.com> wrote:

>
>
> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> wrote:
>
>>
>>
>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <bhule...@google.com>
>> wrote:
>>
>>> *tl;dr:* SchemaCoder represents a logical type with a base type of Row
>>> and we should think about that.
>>>
>>> I'm a little concerned that the current proposals for a portable
>>> representation don't actually fully represent Schemas. It seems to me that
>>> the current Java-only Schemas are made up of three concepts that are
>>> intertwined:
>>> (a) The Java SDK specific code for schema inference, type coercion, and
>>> "schema-aware" transforms.
>>> (b) A RowCoder[1] that encodes Rows[2] which have a particular Schema[3].
>>> (c) A SchemaCoder[4] that has a RowCoder for a particular schema, and
>>> functions for converting Rows with that schema to/from a Java type T. Those
>>> functions and the RowCoder are then composed to provide a Coder for the
>>> type T.
>>>
>>
>> RowCoder is currently just an internal implementation detail; it can be
>> eliminated. SchemaCoder is the only thing that determines a schema today.
>>
> Why not keep it around? I think it would make sense to have a RowCoder
> implementation in every SDK, as well as something like SchemaCoder that
> defines a conversion from that SDK's "Row" to the language type.
>
>>
>>
>>>
>>> We're not concerned with (a) at this time since that's specific to the
>>> SDK, not the interface between them. My understanding is we just want to
>>> define a portable representation for (b) and/or (c).
>>>
>>> What has been discussed so far is really just a portable representation
>>> for (b), the RowCoder, since the discussion is only around how to represent
>>> the schema itself and not the to/from functions.
>>>
>>
>> Correct. The to/from functions are actually related to (a). One of the big
>> goals of schemas was that users should not be forced to operate on rows to
>> get schemas. A user can create PCollection<MyRandomType> and as long as the
>> SDK can infer a schema from MyRandomType, the user never needs to even see
>> a Row object. The to/fromRow functions are what make this work today.
>>
>>
>
> One of the points I'd like to make is that this type coercion is a useful
> concept on its own, separate from schemas. It's especially useful for a
> type that has a schema and is encoded by RowCoder since that can represent
> many more types, but the type coercion doesn't have to be tied to just
> schemas and RowCoder. We could also do type coercion for types that are
> effectively wrappers around an integer or a string. It could just be a
> general way to map language types to base types (i.e. types that we have a
> coder for). Then it just becomes a general framework for extending coders
> to represent more language types.
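>
> For a concrete (purely illustrative) example of what I mean, a millis-backed
> timestamp class could be coerced to a plain long, which we already have a
> coder for, with no Row or schema involved. None of the names below are
> existing API; it's just a sketch of the to/from pair:
>
> import org.apache.beam.sdk.coders.VarLongCoder;
> import org.apache.beam.sdk.transforms.SerializableFunction;
> import org.joda.time.Instant;
>
> // Hypothetical coercion for a wrapper-around-a-long type: the base type is
> // long (covered by the existing VarLongCoder), and the two functions map
> // the language type to/from that base type.
> class InstantCoercion {
>   static final VarLongCoder BASE_CODER = VarLongCoder.of();
>   static final SerializableFunction<Instant, Long> TO_BASE = Instant::getMillis;
>   static final SerializableFunction<Long, Instant> FROM_BASE = Instant::new;
> }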
>
>
>
>> One of the outstanding questions for that schema representation is how to
>>> represent logical types, which may or may not have some language type in
>>> each SDK (the canonical example being a timestamp type with seconds and
>>> nanos and java.time.Instant). I think this question is critically
>>> important, because (c), the SchemaCoder, is actually *defining a logical
>>> type* with a language type T in the Java SDK. This becomes clear when you
>>> compare SchemaCoder[4] to the Schema.LogicalType interface[5] - both
>>> essentially have three attributes: a base type, and two functions for
>>> converting to/from that base type. The only difference is for SchemaCoder
>>> that base type must be a Row so it can be represented by a Schema alone,
>>> while LogicalType can have any base type that can be represented by
>>> FieldType, including a Row.
>>>
>>
>> This is not true actually. SchemaCoder can have any base type; that's why
>> (in Java) it's SchemaCoder<T>. This is why PCollection<T> can have a
>> schema, even if T is not Row.
>>
>>
> I'm not sure I effectively communicated what I meant - When I said
> SchemaCoder's "base type" I wasn't referring to T, I was referring to the
> base FieldType, whose coder we use for this type. I meant "base type" to be
> analogous to LogicalType's `getBaseType`, or what Kenn is suggesting we
> call "representation" in the portable beam schemas doc. To define some
> terms from my original message:
> base type = an instance of FieldType, crucially this is something that we
> have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, ...)
> language type (or "T", "type T", "logical type") = Some Java class (or
> something analogous in the other SDKs) that we may or may not have a coder
> for. It's possible to define functions for converting instances of the
> language type to/from the base type.
>
> I was just trying to make the case that SchemaCoder is really a special
> case of LogicalType, where `getBaseType` always returns a Row with the
> stored Schema.
>
> To make the point with code: SchemaCoder<T> can be made to implement
> Schema.LogicalType<T,Row> with trivial implementations of getBaseType,
> toBaseType, and toInputType (I'm not trying to say we should or shouldn't
> do this, just using it to illustrate my point):
>
> class SchemaCoder<T> extends CustomCoder<T>
>     implements Schema.LogicalType<T, Row> {
>   ...
>
>   @Override
>   public FieldType getBaseType() {
>     return FieldType.row(getSchema());
>   }
>
>   @Override
>   public Row toBaseType(T input) {
>     return this.toRowFunction.apply(input);
>   }
>
>   @Override
>   public T toInputType(Row base) {
>     return this.fromRowFunction.apply(base);
>   }
>   ...
> }
>
>
>>> I think it may make sense to fully embrace this duality, by letting
>>> SchemaCoder have a baseType other than just Row and renaming it to
>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware
>>> transforms (a) would operate only on LogicalTypeCoders with a Row base
>>> type. Perhaps some of the current schema logic could also be applied more
>>> generally to any logical type - for example, to provide type coercion for
>>> logical types with a base type other than Row, like int64 and a timestamp
>>> class backed by millis, or fixed-size bytes and a UUID class. And having a
>>> portable representation that represents those (non-Row-backed) logical
>>> types with some URN would also allow us to pass them to other languages
>>> without unnecessarily wrapping them in a Row in order to use SchemaCoder.
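>>>
>>> As a purely illustrative sketch (none of these names are existing API), the
>>> to/from conversions for a UUID logical type backed by 16 fixed bytes might
>>> look like this, with no Row wrapper anywhere:
>>>
>>> import java.nio.ByteBuffer;
>>> import java.util.UUID;
>>>
>>> // The base type is a fixed-size byte array; the language type is
>>> // java.util.UUID.
>>> class UuidConversions {
>>>   static byte[] toBaseType(UUID uuid) {
>>>     return ByteBuffer.allocate(16)
>>>         .putLong(uuid.getMostSignificantBits())
>>>         .putLong(uuid.getLeastSignificantBits())
>>>         .array();
>>>   }
>>>
>>>   static UUID toInputType(byte[] bytes) {
>>>     ByteBuffer buffer = ByteBuffer.wrap(bytes);
>>>     return new UUID(buffer.getLong(), buffer.getLong());
>>>   }
>>> }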
>>>
>>
>> I think the actual overlap here is between the to/from functions in
>> SchemaCoder (which is what allows SchemaCoder<T> where T != Row) and the
>> equivalent functionality in LogicalType. However, making all of schemas
>> simply a logical type feels a bit awkward and circular to me. Maybe we
>> should refactor that part out into a LogicalTypeConversion proto, and
>> reference that from both LogicalType and from SchemaCoder?
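>>
>> (Sketching that shared piece in Java terms rather than proto, purely as a
>> strawman with a hypothetical name and shape, it would carry just the
>> conversion pair:)
>>
>> import java.io.Serializable;
>>
>> // Hypothetical shared abstraction referenced by both LogicalType and
>> // SchemaCoder; a portable proto equivalent would carry the base FieldType
>> // plus (serialized) versions of these two functions.
>> interface LogicalTypeConversion<InputT, BaseT> extends Serializable {
>>   BaseT toBaseType(InputT input);
>>   InputT toInputType(BaseT base);
>> }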
>>
>
> LogicalType is already potentially circular though. A schema can have a
> field with a logical type, and that logical type can have a base type of
> Row with a field with a logical type (and on and on...). To me it seems
> elegant, not awkward, to recognize that SchemaCoder is just a special case
> of this concept.
>
> Something like the LogicalTypeConversion proto would definitely be an
> improvement, but I would still prefer just using a top-level logical type :)
>
>>
>>
>> I've added a section to the doc [6] to propose this alternative in the
>>> context of the portable representation but I wanted to bring it up here as
>>> well to solicit feedback.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>> [4]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>> [5]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>> [6]
>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>
>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <bhule...@google.com>
>>> wrote:
>>>
>>>> Ah thanks! I added some language there.
>>>>
>>>> *From: *Kenneth Knowles <k...@apache.org>
>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>> *To: *dev
>>>>
>>>>
>>>>> *From: *Brian Hulette <bhule...@google.com>
>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>> *To: * <dev@beam.apache.org>
>>>>>
>>>>> We briefly discussed using arrow schemas in place of beam schemas
>>>>>> entirely in an arrow thread [1]. The biggest reason not to do this was
>>>>>> that we wanted to have a type for large iterables in beam schemas. But
>>>>>> given that
>>>>>> large iterables aren't currently implemented, beam schemas look very
>>>>>> similar to arrow schemas.
>>>>>>
>>>>>
>>>>>
>>>>>> I think it makes sense to take inspiration from arrow schemas where
>>>>>> possible, and maybe even copy them outright. Arrow already has a portable
>>>>>> (flatbuffers) schema representation [2], and implementations for it in 
>>>>>> many
>>>>>> languages that we may be able to re-use as we bring schemas to more SDKs
>>>>>> (the project has Python and Go implementations). There are a couple of
>>>>>> concepts in Arrow schemas that are specific to the format and wouldn't
>>>>>> make sense for us (fields can indicate whether or not they are
>>>>>> dictionary-encoded, and the schema has an endianness field), but if you
>>>>>> drop those concepts the arrow spec looks pretty similar to the beam
>>>>>> proto spec.
>>>>>>
>>>>>
>>>>> FWIW I left a blank section in the doc for filling out what the
>>>>> differences are and why, and conversely what the interop opportunities may
>>>>> be. Such sections are some of my favorite sections of design docs.
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>> Brian
>>>>>>
>>>>>> [1]
>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>> [2]
>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>
>>>>>> *From: *Robert Bradshaw <rober...@google.com>
>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>> *To: *dev
>>>>>>
>>>>>> From: Reuven Lax <re...@google.com>
>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>> To: dev
>>>>>>>
>>>>>>> > Also in the future we might be able to do optimizations at the
>>>>>>> runner level if at the portability layer we understood schemas instead
>>>>>>> of
>>>>>>> just raw coders. This could be things like only parsing a subset of a 
>>>>>>> row
>>>>>>> (if we know only a few fields are accessed) or using a columnar data
>>>>>>> structure like Arrow to encode batches of rows across portability. This
>>>>>>> doesn't affect data semantics of course, but having a richer,
>>>>>>> more-expressive type system opens up other opportunities.
>>>>>>>
>>>>>>> But we could do all of that with a RowCoder we understood to
>>>>>>> designate
>>>>>>> the type(s), right?
>>>>>>>
>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <
>>>>>>> rober...@google.com> wrote:
>>>>>>> >>
>>>>>>> >> On the flip side, Schemas are equivalent to the space of Coders
>>>>>>> with
>>>>>>> >> the addition of a RowCoder and the ability to materialize to
>>>>>>> something
>>>>>>> >> other than bytes, right? (Perhaps I'm missing something big
>>>>>>> here...)
>>>>>>> >> This may make a backwards-compatible transition easier.
>>>>>>> (SDK-side, the
>>>>>>> >> ability to reason about and operate on such types is of course
>>>>>>> much
>>>>>>> >> richer than anything Coders offer right now.)
>>>>>>> >>
>>>>>>> >> From: Reuven Lax <re...@google.com>
>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>> >> To: dev
>>>>>>> >>
>>>>>>> >> > FYI I can imagine a world in which we have no coders. We could
>>>>>>> define the entire model on top of schemas. Today's "Coder" is completely
>>>>>>> equivalent to a single-field schema with a logical-type field (actually 
>>>>>>> the
>>>>>>> latter is slightly more expressive as you aren't forced to serialize 
>>>>>>> into
>>>>>>> bytes).
>>>>>>> >> >
>>>>>>> >> > Due to compatibility constraints and the effort that would be
>>>>>>> involved in such a change, I think the practical decision should be for
>>>>>>> schemas and coders to coexist for the time being. However when we start
>>>>>>> planning Beam 3.0, deprecating coders is something I would like to 
>>>>>>> suggest.
>>>>>>> >> >
>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <
>>>>>>> rober...@google.com> wrote:
>>>>>>> >> >>
>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>> >> >> To: dev
>>>>>>> >> >>
>>>>>>> >> >> > This is a huge development. Top posting because I can be
>>>>>>> more compact.
>>>>>>> >> >> >
>>>>>>> >> >> > I really think after the initial idea converges this needs a
>>>>>>> design doc with goals and alternatives. It is an extraordinarily
>>>>>>> consequential model change. So in the spirit of doing the work / bias
>>>>>>> towards action, I created a quick draft at
>>>>>>> https://s.apache.org/beam-schemas and added everyone on this thread
>>>>>>> as editors. I am still in the process of writing this to match the 
>>>>>>> thread.
>>>>>>> >> >>
>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>> >> >>
>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types
>>>>>>> to represent nanos the same way Java and proto do.
>>>>>>> >> >>
>>>>>>> >> >> As per the other discussion, I'm unsure the value in supporting
>>>>>>> >> >> multiple timestamp resolutions is high enough to outweigh the
>>>>>>> cost.
>>>>>>> >> >>
>>>>>>> >> >> > *Why multiple int types?* The domain of values for these
>>>>>>> types is different. For a language with one "int" or "number" type, that's
>>>>>>> that's
>>>>>>> another domain of values.
>>>>>>> >> >>
>>>>>>> >> >> What is the value in having different domains? If your data
>>>>>>> has a
>>>>>>> >> >> natural domain, chances are it doesn't line up exactly with
>>>>>>> one of
>>>>>>> >> >> these. I guess it's for languages whose types have specific
>>>>>>> domains?
>>>>>>> >> >> (There's also compactness in representation, encoded and
>>>>>>> in-memory,
>>>>>>> >> >> though I'm not sure that's high.)
>>>>>>> >> >>
>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take
>>>>>>> this path is paramount. So tying it directly to a row-oriented coder seems
>>>>>>> seems
>>>>>>> counterproductive.
>>>>>>> >> >>
>>>>>>> >> >> I don't think Coders are necessarily row-oriented. They are,
>>>>>>> however,
>>>>>>> >> >> bytes-oriented. (Perhaps they need not be.) There seems to be
>>>>>>> a lot of
>>>>>>> >> >> overlap between what Coders express in terms of element typing
>>>>>>> >> >> information and what Schemas express, and I'd rather have one
>>>>>>> concept
>>>>>>> >> >> if possible. Or have a clear division of responsibilities.
>>>>>>> >> >>
>>>>>>> >> >> > *Multimap*: what does it add over an array-valued map or
>>>>>>> large-iterable-valued map? (honest question, not rhetorical)
>>>>>>> >> >>
>>>>>>> >> >> Multimap has a different notion of what it means to contain a
>>>>>>> value,
>>>>>>> >> >> can handle (unordered) unions of non-disjoint keys, etc. Maybe
>>>>>>> this
>>>>>>> >> >> isn't worth a new primitive type.
>>>>>>> >> >>
>>>>>>> >> >> > *URN/enum for type names*: I see the case for both. The core
>>>>>>> types are fundamental enough they should never really change - after 
>>>>>>> all,
>>>>>>> proto, thrift, avro, and arrow have addressed this (not to mention most
>>>>>>> programming languages). Maybe additions once every few years. I prefer 
>>>>>>> the
>>>>>>> smallest intersection of these schema languages. A oneof is more clear,
>>>>>>> while URN emphasizes the similarity of built-in and logical types.
>>>>>>> >> >>
>>>>>>> >> >> Hmm... Do we have any examples of the multi-level
>>>>>>> primitive/logical
>>>>>>> >> >> type in any of these other systems? I have a bias towards all
>>>>>>> types
>>>>>>> >> >> being on the same footing unless there is compelling reason to
>>>>>>> divide
>>>>>>> >> >> things into primitive/user-defined ones.
>>>>>>> >> >>
>>>>>>> >> >> Here it seems like the most essential value of the primitive
>>>>>>> type set
>>>>>>> >> >> is to describe the underlying representation, for encoding
>>>>>>> elements in
>>>>>>> >> >> a variety of ways (notably columnar, but also interfacing with
>>>>>>> other
>>>>>>> >> >> external systems like IOs). Perhaps, rather than the previous
>>>>>>> >> >> suggestion of making everything a logical type of bytes, this could
>>>>>>> be made
>>>>>>> >> >> clear by still making everything a logical type, but renaming
>>>>>>> >> >> "TypeName" to Representation. There would be URNs (typically
>>>>>>> with
>>>>>>> >> >> empty payloads) for the various primitive types (whose mapping
>>>>>>> to
>>>>>>> >> >> their representations would be the identity).
>>>>>>> >> >>
>>>>>>> >> >> - Robert
>>>>>>>
>>>>>>
