Your reasoning about SchemaCoder really being a type coercion coder makes a lot of sense to me.
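[Editor's note: the "type coercion coder" framing can be made concrete with a minimal sketch. This is illustrative Python, not Beam's actual API - RowCoder here is a stand-in, and a Row is modeled as a plain dict. The point is only that a SchemaCoder-style coder for a type T is a Row coder plus a pair of coercion functions.]

```python
# Hypothetical sketch: a coder for a user type T is a Row coder composed
# with to/from conversion functions, i.e. a type-coercion layer over Rows.
import json
from typing import Callable, Generic, NamedTuple, TypeVar

T = TypeVar("T")

class RowCoder:
    """Stand-in for a real RowCoder: encodes a dict-shaped 'Row'."""
    def encode(self, row: dict) -> bytes:
        return json.dumps(row, sort_keys=True).encode()
    def decode(self, data: bytes) -> dict:
        return json.loads(data.decode())

class SchemaCoder(Generic[T]):
    """Coder for T = RowCoder + (T -> Row) + (Row -> T)."""
    def __init__(self, row_coder: RowCoder,
                 to_row: Callable[[T], dict],
                 from_row: Callable[[dict], T]):
        self.row_coder = row_coder
        self.to_row = to_row
        self.from_row = from_row
    def encode(self, value: T) -> bytes:
        # Coerce T to its Row form, then delegate to the Row coder.
        return self.row_coder.encode(self.to_row(value))
    def decode(self, data: bytes) -> T:
        # Decode the Row, then coerce back to T.
        return self.from_row(self.row_coder.decode(data))

class Point(NamedTuple):
    x: int
    y: int

point_coder = SchemaCoder(
    RowCoder(),
    to_row=lambda p: {"x": p.x, "y": p.y},
    from_row=lambda r: Point(r["x"], r["y"]))

assert point_coder.decode(point_coder.encode(Point(1, 2))) == Point(1, 2)
```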
On Fri, May 24, 2019 at 11:42 AM Brian Hulette <[email protected]> wrote:

> *tl;dr:* SchemaCoder represents a logical type with a base type of Row and we should think about that.
>
> I'm a little concerned that the current proposals for a portable representation don't actually fully represent Schemas. It seems to me that the current java-only Schemas are made up of three concepts that are intertwined:
> (a) The Java SDK specific code for schema inference, type coercion, and "schema-aware" transforms.
> (b) A RowCoder[1] that encodes Rows[2] which have a particular Schema[3].
> (c) A SchemaCoder[4] that has a RowCoder for a particular schema, and functions for converting Rows with that schema to/from a Java type T. Those functions and the RowCoder are then composed to provide a Coder for the type T.
>
> We're not concerned with (a) at this time since that's specific to the SDK, not the interface between them. My understanding is we just want to define a portable representation for (b) and/or (c).
>
> What has been discussed so far is really just a portable representation for (b), the RowCoder, since the discussion is only around how to represent the schema itself and not the to/from functions. One of the outstanding questions for that schema representation is how to represent logical types, which may or may not have some language type in each SDK (the canonical example being a timestamp type with seconds and nanos and java.time.Instant). I think this question is critically important, because (c), the SchemaCoder, is actually *defining a logical type* with a language type T in the Java SDK. This becomes clear when you compare SchemaCoder[4] to the Schema.LogicalType interface[5] - both essentially have three attributes: a base type, and two functions for converting to/from that base type.
> The only difference is that for SchemaCoder the base type must be a Row so it can be represented by a Schema alone, while LogicalType can have any base type that can be represented by FieldType, including a Row.
>
> I think it may make sense to fully embrace this duality, by letting SchemaCoder have a baseType other than just Row and renaming it to LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware transforms (a) would operate only on LogicalTypeCoders with a Row base type. Perhaps some of the current schema logic could also be applied more generally to any logical type - for example, to provide type coercion for logical types with a base type other than Row, like int64 and a timestamp class backed by millis, or fixed-size bytes and a UUID class. And having a portable representation that represents those (non-Row-backed) logical types with some URN would also allow us to pass them to other languages without unnecessarily wrapping them in a Row in order to use SchemaCoder.
>
> I've added a section to the doc [6] to propose this alternative in the context of the portable representation, but I wanted to bring it up here as well to solicit feedback.
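[Editor's note: a sketch of the generalization proposed above - a LogicalTypeCoder whose base type need not be Row. The names and base coders here are illustrative assumptions, not Beam's actual API; the two examples mirror the ones in the email: a millis-backed timestamp over int64, and a UUID over fixed-size bytes.]

```python
# Hypothetical LogicalTypeCoder: like SchemaCoder, but the base type can
# be any representable type (int64, bytes, ...), not only Row.
import struct
import uuid
from datetime import datetime, timezone

class Int64Coder:
    def encode(self, v: int) -> bytes:
        return struct.pack(">q", v)          # big-endian signed 64-bit
    def decode(self, b: bytes) -> int:
        return struct.unpack(">q", b)[0]

class BytesCoder:
    def encode(self, v: bytes) -> bytes:
        return v
    def decode(self, b: bytes) -> bytes:
        return b

class LogicalTypeCoder:
    """Coder for a language type T backed by a (possibly non-Row) base type."""
    def __init__(self, base_coder, to_base, from_base):
        self.base_coder = base_coder
        self.to_base = to_base
        self.from_base = from_base
    def encode(self, value) -> bytes:
        return self.base_coder.encode(self.to_base(value))
    def decode(self, data: bytes):
        return self.from_base(self.base_coder.decode(data))

# A timestamp class backed by millis, with int64 as the base type:
millis_instant_coder = LogicalTypeCoder(
    Int64Coder(),
    to_base=lambda dt: int(dt.timestamp() * 1000),
    from_base=lambda ms: datetime.fromtimestamp(ms / 1000, tz=timezone.utc))

# A UUID class backed by fixed-size bytes:
uuid_coder = LogicalTypeCoder(
    BytesCoder(),
    to_base=lambda u: u.bytes,
    from_base=lambda b: uuid.UUID(bytes=b))

u = uuid.uuid4()
assert uuid_coder.decode(uuid_coder.encode(u)) == u
```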
> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
> [6] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>
> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <[email protected]> wrote:
>
>> Ah thanks! I added some language there.
>>
>> *From:* Kenneth Knowles <[email protected]>
>> *Date:* Thu, May 9, 2019 at 5:31 PM
>> *To:* dev
>>
>>> *From:* Brian Hulette <[email protected]>
>>> *Date:* Thu, May 9, 2019 at 2:02 PM
>>> *To:* <[email protected]>
>>>
>>>> We briefly discussed using arrow schemas in place of beam schemas entirely in an arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in beam schemas. But given that large iterables aren't currently implemented, beam schemas look very similar to arrow schemas.
>>>>
>>>> I think it makes sense to take inspiration from arrow schemas where possible, and maybe even copy them outright. Arrow already has a portable (flatbuffers) schema representation [2], and implementations for it in many languages that we may be able to re-use as we bring schemas to more SDKs (the project has Python and Go implementations).
>>>> There are a couple of concepts in Arrow schemas that are specific to the format and wouldn't make sense for us (fields can indicate whether or not they are dictionary encoded, and the schema has an endianness field), but if you drop those concepts the arrow spec looks pretty similar to the beam proto spec.
>>>
>>> FWIW I left a blank section in the doc for filling out what the differences are and why, and conversely what the interop opportunities may be. Such sections are some of my favorite sections of design docs.
>>>
>>> Kenn
>>>
>>>> Brian
>>>>
>>>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>
>>>> *From:* Robert Bradshaw <[email protected]>
>>>> *Date:* Thu, May 9, 2019 at 1:38 PM
>>>> *To:* dev
>>>>
>>>>> From: Reuven Lax <[email protected]>
>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>> To: dev
>>>>>
>>>>> > Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like Arrow to encode batches of rows across portability. This doesn't affect data semantics of course, but having a richer, more-expressive type system opens up other opportunities.
>>>>>
>>>>> But we could do all of that with a RowCoder we understood to designate the type(s), right?
>>>>>
>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <[email protected]> wrote:
>>>>> >>
>>>>> >> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...)
>>>>> >> This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>> >>
>>>>> >> From: Reuven Lax <[email protected]>
>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>> >> To: dev
>>>>> >>
>>>>> >> > FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive, as you aren't forced to serialize into bytes).
>>>>> >> >
>>>>> >> > Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However, when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
>>>>> >> >
>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <[email protected]> wrote:
>>>>> >> >>
>>>>> >> >> From: Kenneth Knowles <[email protected]>
>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>> >> >> To: dev
>>>>> >> >>
>>>>> >> >> > This is a huge development. Top posting because I can be more compact.
>>>>> >> >> >
>>>>> >> >> > I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>> >> >>
>>>>> >> >> Thanks! Added some comments there.
>>>>> >> >>
>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
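[Editor's note: a minimal sketch of the nanos-as-logical-type idea above - a nanosecond timestamp whose base type is a row of (seconds, nanos) fields, the shape protobuf's Timestamp and java.time.Instant both use. Names and the dict-as-Row modeling are illustrative, not Beam's actual API.]

```python
# A nanos timestamp as a logical type over a Row base type.
from typing import NamedTuple

class NanosInstant(NamedTuple):
    seconds: int
    nanos: int  # 0..999_999_999, as in proto Timestamp / java.time.Instant

def to_base_row(ts: NanosInstant) -> dict:
    """Convert the logical value to its base-type Row (modeled as a dict)."""
    return {"seconds": ts.seconds, "nanos": ts.nanos}

def from_base_row(row: dict) -> NanosInstant:
    """Recover the logical value from the base Row."""
    return NanosInstant(row["seconds"], row["nanos"])

ts = NanosInstant(seconds=1_558_000_000, nanos=123_456_789)
assert from_base_row(to_base_row(ts)) == ts
```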
>>>>> >> >>
>>>>> >> >> As per the other discussion, I'm unsure the value in supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>> >> >>
>>>>> >> >> > *Why multiple int types?* The domain of values for these types is different. For a language with one "int" or "number" type, that's another domain of values.
>>>>> >> >>
>>>>> >> >> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains? (There's also compactness in representation, encoded and in-memory, though I'm not sure that's high.)
>>>>> >> >>
>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>> >> >>
>>>>> >> >> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>> >> >>
>>>>> >> >> > *Multimap*: what does it add over an array-valued map or large-iterable-valued map? (honest question, not rhetorical)
>>>>> >> >>
>>>>> >> >> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
>>>>> >> >>
>>>>> >> >> > *URN/enum for type names*: I see the case for both.
>>>>> >> >> > The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is more clear, while a URN emphasizes the similarity of built-in and logical types.
>>>>> >> >>
>>>>> >> >> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
>>>>> >> >>
>>>>> >> >> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to Representation. There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>> >> >>
>>>>> >> >> - Robert
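[Editor's note: a sketch of the closing suggestion - make *every* type a logical type identified by a URN over a small set of Representations, with primitives being the identity mapping. The URN strings and class shape here are assumptions for illustration, not an actual Beam proto.]

```python
# Every type is a logical type: a URN, a Representation, and a to/from
# mapping. "Primitive" types are just the ones whose mapping is identity.
from datetime import datetime, timezone

REPRESENTATIONS = {"int64", "double", "string", "bytes", "row"}

class LogicalType:
    def __init__(self, urn, representation, to_repr=None, from_repr=None):
        assert representation in REPRESENTATIONS
        self.urn = urn
        self.representation = representation
        # Default: identity mapping, i.e. a "primitive" type.
        self.to_repr = to_repr or (lambda v: v)
        self.from_repr = from_repr or (lambda v: v)

# A primitive type: URN with an identity mapping onto its representation.
INT64 = LogicalType("beam:type:int64:v1", "int64")

# A user-defined logical type on the same footing, with a real mapping.
MILLIS_INSTANT = LogicalType(
    "beam:type:millis_instant:v1", "int64",
    to_repr=lambda dt: int(dt.timestamp() * 1000),
    from_repr=lambda ms: datetime.fromtimestamp(ms / 1000, tz=timezone.utc))

assert INT64.to_repr(42) == 42
dt = datetime(2019, 5, 9, tzinfo=timezone.utc)
assert MILLIS_INSTANT.from_repr(MILLIS_INSTANT.to_repr(dt)) == dt
```

Under this framing, a runner only needs the Representation to move bytes (or columns) around, while SDKs that know the URN can surface the richer language type.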
