On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette <[email protected]> wrote:
> > > On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax <[email protected]> wrote: > >> >> >> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette <[email protected]> >> wrote: >> >>> > It has to go into the proto somewhere (since that's the only way the >>> SDK can get it), but I'm not sure they should be considered integral parts >>> of the type. >>> Are you just advocating for an approach where any SDK-specific >>> information is stored outside of the Schema message itself so that Schema >>> really does just represent the type? That seems reasonable to me, and >>> alleviates my concerns about how this applies to columnar encodings a bit >>> as well. >>> >> >> Yes, that's exactly what I'm advocating. >> >> >>> >>> We could lift all of the LogicalTypeConversion messages out of the >>> Schema and the LogicalType like this: >>> >>> message SchemaCoder { >>> Schema schema = 1; >>> LogicalTypeConversion root_conversion = 2; >>> map<string, LogicalTypeConversion> attribute_conversions = 3; // only >>> necessary for user type aliases, portable logical types by definition have >>> nothing SDK-specific >>> } >>> >> >> I'm not sure what the map is for? I think we have status quo wihtout it. >> > > My intention was that the SDK-specific information (to/from functions) for > any nested fields that are themselves user type aliases would be stored in > this map. That was the motivation for my next question, if we don't allow > user types to be nested within other user types we may not need it. > Oh, is this meant to contain the ids of all the logical types in this schema? If so I don't think SchemaCoder is the right place for this. Any "registry" of logical types should be global to the pipeline, not scoped to a single PCollection IMO. > I may be missing your meaning - but I think we currently only have status > quo without this map in the Java SDK because Schema.LogicalType is just an > interface that must be implemented. It's appropriate for just portable > logical types, not user-type aliases. Note I've adopted Kenn's terminology > where portable logical type is a type that can be identified by just a URN > and maybe some parameters, while a user type alias needs some SDK specific > information, like a class and to/from UDFs. > > >> >>> I think a critical question (that has implications for the above >>> proposal) is how/if the two different concepts Kenn mentioned are allowed >>> to nest. For example, you could argue it's redundant to have a user type >>> alias that has a Row representation with a field that is itself a user type >>> alias, because instead you could just have a single top-level type alias >>> with to/from functions that pack and unpack the entire hierarchy. On the >>> other hand, I think it does make sense for a user type alias or a truly >>> portable logical type to have a field that is itself a truly portable >>> logical type (e.g. a user type alias or portable type with a DateTime). >>> >>> I've been assuming that user-type aliases could be nested, but should we >>> disallow that? Or should we go the other way and require that logical types >>> define at most one "level"? >>> >> >> No I think it's useful to allow things to be nested (though of course the >> nesting must terminate). >> > >> >>> >>> Brian >>> >>> On Mon, Jun 3, 2019 at 11:08 AM Kenneth Knowles <[email protected]> wrote: >>> >>>> >>>> On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax <[email protected]> wrote: >>>> >>>>> So I feel a bit leery about making the to/from functions a fundamental >>>>> part of the portability representation. In my mind, that is very tied to a >>>>> specific SDK/language. A SDK (say the Java SDK) wants to allow users to >>>>> use >>>>> a wide variety of native types with schemas, and under the covers uses the >>>>> to/from functions to implement that. However from the portable Beam >>>>> perspective, the schema itself should be the real "type" of the >>>>> PCollection; the to/from methods are simply a way that a particular SDK >>>>> makes schemas easier to use. It has to go into the proto somewhere (since >>>>> that's the only way the SDK can get it), but I'm not sure they should be >>>>> considered integral parts of the type. >>>>> >>>> >>>> On the doc in a couple places this distinction was made: >>>> >>>> * For truly portable logical types, no instructions for the SDK are >>>> needed. Instead, they require: >>>> - URN: a standardized identifier any SDK can recognize >>>> - A spec: what is the universe of values in this type? >>>> - A representation: how is it represented in built-in types? This is >>>> how SDKs who do not know/care about the URN will process it >>>> - (optional): SDKs choose preferred SDK-specific types to embed the >>>> values in. SDKs have to know about the URN and choose for themselves. >>>> >>>> *For user-level type aliases, written as convenience by the user in >>>> their pipeline, what Java schemas have today: >>>> - to/from UDFs: the code is SDK-specific >>>> - some representation of the intended type (like java class): also >>>> SDK specific >>>> - a representation >>>> - any "id" is just like other ids in the pipeline, just avoiding >>>> duplicating the proto >>>> - Luke points out that nesting these can give multiple SDKs a hint >>>> >>>> In my mind the remaining complexity is whether or not we need to be >>>> able to move between the two. Composite PTransforms, for example, do have >>>> fluidity between being strictly user-defined versus portable URN+payload. >>>> But it requires lots of engineering, namely the current work on expansion >>>> service. >>>> >>>> Kenn >>>> >>>> >>>>> On Mon, Jun 3, 2019 at 10:23 AM Brian Hulette <[email protected]> >>>>> wrote: >>>>> >>>>>> Ah I see, I didn't realize that. Then I suppose we'll need to/from >>>>>> functions somewhere in the logical type conversion to preserve the >>>>>> current >>>>>> behavior. >>>>>> >>>>>> I'm still a little hesitant to make these functions an explicit part >>>>>> of LogicalTypeConversion for another reason. Down the road, schemas could >>>>>> give us an avenue to use a batched columnar format (presumably arrow, but >>>>>> of course others are possible). By making to/from an explicit part of >>>>>> logical types we add some element-wise logic to a schema representation >>>>>> that's otherwise ambivalent to element-wise vs. batched encodings. >>>>>> >>>>>> I suppose you could make an argument that to/from are only for >>>>>> custom types. There will also be some set of well-known types identified >>>>>> only by URN and some parameters, which could easily be translated to a >>>>>> columnar format. We could just not support custom types fully if we add a >>>>>> columnar encoding, or maybe add optional toBatch/fromBatch functions >>>>>> when/if we get there. >>>>>> >>>>>> What about something like this that makes the two different types of >>>>>> logical types explicit? >>>>>> >>>>>> // Describes a logical type and how to convert between it and its >>>>>> representation (e.g. Row). >>>>>> message LogicalTypeConversion { >>>>>> oneof conversion { >>>>>> message Standard standard = 1; >>>>>> message Custom custom = 2; >>>>>> } >>>>>> >>>>>> message Standard { >>>>>> String urn = 1; >>>>>> repeated string args = 2; // could also be a map >>>>>> } >>>>>> >>>>>> message Custom { >>>>>> FunctionSpec(?) toRepresentation = 1; >>>>>> FunctionSpec(?) fromRepresentation = 2; >>>>>> bytes type = 3; // e.g. serialized class for Java >>>>>> } >>>>>> } >>>>>> >>>>>> And LogicalType and Schema become: >>>>>> >>>>>> message LogicalType { >>>>>> FieldType representation = 1; >>>>>> LogicalTypeConversion conversion = 2; >>>>>> } >>>>>> >>>>>> message Schema { >>>>>> ... >>>>>> repeated Field fields = 1; >>>>>> LogicalTypeConversion conversion = 2; // implied that >>>>>> representation is Row >>>>>> } >>>>>> >>>>>> Brian >>>>>> >>>>>> On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <[email protected]> wrote: >>>>>> >>>>>>> Keep in mind that right now the SchemaRegistry is only assumed to >>>>>>> exist at graph-construction time, not at execution time; all >>>>>>> information in >>>>>>> the schema registry is embedded in the SchemaCoder, which is the only >>>>>>> thing >>>>>>> we keep around when the pipeline is actually running. We could look into >>>>>>> changing this, but it would potentially be a very big change, and I do >>>>>>> think we should start getting users actively using schemas soon. >>>>>>> >>>>>>> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> > Can you propose what the protos would look like in this case? >>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>> functions in >>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>> >>>>>>>> Maybe. Right now the proposed LogicalType message is pretty >>>>>>>> simple/generic: >>>>>>>> message LogicalType { >>>>>>>> FieldType representation = 1; >>>>>>>> string logical_urn = 2; >>>>>>>> bytes logical_payload = 3; >>>>>>>> } >>>>>>>> >>>>>>>> If we keep just logical_urn and logical_payload, the >>>>>>>> logical_payload could itself be a protobuf with attributes of 1) a >>>>>>>> serialized class and 2/3) to/from functions. Or, alternatively, we >>>>>>>> could >>>>>>>> have a generalization of the SchemaRegistry for logical types. >>>>>>>> Implementations for standard types and user-defined types would be >>>>>>>> registered by URN, and the SDK could look them up given just a URN. I >>>>>>>> put a >>>>>>>> brief section about this alternative in the doc last week [1]. What I >>>>>>>> suggested there included removing the logical_payload field, which is >>>>>>>> probably overkill. The critical piece is just relying on a registry in >>>>>>>> the >>>>>>>> SDK to look up types and to/from functions rather than storing them in >>>>>>>> the >>>>>>>> portable schema itself. >>>>>>>> >>>>>>>> I kind of like keeping the LogicalType message generic for now, >>>>>>>> since it gives us a way to try out these various approaches, but maybe >>>>>>>> that's just a cop out. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy >>>>>>>> >>>>>>>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base >>>>>>>>>>>> type of Row and we should think about that. >>>>>>>>>>>> >>>>>>>>>>>> I'm a little concerned that the current proposals for a >>>>>>>>>>>> portable representation don't actually fully represent Schemas. It >>>>>>>>>>>> seems to >>>>>>>>>>>> me that the current java-only Schemas are made up three concepts >>>>>>>>>>>> that are >>>>>>>>>>>> intertwined: >>>>>>>>>>>> (a) The Java SDK specific code for schema inference, type >>>>>>>>>>>> coercion, and "schema-aware" transforms. >>>>>>>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a particular >>>>>>>>>>>> Schema[3]. >>>>>>>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a >>>>>>>>>>>> particular schema, and functions for converting Rows with that >>>>>>>>>>>> schema >>>>>>>>>>>> to/from a Java type T. Those functions and the RowCoder are then >>>>>>>>>>>> composed >>>>>>>>>>>> to provider a Coder for the type T. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> RowCoder is currently just an internal implementation detail, it >>>>>>>>>>> can be eliminated. SchemaCoder is the only thing that determines a >>>>>>>>>>> schema >>>>>>>>>>> today. >>>>>>>>>>> >>>>>>>>>> Why not keep it around? I think it would make sense to have a >>>>>>>>>> RowCoder implementation in every SDK, as well as something like >>>>>>>>>> SchemaCoder >>>>>>>>>> that defines a conversion from that SDK's "Row" to the language type. >>>>>>>>>> >>>>>>>>> >>>>>>>>> The point is that from a programmer's perspective, there is >>>>>>>>> nothing much special about Row. Any type can have a schema, and the >>>>>>>>> only >>>>>>>>> special thing about Row is that it's always guaranteed to exist. From >>>>>>>>> that >>>>>>>>> standpoint, Row is nearly an implementation detail. Today RowCoder is >>>>>>>>> never >>>>>>>>> set on _any_ PCollection, it's literally just used as a helper >>>>>>>>> library, so >>>>>>>>> there's no real need for it to exist as a "Coder." >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> We're not concerned with (a) at this time since that's specific >>>>>>>>>>>> to the SDK, not the interface between them. My understanding is we >>>>>>>>>>>> just >>>>>>>>>>>> want to define a portable representation for (b) and/or (c). >>>>>>>>>>>> >>>>>>>>>>>> What has been discussed so far is really just a portable >>>>>>>>>>>> representation for (b), the RowCoder, since the discussion is only >>>>>>>>>>>> around >>>>>>>>>>>> how to represent the schema itself and not the to/from functions. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Correct. The to/from functions are actually related to a). One >>>>>>>>>>> of the big goals of schemas was that users should not be forced to >>>>>>>>>>> operate >>>>>>>>>>> on rows to get schemas. A user can create PCollection<MyRandomType> >>>>>>>>>>> and as >>>>>>>>>>> long as the SDK can infer a schema from MyRandomType, the user >>>>>>>>>>> never needs >>>>>>>>>>> to even see a Row object. The to/fromRow functions are what make >>>>>>>>>>> this work >>>>>>>>>>> today. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> One of the points I'd like to make is that this type coercion is >>>>>>>>>> a useful concept on it's own, separate from schemas. It's especially >>>>>>>>>> useful >>>>>>>>>> for a type that has a schema and is encoded by RowCoder since that >>>>>>>>>> can >>>>>>>>>> represent many more types, but the type coercion doesn't have to be >>>>>>>>>> tied to >>>>>>>>>> just schemas and RowCoder. We could also do type coercion for types >>>>>>>>>> that >>>>>>>>>> are effectively wrappers around an integer or a string. It could >>>>>>>>>> just be a >>>>>>>>>> general way to map language types to base types (i.e. types that we >>>>>>>>>> have a >>>>>>>>>> coder for). Then it just becomes a general framework for extending >>>>>>>>>> coders >>>>>>>>>> to represent more language types. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Let's not tie those conversations. Maybe a similar concept will >>>>>>>>> hold true for general coders (or we might decide to get rid of coders >>>>>>>>> in >>>>>>>>> favor of schemas, in which case that becomes moot), but I don't think >>>>>>>>> we >>>>>>>>> should prematurely generalize. >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> One of the outstanding questions for that schema representation >>>>>>>>>>>> is how to represent logical types, which may or may not have some >>>>>>>>>>>> language >>>>>>>>>>>> type in each SDK (the canonical example being a timsetamp type >>>>>>>>>>>> with seconds >>>>>>>>>>>> and nanos and java.time.Instant). I think this question is >>>>>>>>>>>> critically >>>>>>>>>>>> important, because (c), the SchemaCoder, is actually *defining a >>>>>>>>>>>> logical >>>>>>>>>>>> type* with a language type T in the Java SDK. This becomes clear >>>>>>>>>>>> when you >>>>>>>>>>>> compare SchemaCoder[4] to the Schema.LogicalType interface[5] - >>>>>>>>>>>> both >>>>>>>>>>>> essentially have three attributes: a base type, and two functions >>>>>>>>>>>> for >>>>>>>>>>>> converting to/from that base type. The only difference is for >>>>>>>>>>>> SchemaCoder >>>>>>>>>>>> that base type must be a Row so it can be represented by a Schema >>>>>>>>>>>> alone, >>>>>>>>>>>> while LogicalType can have any base type that can be represented by >>>>>>>>>>>> FieldType, including a Row. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> This is not true actually. SchemaCoder can have any base type, >>>>>>>>>>> that's why (in Java) it's SchemaCoder<T>. This is why >>>>>>>>>>> PCollection<T> can >>>>>>>>>>> have a schema, even if T is not Row. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> I'm not sure I effectively communicated what I meant - When I >>>>>>>>>> said SchemaCoder's "base type" I wasn't referring to T, I was >>>>>>>>>> referring to >>>>>>>>>> the base FieldType, whose coder we use for this type. I meant "base >>>>>>>>>> type" >>>>>>>>>> to be analogous to LogicalType's `getBaseType`, or what Kenn is >>>>>>>>>> suggesting >>>>>>>>>> we call "representation" in the portable beam schemas doc. To define >>>>>>>>>> some >>>>>>>>>> terms from my original message: >>>>>>>>>> base type = an instance of FieldType, crucially this is something >>>>>>>>>> that we have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, >>>>>>>>>> ...) >>>>>>>>>> language type (or "T", "type T", "logical type") = Some Java >>>>>>>>>> class (or something analogous in the other SDKs) that we may or may >>>>>>>>>> not >>>>>>>>>> have a coder for. It's possible to define functions for converting >>>>>>>>>> instances of the language type to/from the base type. >>>>>>>>>> >>>>>>>>>> I was just trying to make the case that SchemaCoder is really a >>>>>>>>>> special case of LogicalType, where `getBaseType` always returns a >>>>>>>>>> Row with >>>>>>>>>> the stored Schema. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Yeah, I think I got that point. >>>>>>>>> >>>>>>>>> Can you propose what the protos would look like in this case? >>>>>>>>> Right now LogicalType does not contain the to/from conversion >>>>>>>>> functions in >>>>>>>>> the proto. Do you think we'll need to add these in? >>>>>>>>> >>>>>>>>> >>>>>>>>>> To make the point with code: SchemaCoder<T> can be made to >>>>>>>>>> implement Schema.LogicalType<T,Row> with trivial implementations of >>>>>>>>>> getBaseType, toBaseType, and toInputType (I'm not trying to say we >>>>>>>>>> should >>>>>>>>>> or shouldn't do this, just using it illustrate my point): >>>>>>>>>> >>>>>>>>>> class SchemaCoder extends CustomCoder<T> implements >>>>>>>>>> Schema.LogicalType<T, Row> { >>>>>>>>>> ... >>>>>>>>>> >>>>>>>>>> @Override >>>>>>>>>> FieldType getBaseType() { >>>>>>>>>> return FieldType.row(getSchema()); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @Override >>>>>>>>>> public Row toBaseType() { >>>>>>>>>> return this.toRowFunction.apply(input); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @Override >>>>>>>>>> public T toInputType(Row base) { >>>>>>>>>> return this.fromRowFunction.apply(base); >>>>>>>>>> } >>>>>>>>>> ... >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> I think it may make sense to fully embrace this duality, by >>>>>>>>>>>> letting SchemaCoder have a baseType other than just Row and >>>>>>>>>>>> renaming it to >>>>>>>>>>>> LogicalTypeCoder/LanguageTypeCoder. The current Java SDK >>>>>>>>>>>> schema-aware >>>>>>>>>>>> transforms (a) would operate only on LogicalTypeCoders with a Row >>>>>>>>>>>> base >>>>>>>>>>>> type. Perhaps some of the current schema logic could alsobe >>>>>>>>>>>> applied more >>>>>>>>>>>> generally to any logical type - for example, to provide type >>>>>>>>>>>> coercion for >>>>>>>>>>>> logical types with a base type other than Row, like int64 and a >>>>>>>>>>>> timestamp >>>>>>>>>>>> class backed by millis, or fixed size bytes and a UUID class. And >>>>>>>>>>>> having a >>>>>>>>>>>> portable representation that represents those (non Row backed) >>>>>>>>>>>> logical >>>>>>>>>>>> types with some URN would also allow us to pass them to other >>>>>>>>>>>> languages >>>>>>>>>>>> without unnecessarily wrapping them in a Row in order to use >>>>>>>>>>>> SchemaCoder. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I think the actual overlap here is between the to/from functions >>>>>>>>>>> in SchemaCoder (which is what allows SchemaCoder<T> where T != Row) >>>>>>>>>>> and the >>>>>>>>>>> equivalent functionality in LogicalType. However making all of >>>>>>>>>>> schemas >>>>>>>>>>> simply just a logical type feels a bit awkward and circular to me. >>>>>>>>>>> Maybe we >>>>>>>>>>> should refactor that part out into a LogicalTypeConversion proto, >>>>>>>>>>> and >>>>>>>>>>> reference that from both LogicalType and from SchemaCoder? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> LogicalType is already potentially circular though. A schema can >>>>>>>>>> have a field with a logical type, and that logical type can have a >>>>>>>>>> base >>>>>>>>>> type of Row with a field with a logical type (and on and on...). To >>>>>>>>>> me it >>>>>>>>>> seems elegant, not awkward, to recognize that SchemaCoder is just a >>>>>>>>>> special >>>>>>>>>> case of this concept. >>>>>>>>>> >>>>>>>>>> Something like the LogicalTypeConversion proto would definitely >>>>>>>>>> be an improvement, but I would still prefer just using a top-level >>>>>>>>>> logical >>>>>>>>>> type :) >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I've added a section to the doc [6] to propose this alternative >>>>>>>>>>>> in the context of the portable representation but I wanted to >>>>>>>>>>>> bring it up >>>>>>>>>>>> here as well to solicit feedback. >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41 >>>>>>>>>>>> [2] >>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59 >>>>>>>>>>>> [3] >>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48 >>>>>>>>>>>> [4] >>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33 >>>>>>>>>>>> [5] >>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489 >>>>>>>>>>>> [6] >>>>>>>>>>>> https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin >>>>>>>>>>>> >>>>>>>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Ah thanks! I added some language there. >>>>>>>>>>>>> >>>>>>>>>>>>> *From: *Kenneth Knowles <[email protected]> >>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM >>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> *From: *Brian Hulette <[email protected]> >>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM >>>>>>>>>>>>>> *To: * <[email protected]> >>>>>>>>>>>>>> >>>>>>>>>>>>>> We briefly discussed using arrow schemas in place of beam >>>>>>>>>>>>>>> schemas entirely in an arrow thread [1]. The biggest reason not >>>>>>>>>>>>>>> to this was >>>>>>>>>>>>>>> that we wanted to have a type for large iterables in beam >>>>>>>>>>>>>>> schemas. But >>>>>>>>>>>>>>> given that large iterables aren't currently implemented, beam >>>>>>>>>>>>>>> schemas look >>>>>>>>>>>>>>> very similar to arrow schemas. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think it makes sense to take inspiration from arrow >>>>>>>>>>>>>>> schemas where possible, and maybe even copy them outright. >>>>>>>>>>>>>>> Arrow already >>>>>>>>>>>>>>> has a portable (flatbuffers) schema representation [2], and >>>>>>>>>>>>>>> implementations >>>>>>>>>>>>>>> for it in many languages that we may be able to re-use as we >>>>>>>>>>>>>>> bring schemas >>>>>>>>>>>>>>> to more SDKs (the project has Python and Go implementations). >>>>>>>>>>>>>>> There are a >>>>>>>>>>>>>>> couple of concepts in Arrow schemas that are specific for the >>>>>>>>>>>>>>> format and >>>>>>>>>>>>>>> wouldn't make sense for us, (fields can indicate whether or not >>>>>>>>>>>>>>> they are >>>>>>>>>>>>>>> dictionary encoded, and the schema has an endianness field), >>>>>>>>>>>>>>> but if you >>>>>>>>>>>>>>> drop those concepts the arrow spec looks pretty similar to the >>>>>>>>>>>>>>> beam proto >>>>>>>>>>>>>>> spec. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> FWIW I left a blank section in the doc for filling out what >>>>>>>>>>>>>> the differences are and why, and conversely what the interop >>>>>>>>>>>>>> opportunities >>>>>>>>>>>>>> may be. Such sections are some of my favorite sections of design >>>>>>>>>>>>>> docs. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Kenn >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brian >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E >>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *From: *Robert Bradshaw <[email protected]> >>>>>>>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM >>>>>>>>>>>>>>> *To: *dev >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> From: Reuven Lax <[email protected]> >>>>>>>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM >>>>>>>>>>>>>>>> To: dev >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > Also in the future we might be able to do optimizations >>>>>>>>>>>>>>>> at the runner level if at the portability layer we understood >>>>>>>>>>>>>>>> schemes >>>>>>>>>>>>>>>> instead of just raw coders. This could be things like only >>>>>>>>>>>>>>>> parsing a subset >>>>>>>>>>>>>>>> of a row (if we know only a few fields are accessed) or using >>>>>>>>>>>>>>>> a columnar >>>>>>>>>>>>>>>> data structure like Arrow to encode batches of rows across >>>>>>>>>>>>>>>> portability. >>>>>>>>>>>>>>>> This doesn't affect data semantics of course, but having a >>>>>>>>>>>>>>>> richer, >>>>>>>>>>>>>>>> more-expressive type system opens up other opportunities. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> But we could do all of that with a RowCoder we understood >>>>>>>>>>>>>>>> to designate >>>>>>>>>>>>>>>> the type(s), right? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space of >>>>>>>>>>>>>>>> Coders with >>>>>>>>>>>>>>>> >> the addition of a RowCoder and the ability to >>>>>>>>>>>>>>>> materialize to something >>>>>>>>>>>>>>>> >> other than bytes, right? (Perhaps I'm missing something >>>>>>>>>>>>>>>> big here...) >>>>>>>>>>>>>>>> >> This may make a backwards-compatible transition easier. >>>>>>>>>>>>>>>> (SDK-side, the >>>>>>>>>>>>>>>> >> ability to reason about and operate on such types is of >>>>>>>>>>>>>>>> course much >>>>>>>>>>>>>>>> >> richer than anything Coders offer right now.) >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> From: Reuven Lax <[email protected]> >>>>>>>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM >>>>>>>>>>>>>>>> >> To: dev >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> > FYI I can imagine a world in which we have no coders. >>>>>>>>>>>>>>>> We could define the entire model on top of schemas. Today's >>>>>>>>>>>>>>>> "Coder" is >>>>>>>>>>>>>>>> completely equivalent to a single-field schema with a >>>>>>>>>>>>>>>> logical-type field >>>>>>>>>>>>>>>> (actually the latter is slightly more expressive as you aren't >>>>>>>>>>>>>>>> forced to >>>>>>>>>>>>>>>> serialize into bytes). >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > Due to compatibility constraints and the effort that >>>>>>>>>>>>>>>> would be involved in such a change, I think the practical >>>>>>>>>>>>>>>> decision should >>>>>>>>>>>>>>>> be for schemas and coders to coexist for the time being. >>>>>>>>>>>>>>>> However when we >>>>>>>>>>>>>>>> start planning Beam 3.0, deprecating coders is something I >>>>>>>>>>>>>>>> would like to >>>>>>>>>>>>>>>> suggest. >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> From: Kenneth Knowles <[email protected]> >>>>>>>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM >>>>>>>>>>>>>>>> >> >> To: dev >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> > This is a huge development. Top posting because I >>>>>>>>>>>>>>>> can be more compact. >>>>>>>>>>>>>>>> >> >> > >>>>>>>>>>>>>>>> >> >> > I really think after the initial idea converges >>>>>>>>>>>>>>>> this needs a design doc with goals and alternatives. It is an >>>>>>>>>>>>>>>> extraordinarily consequential model change. So in the spirit >>>>>>>>>>>>>>>> of doing the >>>>>>>>>>>>>>>> work / bias towards action, I created a quick draft at >>>>>>>>>>>>>>>> https://s.apache.org/beam-schemas and added everyone on >>>>>>>>>>>>>>>> this thread as editors. I am still in the process of writing >>>>>>>>>>>>>>>> this to match >>>>>>>>>>>>>>>> the thread. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> Thanks! Added some comments there. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use >>>>>>>>>>>>>>>> logcial types to represent nanos the same way Java and proto >>>>>>>>>>>>>>>> do. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> As per the other discussion, I'm unsure the value in >>>>>>>>>>>>>>>> supporting >>>>>>>>>>>>>>>> >> >> multiple timestamp resolutions is high enough to >>>>>>>>>>>>>>>> outweigh the cost. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> > *Why multiple int types?* The domain of values for >>>>>>>>>>>>>>>> these types are different. For a language with one "int" or >>>>>>>>>>>>>>>> "number" type, >>>>>>>>>>>>>>>> that's another domain of values. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> What is the value in having different domains? If >>>>>>>>>>>>>>>> your data has a >>>>>>>>>>>>>>>> >> >> natural domain, chances are it doesn't line up >>>>>>>>>>>>>>>> exactly with one of >>>>>>>>>>>>>>>> >> >> these. I guess it's for languages whose types have >>>>>>>>>>>>>>>> specific domains? >>>>>>>>>>>>>>>> >> >> (There's also compactness in representation, encoded >>>>>>>>>>>>>>>> and in-memory, >>>>>>>>>>>>>>>> >> >> though I'm not sure that's high.) >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability >>>>>>>>>>>>>>>> to take this path is Paramount. So tying it directly to a >>>>>>>>>>>>>>>> row-oriented >>>>>>>>>>>>>>>> coder seems counterproductive. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> I don't think Coders are necessarily row-oriented. >>>>>>>>>>>>>>>> They are, however, >>>>>>>>>>>>>>>> >> >> bytes-oriented. (Perhaps they need not be.) There >>>>>>>>>>>>>>>> seems to be a lot of >>>>>>>>>>>>>>>> >> >> overlap between what Coders express in terms of >>>>>>>>>>>>>>>> element typing >>>>>>>>>>>>>>>> >> >> information and what Schemas express, and I'd rather >>>>>>>>>>>>>>>> have one concept >>>>>>>>>>>>>>>> >> >> if possible. Or have a clear division of >>>>>>>>>>>>>>>> responsibilities. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued >>>>>>>>>>>>>>>> map or large-iterable-valued map? (honest question, not >>>>>>>>>>>>>>>> rhetorical) >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> Multimap has a different notion of what it means to >>>>>>>>>>>>>>>> contain a value, >>>>>>>>>>>>>>>> >> >> can handle (unordered) unions of non-disjoint keys, >>>>>>>>>>>>>>>> etc. Maybe this >>>>>>>>>>>>>>>> >> >> isn't worth a new primitive type. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both. >>>>>>>>>>>>>>>> The core types are fundamental enough they should never really >>>>>>>>>>>>>>>> change - >>>>>>>>>>>>>>>> after all, proto, thrift, avro, arrow, have addressed this >>>>>>>>>>>>>>>> (not to mention >>>>>>>>>>>>>>>> most programming languages). Maybe additions once every few >>>>>>>>>>>>>>>> years. I prefer >>>>>>>>>>>>>>>> the smallest intersection of these schema languages. A oneof >>>>>>>>>>>>>>>> is more clear, >>>>>>>>>>>>>>>> while URN emphasizes the similarity of built-in and logical >>>>>>>>>>>>>>>> types. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level >>>>>>>>>>>>>>>> primitive/logical >>>>>>>>>>>>>>>> >> >> type in any of these other systems? I have a bias >>>>>>>>>>>>>>>> towards all types >>>>>>>>>>>>>>>> >> >> being on the same footing unless there is compelling >>>>>>>>>>>>>>>> reason to divide >>>>>>>>>>>>>>>> >> >> things into primitive/use-defined ones. >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> Here it seems like the most essential value of the >>>>>>>>>>>>>>>> primitive type set >>>>>>>>>>>>>>>> >> >> is to describe the underlying representation, for >>>>>>>>>>>>>>>> encoding elements in >>>>>>>>>>>>>>>> >> >> a variety of ways (notably columnar, but also >>>>>>>>>>>>>>>> interfacing with other >>>>>>>>>>>>>>>> >> >> external systems like IOs). Perhaps, rather than the >>>>>>>>>>>>>>>> previous >>>>>>>>>>>>>>>> >> >> suggestion of making everything a logical of bytes, >>>>>>>>>>>>>>>> this could be made >>>>>>>>>>>>>>>> >> >> clear by still making everything a logical type, but >>>>>>>>>>>>>>>> renaming >>>>>>>>>>>>>>>> >> >> "TypeName" to Representation. There would be URNs >>>>>>>>>>>>>>>> (typically with >>>>>>>>>>>>>>>> >> >> empty payloads) for the various primitive types >>>>>>>>>>>>>>>> (whose mapping to >>>>>>>>>>>>>>>> >> >> their representations would be the identity). >>>>>>>>>>>>>>>> >> >> >>>>>>>>>>>>>>>> >> >> - Robert >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
