OK, fair. This parallels how timestamps are implemented in protobuf. Then it's important (and I'll add this to the design doc) that we have a list of standard logical types.
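For example, a standard TIMESTAMP logical type could mirror proto's well-known Timestamp message: a stable identifier plus a seconds/nanos base representation. A minimal sketch (the URN string and all class/method names here are invented for illustration, not an existing Beam API):

```java
import java.time.Instant;

/**
 * Hypothetical sketch of a standard TIMESTAMP logical type, mirroring
 * proto's well-known Timestamp message (seconds + nanos). The URN string
 * and all names here are illustrative, not an existing Beam API.
 */
public final class TimestampLogicalType {

  // A stable URN identifies the logical type across SDKs and runners.
  public static final String URN = "beam:logical_type:timestamp_nanos:v1";

  // Base representation: two integer fields, like proto's Timestamp.
  public static final class Base {
    public final long seconds; // INT64: seconds since the epoch
    public final int nanos;    // INT32: sub-second nanoseconds

    public Base(long seconds, int nanos) {
      this.seconds = seconds;
      this.nanos = nanos;
    }
  }

  // Convert from the language type to the base representation...
  public static Base toBase(Instant t) {
    return new Base(t.getEpochSecond(), t.getNano());
  }

  // ...and back. SDKs agree on the base; each maps to its own type.
  public static Instant fromBase(Base b) {
    return Instant.ofEpochSecond(b.seconds, b.nanos);
  }
}
```

If every SDK agrees on the URN and the base representation, the timestamp stays consistent across portability implementations without being a primitive.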
_/
_/ Alex Van Boxel


On Thu, May 9, 2019 at 4:11 PM Reuven Lax <[email protected]> wrote:

> On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel <[email protected]> wrote:
>
>> My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is
>> a better name for DATETIME) a first-class citizen, we'll get
>> *inconsistencies* between the different portability implementations. The
>> same holds true for DECIMAL and DURATION. If we don't give pipeline
>> developers a consistent way of working with timestamps, we're going to
>> generate a lot of frustration.
>
> This is a fair concern. However, logical types have unique ids/urns, so
> we can still make TIMESTAMP a first-class citizen. The only difference is
> that it will not be considered a primitive type.
>
>> I always said "TIMESTAMPs are the nail in the coffin of data
>> engineers"...
>>
>> For the rest, it's a bit too early for me to give much informed input
>> here, as I just started working with schemas for my protobuf
>> implementation.
>>
>> _/
>> _/ Alex Van Boxel
>>
>>
>> On Thu, May 9, 2019 at 10:05 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> This is a huge development. Top-posting because I can be more compact.
>>>
>>> I really think that after the initial idea converges this needs a
>>> design doc with goals and alternatives. It is an extraordinarily
>>> consequential model change. So in the spirit of doing the work / bias
>>> towards action, I created a quick draft at
>>> https://s.apache.org/beam-schemas and added everyone on this thread as
>>> editors. I am still in the process of writing this to match the thread.
>>>
>>> *Multiple timestamp resolutions*: you can use logical types to
>>> represent nanos the same way Java and proto do.
>>>
>>> *Why multiple int types?* The domains of values for these types are
>>> different. For a language with one "int" or "number" type, that's
>>> another domain of values.
>>>
>>> *Columnar/Arrow*: making sure we unlock the ability to take this path
>>> is paramount. So tying it directly to a row-oriented coder seems
>>> counterproductive.
>>>
>>> *Nullable/optional*: optional as it exists in Java, Haskell, Scala,
>>> OCaml, etc. is strictly more expressive than the billion-dollar
>>> mistake. Nullability of a field is different from, and less expressive
>>> than, nullability of a type.
>>>
>>> *Union types*: tagged disjoint unions and oneof are the most useful
>>> form of union. Embedding them into a relational model, you get
>>> something like proto's oneof. Not too hard to add later.
>>>
>>> *Multimap*: what does it add over an array-valued map or
>>> large-iterable-valued map? (honest question, not rhetorical)
>>>
>>> *id* is a loaded term in other places in the model. I would call it
>>> something else.
>>>
>>> *URN/enum for type names*: I see the case for both. The core types are
>>> fundamental enough that they should never really change - after all,
>>> proto, thrift, avro, and arrow have addressed this (not to mention most
>>> programming languages). Maybe additions once every few years. I prefer
>>> the smallest intersection of these schema languages. A oneof is
>>> clearer, while a URN emphasizes the similarity of built-in and logical
>>> types.
>>>
>>> *Multiple encodings of a value*: I actually think this is a benefit.
>>> There's a lot to unpack here.
>>>
>>> *Language specifics*: the design doc should describe the domain of
>>> values, and this should go in the core docs. Then for each SDK it
>>> should explicitly say which language type (or types?) the values are
>>> embedded in. Just like proto's language guides.
>>>
>>> Kenn
>>>
>>> *From: *Udi Meiri <[email protected]>
>>> *Date: *Wed, May 8, 2019, 18:48
>>> *To: * <[email protected]>
>>>
>>>> From a Python type hints perspective, how do schemas fit? Type hints
>>>> are currently used to determine which coder to use. It seems that,
>>>> given a schema field, it would be useful to be able to convert it to
>>>> a coder (using URNs?), and to convert the coder into a typing type.
>>>> This would allow for pipeline-construction-time type compatibility
>>>> checks.
>>>>
>>>> Some questions:
>>>> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to
>>>> maintain type fidelity when writing back? If so, what happens in
>>>> languages that only have "int"?
>>>> 2. What is encoding_position? How does it differ from id (which is
>>>> also a position)?
>>>> 3. When are schema protos constructed? Are they available during
>>>> pipeline construction or afterwards?
>>>> 4. Once data is read into a Beam pipeline and a schema inferred, do
>>>> we maintain the schema types throughout the pipeline or use
>>>> language-local types?
>>>>
>>>>
>>>> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> From: Reuven Lax <[email protected]>
>>>>> Date: Wed, May 8, 2019 at 10:36 PM
>>>>> To: dev
>>>>>
>>>>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> Very excited to see this. In particular, I think this will be
>>>>> >> very useful for cross-language pipelines (not just SQL, but also
>>>>> >> for describing non-trivial data, e.g. for source and sink reuse).
>>>>> >>
>>>>> >> The proto specification makes sense to me. The only thing that
>>>>> >> looks like it's missing (other than possibly iterable, for
>>>>> >> arbitrarily-large support) is multimap. Another basic type,
>>>>> >> should we want to support it, is union (though this of course can
>>>>> >> get messy).
>>>>> >
>>>>> > multimap is an interesting suggestion. Do you have a use case in
>>>>> > mind?
>>>>> >
>>>>> > union (or oneof) is also a good suggestion. There are good use
>>>>> > cases for this, but this is a more fundamental change.
>>>>>
>>>>> No specific use case; they just seemed to round out the options.
>>>>>
>>>>> >> I'm curious what the rationale was for going with a oneof for
>>>>> >> type_info rather than repeated components like we do with coders.
>>>>> >
>>>>> > No strong reason. Do you think repeated components is better than
>>>>> > oneof?
>>>>>
>>>>> It's more consistent with how we currently do coders (which has pros
>>>>> and cons).
>>>>>
>>>>> >> Removing DATETIME as a logical coder on top of INT64 may cause
>>>>> >> issues of insufficient resolution and/or timespan. Similarly with
>>>>> >> DECIMAL (or would it be backed by string?)
>>>>> >
>>>>> > There could be multiple TIMESTAMP types for different resolutions,
>>>>> > and they don't all need the same backing field type. E.g. the
>>>>> > backing type for nanoseconds could be Row(INT64, INT64), or it
>>>>> > could just be a byte array.
>>>>>
>>>>> Hmm... What would the value be in supporting different types of
>>>>> timestamps? Would all SDKs have to support all of them? Can one
>>>>> compare, take differences, etc. across timestamp types? (As Luke
>>>>> points out, the other conversation on timestamps is likely relevant
>>>>> here as well.)
>>>>>
>>>>> >> The biggest question, as far as portability is concerned at
>>>>> >> least, is the notion of logical types. serialized_class is
>>>>> >> clearly not portable, and I also think we'll want a way to share
>>>>> >> semantic meaning across SDKs (especially if things like dates
>>>>> >> become logical types). Perhaps URNs (+payloads) would be a better
>>>>> >> fit here?
>>>>> >
>>>>> > Yes, URN + payload is probably the better fit for portability.
>>>>> >
>>>>> >> Taking a step back, I think it's worth asking why we have
>>>>> >> different types, rather than simply making everything a
>>>>> >> LogicalType of bytes (aka coder). Other than encoding format, the
>>>>> >> answer I can come up with is that the type decides the kinds of
>>>>> >> operations that can be done on it, e.g. does it support
>>>>> >> comparison? Arithmetic? Containment? Higher-level date
>>>>> >> operations? Perhaps this should be used to guide the set of types
>>>>> >> we provide.
>>>>> >
>>>>> > Also, even though we could make everything a LogicalType (though
>>>>> > at least byte array would have to stay primitive), I think it's
>>>>> > useful to have a slightly larger set of primitive types. It makes
>>>>> > things easier to understand and debug, and it makes it simpler for
>>>>> > the various SDKs to map them to their types (e.g. mapping to
>>>>> > POJOs).
>>>>>
>>>>> This would be the case if one didn't have LogicalType at all, but
>>>>> once one introduces that, one now has this more complicated
>>>>> two-level hierarchy of types, which doesn't seem simpler to me.
>>>>>
>>>>> I'm trying to understand what information a Schema encodes that a
>>>>> NamedTupleCoder (or RowCoder) would/could not. (Coders have the
>>>>> disadvantage that there are multiple encodings of a single value,
>>>>> e.g. BigEndian vs. VarInt, but if we have multiple resolutions of
>>>>> timestamp that would still seem to be an issue. Possibly another
>>>>> advantage is encoding into non-record-oriented formats, e.g. Parquet
>>>>> or Arrow, that have a set of primitives.)
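A few sketches to make points from the thread above concrete. First, Kenn's nullable/optional distinction: field-level nullability gives one extra state per field, while Optional is a type, so it composes and nests. A minimal Java sketch (class and names invented for illustration):

```java
import java.util.Optional;

// Sketch of the nullable-vs-optional distinction (names invented for
// illustration). Field-level nullability cannot nest; Optional can.
public final class NullableVsOptional {

  // Nullability of a *field*: null means "absent". There is no way to
  // express "present, but holding an absent value".
  static String nullableField = null;

  // Nullability (optionality) of a *type*: it nests, so it is strictly
  // more expressive.
  static Optional<Optional<String>> optionalType =
      Optional.of(Optional.empty()); // a present value that is itself absent

  public static void main(String[] args) {
    System.out.println(nullableField == null);           // true: absent
    System.out.println(optionalType.isPresent());        // true: outer present
    System.out.println(optionalType.get().isPresent());  // false: inner absent
  }
}
```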
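Second, Udi's question about the four int types: each one is a distinct domain of values, and each SDK maps them onto its own types. A sketch of one possible mapping (the enum and its names are hypothetical):

```java
// Sketch (names invented) of why four integer types exist: each is a
// distinct domain of values, and each SDK maps them to its own types.
public enum SchemaIntType {
  BYTE(Byte.class),     // values in [-2^7,  2^7)
  INT16(Short.class),   // values in [-2^15, 2^15)
  INT32(Integer.class), // values in [-2^31, 2^31)
  INT64(Long.class);    // values in [-2^63, 2^63)

  public final Class<?> javaType;

  SchemaIntType(Class<?> javaType) {
    this.javaType = javaType;
  }
}
```

A language with a single arbitrary-precision int (e.g. Python) would presumably map all four onto it and range-check when writing back, which is what preserves type fidelity.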
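Third, the URN + payload idea that Robert and Reuven converge on for portable logical types: instead of a serialized Java class, a logical type is referenced by a stable URN plus an opaque, type-specific payload that any SDK can interpret. A sketch (the URN strings and field names are invented):

```java
import java.nio.charset.StandardCharsets;

// Sketch of URN + payload for portable logical types (URN strings and
// field names invented). Any SDK that recognizes the URN can interpret
// the payload; no Java class serialization is involved.
public final class LogicalTypeRef {
  public final String urn;     // identifies the logical type across SDKs
  public final byte[] payload; // type-specific parameters, if any

  public LogicalTypeRef(String urn, byte[] payload) {
    this.urn = urn;
    this.payload = payload;
  }

  public static void main(String[] args) {
    // A parameterized DECIMAL logical type, for example:
    LogicalTypeRef decimal = new LogicalTypeRef(
        "beam:logical_type:decimal:v1",
        "precision=38;scale=9".getBytes(StandardCharsets.UTF_8));
    System.out.println(decimal.urn);
  }
}
```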
