My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is a better name for DATETIME) a first-class citizen, we'll get *inconsistencies* between the different portability implementations. The same holds true for DECIMAL and DURATION. If we aren't giving pipeline developers a consistent way of working with timestamps, we're going to generate a lot of frustration.
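To make the concern concrete, here is a rough sketch (purely illustrative, not the actual Beam API; the URN and all type names below are made up) of how a TIMESTAMP logical type could be pinned to a URN and a fixed base representation, so that every SDK converts to and from the same backing value:

    // Illustrative only: a logical type identified by a URN and backed by a
    // fixed primitive representation, so every SDK maps the same encoded
    // value to the same point in time.
    enum PrimitiveType { INT32, INT64, STRING, BYTES }

    interface LogicalType<T> {
      String urn();                  // e.g. "beam:logical_type:timestamp_micros:v1" (made up)
      PrimitiveType baseType();      // the primitive type backing the value
      T fromBase(Object baseValue);  // decode from the base representation
      Object toBase(T value);        // encode to the base representation
    }

    // A microsecond-precision timestamp backed by a single INT64
    // (microseconds since the Unix epoch).
    class TimestampMicros implements LogicalType<java.time.Instant> {
      public String urn() { return "beam:logical_type:timestamp_micros:v1"; }
      public PrimitiveType baseType() { return PrimitiveType.INT64; }
      public java.time.Instant fromBase(Object baseValue) {
        long micros = (Long) baseValue;
        return java.time.Instant.ofEpochSecond(
            Math.floorDiv(micros, 1_000_000L),
            Math.floorMod(micros, 1_000_000L) * 1_000L);
      }
      public Object toBase(java.time.Instant value) {
        return value.getEpochSecond() * 1_000_000L + value.getNano() / 1_000L;
      }
    }

If every SDK agrees on the URN and on the INT64-micros representation, a timestamp written by the Java SDK and read by the Python SDK means the same instant. That is the kind of consistency I'm worried about losing if this stays SDK-specific.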
I always said "TIMESTAMP's are the nail in the coffin of data engineers"... For the rest It's a bit too early to make a lot of informed input here, as I just started working with schema's for my protobuf implementation. _/ _/ Alex Van Boxel On Thu, May 9, 2019 at 10:05 AM Kenneth Knowles <[email protected]> wrote: > This is a huge development. Top posting because I can be more compact. > > I really think after the initial idea converges this needs a design doc > with goals and alternatives. It is an extraordinarily consequential model > change. So in the spirit of doing the work / bias towards action, I created > a quick draft at https://s.apache.org/beam-schemas and added everyone on > this thread as editors. I am still in the process of writing this to match > the thread. > > *Multiple timestamp resolutions*: you can use logcial types to represent > nanos the same way Java and proto do. > > *Why multiple int types?* The domain of values for these types are > different. For a language with one "int" or "number" type, that's another > domain of values. > > *Columnar/Arrow*: making sure we unlock the ability to take this path is > Paramount. So tying it directly to a row-oriented coder seems > counterproductive. > > *Nullable/optional*: optional as it exists in Java, Haskell, Scala, ocaml, > etc, is strictly more expressive than the billion dollar mistake. > Nullability of a field is different and less expressive than nullability of > a type. > > *Union types*: tagged disjoint unions and oneof are the most useful form > of union. Embedding them into a relational model you get something like > proto oneof. Not too hard to add later. > > *Multimap*: what does it add over an array-valued map or > large-iterable-valued map? (honest question, not rhetorical) > > *id* is a loaded term in other places in the model. I would call it > something else. > > *URN/enum for type names*: I see the case for both. The core types are > fundamental enough they should never really change - after all, proto, > thrift, avro, arrow, have addressed this (not to mention most programming > languages). Maybe additions once every few years. I prefer the smallest > intersection of these schema languages. A oneof is more clear, while URN > emphasizes the similarity of built-in and logical types. > > *Multiple encodings of a value*: I actually think this is a benefit. > There's a lot to unpack here. > > *Language specifics*: the design doc should describe the domain of values, > and this should go in the core docs. Then for each SDK it should explicitly > say what language type (or types?) the values are embedded in. Just like > protos language guides. > > Kenn > > *From: *Udi Meiri <[email protected]> > *Date: *Wed, May 8, 2019, 18:48 > *To: * <[email protected]> > > From a Python type hints perspective, how do schemas fit? Type hints are >> currently used to determine which coder to use. >> It seems that given a schema field, it would be useful to be able to >> convert it to a coder (using URNs?), and to convert the coder into a typing >> type. >> This would allow for pipeline-construction-time type compatibility checks. >> >> Some questions: >> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to >> maintain type fidelity when writing back? If so, what happens in languages >> that only have "int"? >> 2. What is encoding_position? How does it differ from id (which is also a >> position)? >> 3. When are schema protos constructed? Are they available during pipeline >> construction or afterwards? >> 4. 
Once data is read into a Beam pipeline and a schema inferred, do we >> maintain the schema types throughout the pipeline or use language-local >> types? >> >> >> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]> >> wrote: >> >>> From: Reuven Lax <[email protected]> >>> Date: Wed, May 8, 2019 at 10:36 PM >>> To: dev >>> >>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >> >>> >> Very excited to see this. In particular, I think this will be very >>> >> useful for cross-language pipelines (not just SQL, but also for >>> >> describing non-trivial data (e.g. for source and sink reuse). >>> >> >>> >> The proto specification makes sense to me. The only thing that looks >>> >> like it's missing (other than possibly iterable, for arbitrarily-large >>> >> support) is multimap. Another basic type, should we want to support >>> >> it, is union (though this of course can get messy). >>> > >>> > multimap is an interesting suggestion. Do you have a use case in mind? >>> > >>> > union (or oneof) is also a good suggestion. There are good use cases >>> for this, but this is a more fundamental change. >>> >>> No specific usecase, they just seemed to round out the options. >>> >>> >> I'm curious what the rational was for going with a oneof for type_info >>> >> rather than an repeated components like we do with coders. >>> > >>> > No strong reason. Do you think repeated components is better than >>> oneof? >>> >>> It's more consistent with how we currently do coders (which has pros and >>> cons). >>> >>> >> Removing DATETIME as a logical coder on top of INT64 may cause issues >>> >> of insufficient resolution and/or timespan. Similarly with DECIMAL (or >>> >> would it be backed by string?) >>> > >>> > There could be multiple TIMESTAMP types for different resolutions, and >>> they don't all need the same backing field type. E.g. the backing type for >>> nanoseconds could by Row(INT64, INT64), or it could just be a byte array. >>> >>> Hmm.... What would the value be in supporting different types of >>> timestamps? Would all SDKs have to support all of them? Can one >>> compare, take differences, etc. across timestamp types? (As Luke >>> points out, the other conversation on timestamps is likely relevant >>> here as well.) >>> >>> >> The biggest question, as far as portability is concerned at least, is >>> >> the notion of logical types. serialized_class is clearly not portable, >>> >> and I also think we'll want a way to share semantic meaning across >>> >> SDKs (especially if things like dates become logical types). Perhaps >>> >> URNs (+payloads) would be a better fit here? >>> > >>> > Yes, URN + payload is probably the better fit for portability. >>> > >>> >> Taking a step back, I think it's worth asking why we have different >>> >> types, rather than simply making everything a LogicalType of bytes >>> >> (aka coder). Other than encoding format, the answer I can come up with >>> >> is that the type decides the kinds of operations that can be done on >>> >> it, e.g. does it support comparison? Arithmetic? Containment? >>> >> Higher-level date operations? Perhaps this should be used to guide the >>> >> set of types we provide. >>> > >>> > Also even though we could make everything a LogicalType (though at >>> least byte array would have to stay primitive), I think it's useful to >>> have a slightly larger set of primitive types. 
It makes things easier to >>> understand and debug, and it makes it simpler for the various SDKs to map >>> them to their types (e.g. mapping to POJOs). >>> >>> This would be the case if one didn't have LogicalType at all, but >>> once one introduces that one now has this more complicated two-level >>> hierarchy of types which doesn't seem simpler to me. >>> >>> I'm trying to understand what information Schema encodes that a >>> NamedTupleCoder (or RowCoder) would/could not. (Coders have the >>> disadvantage that there are multiple encodings of a single value, e.g. >>> BigEndian vs. VarInt, but if we have multiple resolutions of timestamp >>> that would still seem to be an issue. Possibly another advantage is >>> encoding into non-record-oriented formats, e.g. Parquet or Arrow, that >>> have a set of primitives.) >>> >>
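P.S. On the serialized_class vs. URN point in the quoted thread: a minimal sketch of the shape a portable logical-type reference could take (again, all names here are hypothetical, just to illustrate the idea):

    // Minimal sketch: a logical-type reference that carries a URN plus an
    // opaque payload instead of a serialized Java class, so non-Java SDKs
    // can either interpret it or pass it through. Hypothetical names only.
    enum RepresentationType { BYTE, INT16, INT32, INT64, STRING, BYTES, ROW }

    final class LogicalTypeRef {
      final String urn;                         // e.g. "beam:logical_type:decimal:v1" (made up)
      final byte[] payload;                     // type arguments, e.g. precision and scale
      final RepresentationType representation;  // primitive type used on the wire

      LogicalTypeRef(String urn, byte[] payload, RepresentationType representation) {
        this.urn = urn;
        this.payload = payload;
        this.representation = representation;
      }
    }

An SDK that doesn't recognize the URN can still decode and re-encode the value through the primitive representation; an SDK that does recognize it can surface a richer language type (BigDecimal, Instant, ...). That would keep behaviour consistent across the portability implementations without every SDK having to know every type.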
