On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel <[email protected]> wrote:
> My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is
> a better name for DATETIME) a first-class citizen, we will get
> *inconsistencies* between the different portability implementations. The
> same holds true for DECIMAL and DURATION. If we aren't giving pipeline
> developers a consistent way of working with timestamps, we're going to
> generate a lot of frustration.

This is a fair concern. However, logical types have unique IDs/URNs, so we
can still make TIMESTAMP a first-class citizen. The only difference is that
it will not be considered a primitive type.

> I always said "TIMESTAMPs are the nail in the coffin of data
> engineers"...
>
> For the rest, it's a bit too early to give a lot of informed input here,
> as I just started working with schemas for my protobuf implementation.
>
> _/
> _/ Alex Van Boxel
>
>
> On Thu, May 9, 2019 at 10:05 AM Kenneth Knowles <[email protected]> wrote:
>
>> This is a huge development. Top-posting because I can be more compact.
>>
>> I really think that after the initial idea converges, this needs a
>> design doc with goals and alternatives. It is an extraordinarily
>> consequential model change. So in the spirit of doing the work / bias
>> towards action, I created a quick draft at
>> https://s.apache.org/beam-schemas and added everyone on this thread as
>> editors. I am still in the process of writing this to match the thread.
>>
>> *Multiple timestamp resolutions*: you can use logical types to
>> represent nanos the same way Java and proto do.
>>
>> *Why multiple int types?* The domains of values for these types are
>> different. For a language with one "int" or "number" type, that's just
>> another domain of values.
>>
>> *Columnar/Arrow*: making sure we unlock the ability to take this path
>> is paramount. So tying it directly to a row-oriented coder seems
>> counterproductive.
>>
>> *Nullable/optional*: optional as it exists in Java, Haskell, Scala,
>> OCaml, etc. is strictly more expressive than the billion-dollar
>> mistake. Nullability of a field is different from, and less expressive
>> than, nullability of a type.
>>
>> *Union types*: tagged disjoint unions and oneof are the most useful
>> form of union. Embedding them into a relational model, you get
>> something like proto's oneof. Not too hard to add later.
>>
>> *Multimap*: what does it add over an array-valued map or a
>> large-iterable-valued map? (Honest question, not rhetorical.)
>>
>> *id* is a loaded term in other places in the model. I would call it
>> something else.
>>
>> *URN/enum for type names*: I see the case for both. The core types are
>> fundamental enough that they should never really change - after all,
>> proto, thrift, avro, and arrow have addressed this (not to mention most
>> programming languages). Maybe additions once every few years. I prefer
>> the smallest intersection of these schema languages. A oneof is
>> clearer, while a URN emphasizes the similarity of built-in and logical
>> types.
>>
>> *Multiple encodings of a value*: I actually think this is a benefit.
>> There's a lot to unpack here.
>>
>> *Language specifics*: the design doc should describe the domain of
>> values, and this should go in the core docs. Then for each SDK it
>> should explicitly say which language type (or types?) the values are
>> embedded in, just like proto's language guides.
>>
>> Kenn
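To make the logical-type approach concrete, here is a rough Python sketch
of what a URN-identified nanosecond timestamp could look like in an SDK.
Every name in it (the URN string, the classes, the conversion methods) is
illustrative only, not part of the actual proposal:

from typing import NamedTuple, Tuple

# Hypothetical URN; the real identifier would be fixed by the design doc.
NANOS_INSTANT_URN = "beam:logical_type:nanos_instant:v1"

class NanosInstant(NamedTuple):
    """Nanosecond-resolution timestamp, as Java and proto represent it."""
    seconds: int  # backed by an INT64 schema field
    nanos: int    # backed by an INT64 (or INT32) schema field

class NanosInstantLogicalType:
    """A logical type = a URN plus a base type plus the two conversions."""
    urn = NANOS_INSTANT_URN

    def to_base_type(self, value: NanosInstant) -> Tuple[int, int]:
        # Base representation: conceptually Row(INT64, INT64).
        return (value.seconds, value.nanos)

    def to_language_type(self, base: Tuple[int, int]) -> NanosInstant:
        return NanosInstant(*base)

Because every SDK keys off the same URN, TIMESTAMP stays first-class for
pipeline authors even though the proto doesn't treat it as primitive.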
>>
>> *From:* Udi Meiri <[email protected]>
>> *Date:* Wed, May 8, 2019, 18:48
>> *To:* <[email protected]>
>>
>>> From a Python type hints perspective, how do schemas fit? Type hints
>>> are currently used to determine which coder to use.
>>> It seems that given a schema field, it would be useful to be able to
>>> convert it to a coder (using URNs?), and to convert the coder into a
>>> typing type.
>>> This would allow for pipeline-construction-time type compatibility
>>> checks.
>>>
>>> Some questions:
>>> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to
>>> maintain type fidelity when writing back? If so, what happens in
>>> languages that only have "int"?
>>> 2. What is encoding_position? How does it differ from id (which is
>>> also a position)?
>>> 3. When are schema protos constructed? Are they available during
>>> pipeline construction, or afterwards?
>>> 4. Once data is read into a Beam pipeline and a schema is inferred, do
>>> we maintain the schema types throughout the pipeline, or use
>>> language-local types?
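On question 1: as Kenn notes above, the four int types have different
domains of values. In a language with a single "int" they would presumably
all surface as that one type hint, with the width kept in the schema rather
than in the hint. A sketch of the kind of mapping this implies (the dict
and helper are hypothetical; the type names follow the proto draft):

# Illustrative mapping from schema atomic types to Python typing types.
# All four integer widths collapse to Python's single "int"; the width
# lives in the schema, not in the language-level type hint.
_ATOMIC_TO_TYPING = {
    "BYTE": int,
    "INT16": int,
    "INT32": int,
    "INT64": int,
    "FLOAT": float,
    "DOUBLE": float,
    "STRING": str,
    "BOOLEAN": bool,
    "BYTES": bytes,
}

def typing_type_for(atomic_type: str) -> type:
    # Hypothetical helper: schema field type name -> Python typing type.
    return _ATOMIC_TO_TYPING[atomic_type]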
>>>
>>> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> From: Reuven Lax <[email protected]>
>>>> Date: Wed, May 8, 2019 at 10:36 PM
>>>> To: dev
>>>>
>>>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> Very excited to see this. In particular, I think this will be very
>>>> >> useful for cross-language pipelines: not just SQL, but also for
>>>> >> describing non-trivial data (e.g. for source and sink reuse).
>>>> >>
>>>> >> The proto specification makes sense to me. The only thing that
>>>> >> looks like it's missing (other than possibly iterable, for
>>>> >> arbitrarily-large support) is multimap. Another basic type, should
>>>> >> we want to support it, is union (though this of course can get
>>>> >> messy).
>>>> >
>>>> > multimap is an interesting suggestion. Do you have a use case in
>>>> > mind?
>>>> >
>>>> > union (or oneof) is also a good suggestion. There are good use
>>>> > cases for this, but it is a more fundamental change.
>>>>
>>>> No specific use case; they just seemed to round out the options.
>>>>
>>>> >> I'm curious what the rationale was for going with a oneof for
>>>> >> type_info rather than a repeated components field like we do with
>>>> >> coders.
>>>> >
>>>> > No strong reason. Do you think repeated components is better than a
>>>> > oneof?
>>>>
>>>> It's more consistent with how we currently do coders (which has pros
>>>> and cons).
>>>>
>>>> >> Removing DATETIME as a logical coder on top of INT64 may cause
>>>> >> issues of insufficient resolution and/or timespan. Similarly with
>>>> >> DECIMAL (or would it be backed by string?)
>>>> >
>>>> > There could be multiple TIMESTAMP types for different resolutions,
>>>> > and they don't all need the same backing field type. E.g. the
>>>> > backing type for nanoseconds could be Row(INT64, INT64), or it
>>>> > could just be a byte array.
>>>>
>>>> Hmm... what would the value be in supporting different types of
>>>> timestamps? Would all SDKs have to support all of them? Can one
>>>> compare, take differences, etc. across timestamp types? (As Luke
>>>> points out, the other conversation on timestamps is likely relevant
>>>> here as well.)
>>>>
>>>> >> The biggest question, as far as portability is concerned at least,
>>>> >> is the notion of logical types. serialized_class is clearly not
>>>> >> portable, and I also think we'll want a way to share semantic
>>>> >> meaning across SDKs (especially if things like dates become
>>>> >> logical types). Perhaps URNs (+ payloads) would be a better fit
>>>> >> here?
>>>> >
>>>> > Yes, URN + payload is probably the better fit for portability.
>>>> >
>>>> >> Taking a step back, I think it's worth asking why we have
>>>> >> different types, rather than simply making everything a
>>>> >> LogicalType of bytes (aka a coder). Other than encoding format,
>>>> >> the answer I can come up with is that the type decides the kinds
>>>> >> of operations that can be done on it, e.g. does it support
>>>> >> comparison? Arithmetic? Containment? Higher-level date operations?
>>>> >> Perhaps this should be used to guide the set of types we provide.
>>>> >
>>>> > Also, even though we could make everything a LogicalType (though at
>>>> > least byte array would have to stay primitive), I think it's useful
>>>> > to have a slightly larger set of primitive types. It makes things
>>>> > easier to understand and debug, and it makes it simpler for the
>>>> > various SDKs to map them to their types (e.g. mapping to POJOs).
>>>>
>>>> That would be the case if one didn't have LogicalType at all, but
>>>> once one introduces it, one now has this more complicated two-level
>>>> hierarchy of types, which doesn't seem simpler to me.
>>>>
>>>> I'm trying to understand what information a Schema encodes that a
>>>> NamedTupleCoder (or RowCoder) would or could not. (Coders have the
>>>> disadvantage that there are multiple encodings of a single value,
>>>> e.g. BigEndian vs. VarInt, but if we have multiple resolutions of
>>>> timestamp that would still seem to be an issue. Possibly another
>>>> advantage is encoding into non-record-oriented formats, e.g. Parquet
>>>> or Arrow, that have their own sets of primitives.)
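On the "multiple encodings of a single value" point in that last
paragraph, a small sketch of why the schema/coder distinction matters: the
same INT64 value is one element of the schema's domain of values but has
many possible byte encodings. The two encoders below are standalone
illustrations, not the actual Beam coders:

import struct

def encode_big_endian_int64(n: int) -> bytes:
    # Fixed-width big-endian, in the style of Beam's BigEndianLongCoder.
    return struct.pack(">q", n)

def encode_varint(n: int) -> bytes:
    # LEB128-style varint as used by proto; sketch for non-negative
    # values only.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# One schema-level value, two wire encodings:
assert encode_big_endian_int64(300) == b"\x00\x00\x00\x00\x00\x00\x01\x2c"
assert encode_varint(300) == b"\xac\x02"

A schema only pins down the domain of values; which of these byte
representations crosses the wire is a separate, coder-level choice.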
