Are you suggesting that schemas become an explicit field on PCollection, or that the coder on PCollections has a well-known schema coder type whose payload contains field names, ids, types, ...? I'm much more for the latter, since it allows for versioning schema representations over time without needing a change to the protos.
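To make the two options concrete, here is a minimal sketch of the "well-known schema coder" idea. All names here (the URN, the dataclasses, the type URNs) are illustrative assumptions, not the actual Beam proto definitions: the point is only that the schema lives in a versioned coder payload, so the schema representation can evolve without touching the PCollection proto itself.

```python
# Hypothetical sketch -- names are illustrative, not the real Beam protos.
from dataclasses import dataclass, field
from typing import List, Optional

# Version lives in the URN, so a v2 payload format needs no proto change.
SCHEMA_CODER_URN = "beam:coder:schema:v1"

@dataclass
class SchemaField:
    name: str
    id: int        # positional id
    type_urn: str  # e.g. a hypothetical "beam:type:int64:v1"

@dataclass
class SchemaPayload:
    fields: List[SchemaField] = field(default_factory=list)

@dataclass
class Coder:
    urn: str
    payload: Optional[SchemaPayload] = None  # opaque to the runner core

@dataclass
class PCollection:
    # Option A would add an explicit `schema` field here; option B (shown)
    # leaves PCollection unchanged and hangs the schema off the coder.
    coder: Coder

pc = PCollection(coder=Coder(
    urn=SCHEMA_CODER_URN,
    payload=SchemaPayload(fields=[
        SchemaField("user_id", 0, "beam:type:int64:v1"),
        SchemaField("name", 1, "beam:type:string:v1"),
    ])))
assert [f.name for f in pc.coder.payload.fields] == ["user_id", "name"]
```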
On Wed, May 8, 2019 at 1:36 PM Reuven Lax <[email protected]> wrote:
>
> On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> wrote:
>
>> Very excited to see this. In particular, I think this will be very
>> useful for cross-language pipelines (not just SQL, but also for
>> describing non-trivial data, e.g. for source and sink reuse).
>>
>> The proto specification makes sense to me. The only thing that looks
>> like it's missing (other than possibly iterable, for arbitrarily-large
>> support) is multimap. Another basic type, should we want to support
>> it, is union (though this of course can get messy).
>
> multimap is an interesting suggestion. Do you have a use case in mind?
>
> union (or oneof) is also a good suggestion. There are good use cases for
> this, but this is a more fundamental change.
>
>> I'm curious what the rationale was for going with a oneof for type_info
>> rather than a repeated components field like we do with coders.
>
> No strong reason. Do you think repeated components is better than oneof?
>
>> Removing DATETIME as a logical coder on top of INT64 may cause issues
>> of insufficient resolution and/or timespan. Similarly with DECIMAL (or
>> would it be backed by string?)
>
> There could be multiple TIMESTAMP types for different resolutions, and
> they don't all need the same backing field type. E.g. the backing type
> for nanoseconds could be Row(INT64, INT64), or it could just be a byte
> array.

This seems to overlap heavily with the discussion about timestamp
precision in this other ML thread[1].

>> The biggest question, as far as portability is concerned at least, is
>> the notion of logical types. serialized_class is clearly not portable,
>> and I also think we'll want a way to share semantic meaning across
>> SDKs (especially if things like dates become logical types). Perhaps
>> URNs (+payloads) would be a better fit here?
>
> Yes, URN + payload is probably the better fit for portability.
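A rough sketch of what the URN + payload approach to logical types could look like from an SDK's point of view. The URN, class name, and registry here are made up for illustration; the real design would live in the portability protos. The key property is that a logical type is identified by a language-neutral URN and declares which schema field type backs it, rather than naming a serialized Java class.

```python
# Illustrative sketch of URN-identified logical types; not Beam's actual API.
import datetime

class MillisInstant:
    """Hypothetical logical type: timestamps as INT64 millis since epoch."""
    URN = "beam:logical_type:millis_instant:v1"  # made-up URN
    representation = "INT64"                     # the backing field type

    @staticmethod
    def to_representation(ts: datetime.datetime) -> int:
        # Map the SDK-local type to the portable backing type.
        return int(ts.timestamp() * 1000)

    @staticmethod
    def to_logical(millis: int) -> datetime.datetime:
        return datetime.datetime.fromtimestamp(
            millis / 1000.0, tz=datetime.timezone.utc)

# Each SDK keeps a registry keyed by URN, so any language that knows the
# URN can interpret the field; unknown URNs still decode as plain INT64.
LOGICAL_TYPES = {MillisInstant.URN: MillisInstant}

ts = datetime.datetime(2019, 5, 8, 13, 36, tzinfo=datetime.timezone.utc)
lt = LOGICAL_TYPES["beam:logical_type:millis_instant:v1"]
assert lt.to_logical(lt.to_representation(ts)) == ts
```

A TIMESTAMP(nanos) variant could be a second URN with a different backing type, which matches the point above that the resolutions need not share one representation.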
> +1
>
>> Taking a step back, I think it's worth asking why we have different
>> types, rather than simply making everything a LogicalType of bytes
>> (aka coder). Other than encoding format, the answer I can come up with
>> is that the type decides the kinds of operations that can be done on
>> it, e.g. does it support comparison? Arithmetic? Containment?
>> Higher-level date operations? Perhaps this should be used to guide the
>> set of types we provide.
>
> Also even though we could make everything a LogicalType (though at least
> byte array would have to stay primitive), I think it's useful to have a
> slightly larger set of primitive types. It makes things easier to
> understand and debug, and it makes it simpler for the various SDKs to
> map them to their types (e.g. mapping to POJOs).
>
>> (Also, +1 to optional over nullable.)
>
> Sounds good. Do others prefer optional as well?

Can rows backed by schemas have unset fields? If so, wouldn't you want to
differentiate between unset and null, which means you would need to
support both null and optional? I know in proto2 unset vs. null was
distinct, but with proto3 that distinction was removed.

>> From: Reuven Lax <[email protected]>
>> Date: Wed, May 8, 2019 at 6:54 PM
>> To: dev
>>
>>> Beam Java's support for schemas is just about done: we infer schemas
>>> from a variety of types, we have a variety of utility transforms
>>> (join, aggregate, etc.) for schemas, and schemas are integrated with
>>> the ParDo machinery. The big remaining task I'm working on is writing
>>> documentation and examples for all of this so that users are aware. If
>>> you're interested, these slides from the London Beam meetup show a bit
>>> more how schemas can be used and how they simplify the API.
>>>
>>> I want to start integrating schemas into portability so that they can
>>> be used from other languages such as Python (in particular this will
>>> also allow BeamSQL to be invoked from other languages).
>>> In order to do this, the Beam portability protos must have a way of
>>> representing schemas. Since this has not been discussed before, I'm
>>> starting this discussion now on the list.
>>>
>>> As a reminder: a schema represents the type of a PCollection as a
>>> collection of fields. Each field has a name, an id (position), and a
>>> field type. A field type can be either a primitive type (int, long,
>>> string, byte array, etc.), a nested row (itself with a schema), an
>>> array, or a map.
>>>
>>> We also support logical types. A logical type is a way for the user to
>>> embed their own types in schema fields. A logical type is always
>>> backed by a schema type, and contains a function for mapping the
>>> user's logical type to the field type. You can think of this as a
>>> generalization of a coder: while a coder always maps the user type to
>>> a byte array, a logical type can map to an int, or a string, or any
>>> other schema field type (in fact any coder can always be used as a
>>> logical type for mapping to byte-array field types). Logical types are
>>> used extensively by Beam SQL to represent SQL types that have no
>>> correspondence in Beam's field types (e.g. SQL has 4 different
>>> date/time types). Logical types for Beam schemas have a lot of
>>> similarities to AVRO logical types.
>>>
>>> An initial proto representation for schemas is here. Before we go
>>> further with this, I would like community consensus on what this
>>> representation should be. I can start by suggesting a few possible
>>> changes to this representation (and hopefully others will suggest
>>> others):
>>>
>>> Kenn Knowles has suggested removing DATETIME as a primitive type, and
>>> instead making it a logical type backed by INT64, as this keeps our
>>> primitive types closer to "classical" PL primitive types. This also
>>> allows us to create multiple versions of this type - e.g.
>>> TIMESTAMP(millis), TIMESTAMP(micros), TIMESTAMP(nanos).
>>> If we do the above, we can also consider removing DECIMAL and making
>>> that a logical type as well.
>>>
>>> The id field is currently used for some performance optimizations
>>> only. If we formalized the idea of schema types having ids, then we
>>> might be able to use this to allow self-recursive schemas
>>> (self-recursive types are not currently allowed).
>>>
>>> Beam Schemas currently have an ARRAY type. However Beam supports
>>> "large iterables" (iterables that don't fit in memory that the runner
>>> can page in), and this doesn't match well to arrays. I think we need
>>> to add an ITERABLE type as well to support things like GroupByKey
>>> results.
>>>
>>> It would also be interesting to explore allowing well-known metadata
>>> tags on fields that Beam interprets, e.g. key and value, to allow Beam
>>> to interpret any two-field schema as a KV, or window and timestamp to
>>> allow automatically filling those out. However this would be an
>>> extension to the current schema concept and deserves a separate
>>> discussion thread IMO.
>>>
>>> I ask that we please limit this discussion to the proto representation
>>> of schemas. If people want to discuss (or rediscuss) other things
>>> around Beam schemas, I'll be happy to create separate threads for
>>> those discussions.
>>>
>>> Thank you!
>>>
>>> Reuven

1: https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E
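The ARRAY vs. ITERABLE distinction in the quoted proposal can be sketched in a few lines. This is an illustrative toy, not Beam's actual runner interface: the class and the page-fetching callback are invented names. It only shows the behavioral difference that matters, namely that an ITERABLE value can be traversed (and re-traversed) by paging values in on demand, while an ARRAY is assumed to be fully materialized in memory.

```python
# Toy sketch of a pageable iterable (names are hypothetical, not Beam's API).
class PagedIterable:
    """Iterable whose values are fetched page-by-page, never all at once."""

    def __init__(self, fetch_page, num_pages):
        self._fetch_page = fetch_page  # callable: page index -> list of values
        self._num_pages = num_pages

    def __iter__(self):
        # Each page is requested lazily, so a GroupByKey result larger
        # than memory can still be traversed element by element.
        for page in range(self._num_pages):
            yield from self._fetch_page(page)

# Two values per page: page p yields [2p, 2p + 1].
values = PagedIterable(lambda p: [p * 2, p * 2 + 1], num_pages=3)
assert list(values) == [0, 1, 2, 3, 4, 5]
# Unlike a one-shot generator, it can be iterated again (pages re-fetched).
assert list(values) == [0, 1, 2, 3, 4, 5]
```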
