Re: [DISCUSS] Portability representation of schemas

Robert Bradshaw Wed, 08 May 2019 13:29:49 -0700

Very excited to see this. In particular, I think this will be very
useful for cross-language pipelines (not just SQL, but also for
describing non-trivial data (e.g. for source and sink reuse).


The proto specification makes sense to me. The only thing that looks
like it's missing (other than possibly iterable, for arbitrarily-large
support) is multimap. Another basic type, should we want to support
it, is union (though this of course can get messy).

I'm curious what the rational was for going with a oneof for type_info
rather than an repeated components like we do with coders.

Removing DATETIME as a logical coder on top of INT64 may cause issues
of insufficient resolution and/or timespan. Similarly with DECIMAL (or
would it be backed by string?)

The biggest question, as far as portability is concerned at least, is
the notion of logical types. serialized_class is clearly not portable,
and I also think we'll want a way to share semantic meaning across
SDKs (especially if things like dates become logical types). Perhaps
URNs (+payloads) would be a better fit here?


Taking a step back, I think it's worth asking why we have different
types, rather than simply making everything a LogicalType of bytes
(aka coder). Other than encoding format, the answer I can come up with
is that the type decides the kinds of operations that can be done on
it, e.g. does it support comparison? Arithmetic? Containment?
Higher-level date operations? Perhaps this should be used to guide the
set of types we provide.

(Also, +1 to optional over nullable.)


From: Reuven Lax <re...@google.com>
Date: Wed, May 8, 2019 at 6:54 PM
To: dev

> Beam Java's support for schemas is just about done: we infer schemas from a 
> variety of types, we have a variety of utility transforms (join, aggregate, 
> etc.) for schemas, and schemas are integrated with the ParDo machinery. The 
> big remaining task I'm working on is writing documentation and examples for 
> all of this so that users are aware. If you're interested, these slides from 
> the London Beam meetup show a bit more how schemas can be used and how they 
> simplify the API.
>
> I want to start integrating schemas into portability so that they can be used 
> from other languages such as Python (in particular this will also allow 
> BeamSQL to be invoked from other languages). In order to do this, the Beam 
> portability protos must have a way of representing schemas. Since this has 
> not been discussed before, I'm starting this discussion now on the list.
>
> As a reminder: a schema represents the type of a PCollection as a collection 
> of fields. Each field has a name, an id (position), and a field type. A field 
> type can be either a primitive type (int, long, string, byte array, etc.), a 
> nested row (itself with a schema), an array, or a map.
>
> We also support logical types. A logical type is a way for the user to embed 
> their own types in schema fields. A logical type is always backed by a schema 
> type, and contains a function for mapping the user's logical type to the 
> field type. You can think of this as a generalization of a coder: while a 
> coder always maps the user type to a byte array, a logical type can map to an 
> int, or a string, or any other schema field type (in fact any coder can 
> always be used as a logical type for mapping to byte-array field types). 
> Logical types are used extensively by Beam SQL to represent SQL types that 
> have no correspondence in Beam's field types (e.g. SQL has 4 different 
> date/time types). Logical types for Beam schemas have a lot of similarities 
> to AVRO logical types.
>
> An initial proto representation for schemas is here. Before we go further 
> with this, I would like community consensus on what this representation 
> should be. I can start by suggesting a few possible changes to this 
> representation (and hopefully others will suggest others):
>
> Kenn Knowles has suggested removing DATETIME as a primitive type, and instead 
> making it a logical type backed by INT64 as this keeps our primitive types 
> closer to "classical" PL primitive types. This also allows us to create 
> multiple versions of this type - e.g. TIMESTAMP(millis), TIMESTAMP(micros), 
> TIMESTAMP(nanos).
> If we do the above, we can also consider removing DECIMAL and making that a 
> logical type as well.
> The id field is currently used for some performance optimizations only. If we 
> formalized the idea of schema types having ids, then we might be able to use 
> this to allow self-recursive schemas (self-recursive types are not currently 
> allowed).
> Beam Schemas currently have an ARRAY type. However Beam supports "large 
> iterables" (iterables that don't fit in memory that the runner can page in), 
> and this doesn't match well to arrays. I think we need to add an ITERABLE 
> type as well to support things like GroupByKey results.
>
> It would also be interesting to explore allowing well-known metadata tags on 
> fields that Beam interprets. e.g. key and value, to allow Beam to interpret 
> any two-field schema as a KV, or window and timestamp to allow automatically 
> filling those out. However this would be an extension to the current schema 
> concept and deserves a separate discussion thread IMO.
>
> I ask that we please limit this discussion to the proto representation of 
> schemas. If people want to discuss (or rediscuss) other things around Beam 
> schemas, I'll be happy to create separate threads for those discussions.
>
> Thank you!
>
> Reuven

Re: [DISCUSS] Portability representation of schemas

Reply via email to