On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> wrote:

> Very excited to see this. In particular, I think this will be very
> useful for cross-language pipelines (not just SQL, but also for
> describing non-trivial data (e.g. for source and sink reuse).
>
> The proto specification makes sense to me. The only thing that looks
> like it's missing (other than possibly iterable, for arbitrarily-large
> support) is multimap. Another basic type, should we want to support
> it, is union (though this of course can get messy).
>

Multimap is an interesting suggestion. Do you have a use case in mind?

Union (or oneof) is also a good suggestion. There are good use cases for
it, but it would be a more fundamental change.


> I'm curious what the rationale was for going with a oneof for type_info
> rather than repeated components like we do with coders.
>

No strong reason. Do you think repeated components would be better than a
oneof?
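
For concreteness, here is a rough sketch of the two alternatives (all
message, field, and enum names below are invented for illustration, not
taken from the actual proto):

    syntax = "proto3";

    // Alternative 1: a oneof, with one case per kind of type.
    message FieldTypeWithOneof {
      oneof type_info {
        AtomicType atomic_type = 1;
        ArrayType array_type = 2;
        MapType map_type = 3;
      }
    }
    message ArrayType { FieldTypeWithOneof element_type = 1; }
    message MapType {
      FieldTypeWithOneof key_type = 1;
      FieldTypeWithOneof value_type = 2;
    }

    // Alternative 2: coder-style, a kind tag plus repeated components.
    message FieldTypeWithComponents {
      enum Kind { KIND_UNSPECIFIED = 0; ATOMIC = 1; ARRAY = 2; MAP = 3; }
      Kind kind = 1;
      AtomicType atomic_type = 2;  // set only when kind == ATOMIC
      // [element] for ARRAY; [key, value] for MAP.
      repeated FieldTypeWithComponents components = 3;
    }

    enum AtomicType { ATOMIC_UNSPECIFIED = 0; INT64 = 1; STRING = 2; BYTES = 3; }

The oneof keeps each composite type's structure explicit and checked by the
proto compiler; the repeated-components form mirrors the coder
representation and can grow new kinds without changing the message shape.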


> Removing DATETIME in favor of a logical type on top of INT64 may cause
> issues of insufficient resolution and/or timespan. Similarly with DECIMAL
> (or would it be backed by string?)
>

There could be multiple TIMESTAMP types for different resolutions, and they
don't all need the same backing field type. E.g. the backing type for
nanoseconds could be Row(INT64, INT64), or it could just be a byte array.
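
A sketch of that idea (identifiers invented; FieldType stands in for
whatever the schema proto's field-type message ends up being):

    message LogicalType {
      string id = 1;                 // e.g. "timestamp_millis", "timestamp_nanos"
      FieldType representation = 2;  // the backing field type
    }

    // Possible representations:
    //   timestamp_millis -> INT64 (millis since epoch)
    //   timestamp_nanos  -> ROW<INT64 seconds, INT64 nanos>, or simply BYTES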



>
> The biggest question, as far as portability is concerned at least, is
> the notion of logical types. serialized_class is clearly not portable,
> and I also think we'll want a way to share semantic meaning across
> SDKs (especially if things like dates become logical types). Perhaps
> URNs (+payloads) would be a better fit here?
>

Yes, URN + payload is probably the better fit for portability.
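
Something along these lines, perhaps (the URN and field names here are
illustrative, not settled):

    message LogicalType {
      // A portable identifier, e.g. "beam:logical_type:timestamp_nanos:v1",
      // instead of an SDK-specific serialized class.
      string urn = 1;
      // Optional type-specific arguments (e.g. precision and scale for a
      // decimal type), interpreted by whoever understands the URN.
      bytes payload = 2;
      // The backing schema field type.
      FieldType representation = 3;
    }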


>
>
> Taking a step back, I think it's worth asking why we have different
> types, rather than simply making everything a LogicalType of bytes
> (aka coder). Other than encoding format, the answer I can come up with
> is that the type decides the kinds of operations that can be done on
> it, e.g. does it support comparison? Arithmetic? Containment?
> Higher-level date operations? Perhaps this should be used to guide the
> set of types we provide.
>

Even though we could make everything a LogicalType (though at least byte
array would have to stay primitive), I think it's useful to have a slightly
larger set of primitive types. It makes things easier to understand and
debug, and it makes it simpler for the various SDKs to map them to their
native types (e.g. mapping to POJOs).
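
As one illustration of what a "slightly larger" primitive set could look
like (this exact list is a guess, not something proposed in the thread):

    enum AtomicType {
      UNSPECIFIED = 0;
      BYTE = 1;      // Java byte
      INT16 = 2;     // Java short
      INT32 = 3;     // Java int
      INT64 = 4;     // Java long
      FLOAT = 5;
      DOUBLE = 6;
      STRING = 7;
      BOOLEAN = 8;
      BYTES = 9;     // the fallback representation for logical types
    }

Each of these maps directly onto a POJO field type, which is what keeps
schema inference and debugging straightforward.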


> (Also, +1 to optional over nullable.)
>

Sounds good. Do others prefer optional as well?


>
>
> From: Reuven Lax <[email protected]>
> Date: Wed, May 8, 2019 at 6:54 PM
> To: dev
>
> > Beam Java's support for schemas is just about done: we infer schemas
> from a variety of types, we have a variety of utility transforms (join,
> aggregate, etc.) for schemas, and schemas are integrated with the ParDo
> machinery. The big remaining task I'm working on is writing documentation
> and examples for all of this so that users are aware. If you're interested,
> these slides from the London Beam meetup show a bit more about how schemas
> can be used and how they simplify the API.
> >
> > I want to start integrating schemas into portability so that they can be
> used from other languages such as Python (in particular this will also
> allow BeamSQL to be invoked from other languages). In order to do this, the
> Beam portability protos must have a way of representing schemas. Since this
> has not been discussed before, I'm starting this discussion now on the list.
> >
> > As a reminder: a schema represents the type of a PCollection as a
> collection of fields. Each field has a name, an id (position), and a field
> type. A field type can be either a primitive type (int, long, string, byte
> array, etc.), a nested row (itself with a schema), an array, or a map.
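
A self-contained sketch of that structure (illustrative names, not the
actual proto):

    syntax = "proto3";

    message Schema {
      repeated Field fields = 1;
    }

    message Field {
      string name = 1;
      int32 id = 2;        // the field's position
      FieldType type = 3;
    }

    message FieldType {
      oneof type_info {
        AtomicType atomic_type = 1;  // int, long, string, byte array, ...
        RowType row_type = 2;        // a nested row, with its own schema
        ArrayType array_type = 3;
        MapType map_type = 4;
      }
    }

    enum AtomicType { UNSPECIFIED = 0; INT32 = 1; INT64 = 2; STRING = 3; BYTES = 4; }
    message RowType { Schema schema = 1; }
    message ArrayType { FieldType element_type = 1; }
    message MapType {
      FieldType key_type = 1;
      FieldType value_type = 2;
    }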
> >
> > We also support logical types. A logical type is a way for the user to
> embed their own types in schema fields. A logical type is always backed by
> a schema type, and contains a function for mapping the user's logical type
> to the field type. You can think of this as a generalization of a coder:
> while a coder always maps the user type to a byte array, a logical type can
> map to an int, or a string, or any other schema field type (in fact any
> coder can always be used as a logical type for mapping to byte-array field
> types). Logical types are used extensively by Beam SQL to represent SQL
> types that have no correspondence in Beam's field types (e.g. SQL has 4
> different date/time types). Logical types for Beam schemas have a lot of
> similarities to Avro logical types.
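
In textproto form, a coder wrapped as a logical type might look roughly like
this (the URN and payload are invented placeholders; the message shape
follows the URN + payload sketch earlier in the thread):

    # A LogicalType whose backing field type is byte array.
    urn: "beam:logical_type:via_coder:v1"
    payload: "<serialized coder spec>"     # opaque to the schema layer
    representation { atomic_type: BYTES }  # coders always map to bytes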
> >
> > An initial proto representation for schemas is here. Before we go
> further with this, I would like community consensus on what this
> representation should be. I can start by suggesting a few possible changes
> to this representation (and hopefully others will suggest more):
> >
> > * Kenn Knowles has suggested removing DATETIME as a primitive type, and
> instead making it a logical type backed by INT64, as this keeps our
> primitive types closer to "classical" PL primitive types. This also allows
> us to create multiple versions of this type - e.g. TIMESTAMP(millis),
> TIMESTAMP(micros), TIMESTAMP(nanos).
> > * If we do the above, we can also consider removing DECIMAL and making
> that a logical type as well.
> > * The id field is currently used only for some performance
> optimizations. If we formalized the idea of schema types having ids, then
> we might be able to use this to allow self-recursive schemas
> (self-recursive types are not currently allowed).
> > * Beam Schemas currently have an ARRAY type. However, Beam supports
> "large iterables" (iterables that don't fit in memory, which the runner
> can page in), and these don't map well to arrays. I think we need to add
> an ITERABLE type as well to support things like GroupByKey results (see
> the sketch below).
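
The ITERABLE addition could be as small as this (reusing the FieldType
shape from the sketch above; names illustrative):

    // Unlike ARRAY, ITERABLE promises neither random access nor that the
    // elements fit in memory, so a runner can page them in lazily.
    message IterableType { FieldType element_type = 1; }

    // FieldType's oneof would grow a corresponding case:
    //   IterableType iterable_type = 5;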
> >
> > It would also be interesting to explore allowing well-known metadata
> tags on fields that Beam interprets, e.g. key and value to allow Beam to
> interpret any two-field schema as a KV, or window and timestamp to allow
> automatically filling those out. However, this would be an extension to
> the current schema concept and deserves a separate discussion thread IMO.
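
The smallest version of that idea might be a tags field on Field (purely a
sketch; nothing here is settled):

    message Field {
      string name = 1;
      int32 id = 2;
      FieldType type = 3;
      // Well-known tags Beam could interpret, e.g. "key", "value",
      // "window", "timestamp".
      repeated string tags = 4;
    }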
> >
> > I ask that we please limit this discussion to the proto representation
> of schemas. If people want to discuss (or rediscuss) other things around
> Beam schemas, I'll be happy to create separate threads for those
> discussions.
> >
> > Thank you!
> >
> > Reuven
>
