Are you suggesting that schemas become an explicit field on PCollection, or that the coder on PCollections has a well-known schema coder type whose payload contains field names, ids, types, ...? I'm much more for the latter, since it allows for versioning schema representations over time without needing a change to the protos.
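To make the two options concrete, here is a minimal sketch of the "well-known schema coder" idea. All names here (the URN, the dataclasses, the type URNs) are illustrative assumptions, not the actual Beam proto definitions: the point is only that the schema lives in a versioned coder payload, so the schema representation can evolve without touching the PCollection proto itself.

```python
# Hypothetical sketch -- names are illustrative, not the real Beam protos.
from dataclasses import dataclass, field
from typing import List, Optional

# Version lives in the URN, so a v2 payload format needs no proto change.
SCHEMA_CODER_URN = "beam:coder:schema:v1"

@dataclass
class SchemaField:
    name: str
    id: int        # positional id
    type_urn: str  # e.g. a hypothetical "beam:type:int64:v1"

@dataclass
class SchemaPayload:
    fields: List[SchemaField] = field(default_factory=list)

@dataclass
class Coder:
    urn: str
    payload: Optional[SchemaPayload] = None  # opaque to the runner core

@dataclass
class PCollection:
    # Option A would add an explicit `schema` field here; option B (shown)
    # leaves PCollection unchanged and hangs the schema off the coder.
    coder: Coder

pc = PCollection(coder=Coder(
    urn=SCHEMA_CODER_URN,
    payload=SchemaPayload(fields=[
        SchemaField("user_id", 0, "beam:type:int64:v1"),
        SchemaField("name", 1, "beam:type:string:v1"),
    ])))
assert [f.name for f in pc.coder.payload.fields] == ["user_id", "name"]
```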
On Wed, May 8, 2019 at 1:36 PM Reuven Lax <[email protected]> wrote:
>
> On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> wrote:
>
>> Very excited to see this. In particular, I think this will be very
>> useful for cross-language pipelines (not just SQL, but also for
>> describing non-trivial data, e.g. for source and sink reuse).
>>
>> The proto specification makes sense to me. The only thing that looks
>> like it's missing (other than possibly iterable, for arbitrarily-large
>> support) is multimap. Another basic type, should we want to support
>> it, is union (though this of course can get messy).
>
> multimap is an interesting suggestion. Do you have a use case in mind?
>
> union (or oneof) is also a good suggestion. There are good use cases for
> this, but this is a more fundamental change.
>
>> I'm curious what the rationale was for going with a oneof for type_info
>> rather than a repeated components field like we do with coders.
>
> No strong reason. Do you think repeated components is better than oneof?
>
>> Removing DATETIME as a logical coder on top of INT64 may cause issues
>> of insufficient resolution and/or timespan. Similarly with DECIMAL (or
>> would it be backed by string?)
>
> There could be multiple TIMESTAMP types for different resolutions, and
> they don't all need the same backing field type. E.g. the backing type
> for nanoseconds could be Row(INT64, INT64), or it could just be a byte
> array.

This seems to overlap heavily with the discussion about timestamp
precision in this other ML thread[1].

>> The biggest question, as far as portability is concerned at least, is
>> the notion of logical types. serialized_class is clearly not portable,
>> and I also think we'll want a way to share semantic meaning across
>> SDKs (especially if things like dates become logical types). Perhaps
>> URNs (+payloads) would be a better fit here?
>
> Yes, URN + payload is probably the better fit for portability.
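A rough sketch of what the URN + payload approach to logical types could look like from an SDK's point of view. The URN, class name, and registry here are made up for illustration; the real design would live in the portability protos. The key property is that a logical type is identified by a language-neutral URN and declares which schema field type backs it, rather than naming a serialized Java class.

```python
# Illustrative sketch of URN-identified logical types; not Beam's actual API.
import datetime

class MillisInstant:
    """Hypothetical logical type: timestamps as INT64 millis since epoch."""
    URN = "beam:logical_type:millis_instant:v1"  # made-up URN
    representation = "INT64"                     # the backing field type

    @staticmethod
    def to_representation(ts: datetime.datetime) -> int:
        # Map the SDK-local type to the portable backing type.
        return int(ts.timestamp() * 1000)

    @staticmethod
    def to_logical(millis: int) -> datetime.datetime:
        return datetime.datetime.fromtimestamp(
            millis / 1000.0, tz=datetime.timezone.utc)

# Each SDK keeps a registry keyed by URN, so any language that knows the
# URN can interpret the field; unknown URNs still decode as plain INT64.
LOGICAL_TYPES = {MillisInstant.URN: MillisInstant}

ts = datetime.datetime(2019, 5, 8, 13, 36, tzinfo=datetime.timezone.utc)
lt = LOGICAL_TYPES["beam:logical_type:millis_instant:v1"]
assert lt.to_logical(lt.to_representation(ts)) == ts
```

A TIMESTAMP(nanos) variant could be a second URN with a different backing type, which matches the point above that the resolutions need not share one representation.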
> +1
>
>> Taking a step back, I think it's worth asking why we have different
>> types, rather than simply making everything a LogicalType of bytes
>> (aka coder). Other than encoding format, the answer I can come up with
>> is that the type decides the kinds of operations that can be done on
>> it, e.g. does it support comparison? Arithmetic? Containment?
>> Higher-level date operations? Perhaps this should be used to guide the
>> set of types we provide.
>
> Also even though we could make everything a LogicalType (though at least
> byte array would have to stay primitive), I think it's useful to have a
> slightly larger set of primitive types. It makes things easier to
> understand and debug, and it makes it simpler for the various SDKs to
> map them to their types (e.g. mapping to POJOs).
>
>> (Also, +1 to optional over nullable.)
>
> Sounds good. Do others prefer optional as well?

Can rows backed by schemas have unset fields? If so, wouldn't you want to
differentiate between unset and null, which means you would need to
support both null and optional? I know in proto2 unset vs. null was
distinct, but with proto3 that distinction was removed.

>> From: Reuven Lax <[email protected]>
>> Date: Wed, May 8, 2019 at 6:54 PM
>> To: dev
>>
>>> Beam Java's support for schemas is just about done: we infer schemas
>>> from a variety of types, we have a variety of utility transforms
>>> (join, aggregate, etc.) for schemas, and schemas are integrated with
>>> the ParDo machinery. The big remaining task I'm working on is writing
>>> documentation and examples for all of this so that users are aware. If
>>> you're interested, these slides from the London Beam meetup show a bit
>>> more how schemas can be used and how they simplify the API.
>>>
>>> I want to start integrating schemas into portability so that they can
>>> be used from other languages such as Python (in particular this will
>>> also allow BeamSQL to be invoked from other languages).
>>> In order to do this, the Beam portability protos must have a way of
>>> representing schemas. Since this has not been discussed before, I'm
>>> starting this discussion now on the list.
>>>
>>> As a reminder: a schema represents the type of a PCollection as a
>>> collection of fields. Each field has a name, an id (position), and a
>>> field type. A field type can be either a primitive type (int, long,
>>> string, byte array, etc.), a nested row (itself with a schema), an
>>> array, or a map.
>>>
>>> We also support logical types. A logical type is a way for the user to
>>> embed their own types in schema fields. A logical type is always
>>> backed by a schema type, and contains a function for mapping the
>>> user's logical type to the field type. You can think of this as a
>>> generalization of a coder: while a coder always maps the user type to
>>> a byte array, a logical type can map to an int, or a string, or any
>>> other schema field type (in fact any coder can always be used as a
>>> logical type for mapping to byte-array field types). Logical types are
>>> used extensively by Beam SQL to represent SQL types that have no
>>> correspondence in Beam's field types (e.g. SQL has 4 different
>>> date/time types). Logical types for Beam schemas have a lot of
>>> similarities to AVRO logical types.
>>>
>>> An initial proto representation for schemas is here. Before we go
>>> further with this, I would like community consensus on what this
>>> representation should be. I can start by suggesting a few possible
>>> changes to this representation (and hopefully others will suggest
>>> others):
>>>
>>> Kenn Knowles has suggested removing DATETIME as a primitive type, and
>>> instead making it a logical type backed by INT64, as this keeps our
>>> primitive types closer to "classical" PL primitive types. This also
>>> allows us to create multiple versions of this type - e.g.
>>> TIMESTAMP(millis), TIMESTAMP(micros), TIMESTAMP(nanos).
>>> If we do the above, we can also consider removing DECIMAL and making
>>> that a logical type as well.
>>>
>>> The id field is currently used for some performance optimizations
>>> only. If we formalized the idea of schema types having ids, then we
>>> might be able to use this to allow self-recursive schemas
>>> (self-recursive types are not currently allowed).
>>>
>>> Beam Schemas currently have an ARRAY type. However Beam supports
>>> "large iterables" (iterables that don't fit in memory that the runner
>>> can page in), and this doesn't match well to arrays. I think we need
>>> to add an ITERABLE type as well to support things like GroupByKey
>>> results.
>>>
>>> It would also be interesting to explore allowing well-known metadata
>>> tags on fields that Beam interprets, e.g. key and value, to allow Beam
>>> to interpret any two-field schema as a KV, or window and timestamp to
>>> allow automatically filling those out. However this would be an
>>> extension to the current schema concept and deserves a separate
>>> discussion thread IMO.
>>>
>>> I ask that we please limit this discussion to the proto representation
>>> of schemas. If people want to discuss (or rediscuss) other things
>>> around Beam schemas, I'll be happy to create separate threads for
>>> those discussions.
>>>
>>> Thank you!
>>>
>>> Reuven

1: https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E
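The ARRAY vs. ITERABLE distinction in the quoted proposal can be sketched in a few lines. This is an illustrative toy, not Beam's actual runner interface: the class and the page-fetching callback are invented names. It only shows the behavioral difference that matters, namely that an ITERABLE value can be traversed (and re-traversed) by paging values in on demand, while an ARRAY is assumed to be fully materialized in memory.

```python
# Toy sketch of a pageable iterable (names are hypothetical, not Beam's API).
class PagedIterable:
    """Iterable whose values are fetched page-by-page, never all at once."""

    def __init__(self, fetch_page, num_pages):
        self._fetch_page = fetch_page  # callable: page index -> list of values
        self._num_pages = num_pages

    def __iter__(self):
        # Each page is requested lazily, so a GroupByKey result larger
        # than memory can still be traversed element by element.
        for page in range(self._num_pages):
            yield from self._fetch_page(page)

# Two values per page: page p yields [2p, 2p + 1].
values = PagedIterable(lambda p: [p * 2, p * 2 + 1], num_pages=3)
assert list(values) == [0, 1, 2, 3, 4, 5]
# Unlike a one-shot generator, it can be iterated again (pages re-fetched).
assert list(values) == [0, 1, 2, 3, 4, 5]
```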
