This is a huge development. Top-posting because I can be more compact. I really think that after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
*Multiple timestamp resolutions*: you can use logical types to represent
nanos the same way Java and proto do.

*Why multiple int types?* The domain of values for these types is
different. For a language with one "int" or "number" type, that's another
domain of values.

*Columnar/Arrow*: making sure we unlock the ability to take this path is
paramount. So tying it directly to a row-oriented coder seems
counterproductive.

*Nullable/optional*: optional as it exists in Java, Haskell, Scala, OCaml,
etc., is strictly more expressive than the billion-dollar mistake.
Nullability of a field is different from, and less expressive than,
nullability of a type.

*Union types*: tagged disjoint unions (oneof) are the most useful form of
union. Embedding them into a relational model, you get something like
proto's oneof. Not too hard to add later.

*Multimap*: what does it add over an array-valued map or
large-iterable-valued map? (honest question, not rhetorical)

*id* is a loaded term in other places in the model. I would call it
something else.

*URN/enum for type names*: I see the case for both. The core types are
fundamental enough that they should never really change - after all,
proto, Thrift, Avro, and Arrow have addressed this (not to mention most
programming languages). Maybe additions once every few years. I prefer
the smallest intersection of these schema languages. A oneof is clearer,
while a URN emphasizes the similarity of built-in and logical types.

*Multiple encodings of a value*: I actually think this is a benefit.
There's a lot to unpack here.

*Language specifics*: the design doc should describe the domain of values,
and this should go in the core docs. Then for each SDK it should
explicitly say what language type (or types?) the values are embedded in,
just like proto's language guides.

Kenn

*From: *Udi Meiri <[email protected]>
*Date: *Wed, May 8, 2019, 18:48
*To: * <[email protected]>

> From a Python type hints perspective, how do schemas fit? Type hints are
> currently used to determine which coder to use.
> It seems that given a schema field, it would be useful to be able to
> convert it to a coder (using URNs?), and to convert the coder into a
> typing type.
> This would allow for pipeline-construction-time type compatibility
> checks.
>
> Some questions:
> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to
> maintain type fidelity when writing back? If so, what happens in
> languages that only have "int"?
> 2. What is encoding_position? How does it differ from id (which is also
> a position)?
> 3. When are schema protos constructed? Are they available during
> pipeline construction or afterwards?
> 4. Once data is read into a Beam pipeline and a schema inferred, do we
> maintain the schema types throughout the pipeline or use language-local
> types?
>
>
> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]>
> wrote:
>
>> From: Reuven Lax <[email protected]>
>> Date: Wed, May 8, 2019 at 10:36 PM
>> To: dev
>>
>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]>
>> > wrote:
>> >>
>> >> Very excited to see this. In particular, I think this will be very
>> >> useful for cross-language pipelines (not just SQL, but also for
>> >> describing non-trivial data, e.g. for source and sink reuse).
>> >>
>> >> The proto specification makes sense to me. The only thing that looks
>> >> like it's missing (other than possibly iterable, for
>> >> arbitrarily-large support) is multimap.
>> >> Another basic type, should we want to support it, is union (though
>> >> this of course can get messy).
>> >
>> > multimap is an interesting suggestion. Do you have a use case in
>> > mind?
>> >
>> > union (or oneof) is also a good suggestion. There are good use cases
>> > for this, but this is a more fundamental change.
>>
>> No specific use case, they just seemed to round out the options.
>>
>> >> I'm curious what the rationale was for going with a oneof for
>> >> type_info rather than a repeated components like we do with coders.
>> >
>> > No strong reason. Do you think repeated components is better than
>> > oneof?
>>
>> It's more consistent with how we currently do coders (which has pros
>> and cons).
>>
>> >> Removing DATETIME as a logical coder on top of INT64 may cause
>> >> issues of insufficient resolution and/or timespan. Similarly with
>> >> DECIMAL (or would it be backed by string?)
>> >
>> > There could be multiple TIMESTAMP types for different resolutions,
>> > and they don't all need the same backing field type. E.g. the backing
>> > type for nanoseconds could be Row(INT64, INT64), or it could just be
>> > a byte array.
>>
>> Hmm.... What would the value be in supporting different types of
>> timestamps? Would all SDKs have to support all of them? Can one
>> compare, take differences, etc. across timestamp types? (As Luke
>> points out, the other conversation on timestamps is likely relevant
>> here as well.)
>>
>> >> The biggest question, as far as portability is concerned at least,
>> >> is the notion of logical types. serialized_class is clearly not
>> >> portable, and I also think we'll want a way to share semantic
>> >> meaning across SDKs (especially if things like dates become logical
>> >> types). Perhaps URNs (+payloads) would be a better fit here?
>> >
>> > Yes, URN + payload is probably the better fit for portability.
>> >
>> >> Taking a step back, I think it's worth asking why we have different
>> >> types, rather than simply making everything a LogicalType of bytes
>> >> (aka coder). Other than encoding format, the answer I can come up
>> >> with is that the type decides the kinds of operations that can be
>> >> done on it, e.g. does it support comparison? Arithmetic?
>> >> Containment? Higher-level date operations? Perhaps this should be
>> >> used to guide the set of types we provide.
>> >
>> > Also even though we could make everything a LogicalType (though at
>> > least byte array would have to stay primitive), I think it's useful
>> > to have a slightly larger set of primitive types. It makes things
>> > easier to understand and debug, and it makes it simpler for the
>> > various SDKs to map them to their types (e.g. mapping to POJOs).
>>
>> This would be the case if one didn't have LogicalType at all, but once
>> one introduces that one now has this more complicated two-level
>> hierarchy of types which doesn't seem simpler to me.
>>
>> I'm trying to understand what information Schema encodes that a
>> NamedTupleCoder (or RowCoder) would/could not. (Coders have the
>> disadvantage that there are multiple encodings of a single value, e.g.
>> BigEndian vs. VarInt, but if we have multiple resolutions of timestamp
>> that would still seem to be an issue. Possibly another advantage is
>> encoding into non-record-oriented formats, e.g. Parquet or Arrow, that
>> have a set of primitives.)
>>
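For concreteness, here is a minimal proto3 sketch of the shape the thread
converges on: a small set of built-in types in a oneof, with logical types
identified portably by URN + payload rather than by a serialized class.
All message, field, and URN names below are illustrative assumptions for
discussion, not the actual spec:

    syntax = "proto3";

    // Hypothetical sketch of the schema proto under discussion.
    message FieldType {
      // Nullability is a flag on the field type, per the
      // nullable/optional discussion above.
      bool nullable = 1;
      oneof type_info {
        AtomicType atomic_type = 2;    // small set of built-in types
        ArrayType array_type = 3;
        MapType map_type = 4;
        RowType row_type = 5;
        LogicalType logical_type = 6;  // extension point for the rest
      }
    }

    // Deliberately small; everything else layers on as a logical type.
    enum AtomicType {
      UNSPECIFIED = 0;
      BYTE = 1;
      INT16 = 2;
      INT32 = 3;
      INT64 = 4;
      FLOAT = 5;
      DOUBLE = 6;
      STRING = 7;
      BOOLEAN = 8;
      BYTES = 9;
    }

    message ArrayType {
      FieldType element_type = 1;
    }

    message MapType {
      FieldType key_type = 1;
      FieldType value_type = 2;
    }

    message RowType {
      Schema schema = 1;
    }

    message Schema {
      repeated Field fields = 1;
    }

    message Field {
      string name = 1;
      FieldType type = 2;
    }

    // Identified by URN + payload for portability, instead of a
    // serialized Java class.
    message LogicalType {
      string urn = 1;                // hypothetical naming scheme
      bytes payload = 2;             // optional type parameters
      FieldType representation = 3;  // built-in type backing the encoding
    }

Note that a union/oneof field type could later be added as another branch
of type_info without disturbing the rest, consistent with the "not too
hard to add later" point above.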

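Under the same hypothetical names, the multiple-resolution timestamp
discussion also becomes concrete: a nanosecond timestamp can be a logical
type whose representation is a two-field row (seconds plus nanos,
mirroring proto's own Timestamp; Reuven's Row(INT64, INT64) would work
equally well). In proto text format:

    logical_type {
      urn: "beam:logical_type:nanos_instant:v1"  # hypothetical URN
      representation {
        row_type {
          schema {
            fields { name: "seconds" type { atomic_type: INT64 } }
            fields { name: "nanos" type { atomic_type: INT32 } }
          }
        }
      }
    }

An SDK that does not recognize the URN could still round-trip the value
through the row representation, which is one possible answer to Robert's
"would all SDKs have to support all of them?" question.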