My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is a better name for DATETIME) a first-class citizen, we'll get *inconsistencies* between the different portability implementations. The same holds true for DECIMAL and DURATION. If we aren't giving pipeline developers a consistent way of working with timestamps, we're going to generate a lot of frustration.
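To make the concern concrete, here is a rough sketch (purely illustrative, not the actual Beam API; the URN and all type names below are made up) of how a TIMESTAMP logical type could be pinned to a URN and a fixed base representation, so that every SDK converts to and from the same backing value:

    // Illustrative only: a logical type identified by a URN and backed by a
    // fixed primitive representation, so every SDK maps the same encoded
    // value to the same point in time.
    enum PrimitiveType { INT32, INT64, STRING, BYTES }

    interface LogicalType<T> {
      String urn();                  // e.g. "beam:logical_type:timestamp_micros:v1" (made up)
      PrimitiveType baseType();      // the primitive type backing the value
      T fromBase(Object baseValue);  // decode from the base representation
      Object toBase(T value);        // encode to the base representation
    }

    // A microsecond-precision timestamp backed by a single INT64
    // (microseconds since the Unix epoch).
    class TimestampMicros implements LogicalType<java.time.Instant> {
      public String urn() { return "beam:logical_type:timestamp_micros:v1"; }
      public PrimitiveType baseType() { return PrimitiveType.INT64; }
      public java.time.Instant fromBase(Object baseValue) {
        long micros = (Long) baseValue;
        return java.time.Instant.ofEpochSecond(
            Math.floorDiv(micros, 1_000_000L),
            Math.floorMod(micros, 1_000_000L) * 1_000L);
      }
      public Object toBase(java.time.Instant value) {
        return value.getEpochSecond() * 1_000_000L + value.getNano() / 1_000L;
      }
    }

If every SDK agrees on the URN and on the INT64-micros representation, a timestamp written by the Java SDK and read by the Python SDK means the same instant. That is the kind of consistency I'm worried about losing if this stays SDK-specific.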
I always said "TIMESTAMP's are the nail in the coffin of data engineers"... For the rest It's a bit too early to make a lot of informed input here, as I just started working with schema's for my protobuf implementation. _/ _/ Alex Van Boxel On Thu, May 9, 2019 at 10:05 AM Kenneth Knowles <[email protected]> wrote: > This is a huge development. Top posting because I can be more compact. > > I really think after the initial idea converges this needs a design doc > with goals and alternatives. It is an extraordinarily consequential model > change. So in the spirit of doing the work / bias towards action, I created > a quick draft at https://s.apache.org/beam-schemas and added everyone on > this thread as editors. I am still in the process of writing this to match > the thread. > > *Multiple timestamp resolutions*: you can use logcial types to represent > nanos the same way Java and proto do. > > *Why multiple int types?* The domain of values for these types are > different. For a language with one "int" or "number" type, that's another > domain of values. > > *Columnar/Arrow*: making sure we unlock the ability to take this path is > Paramount. So tying it directly to a row-oriented coder seems > counterproductive. > > *Nullable/optional*: optional as it exists in Java, Haskell, Scala, ocaml, > etc, is strictly more expressive than the billion dollar mistake. > Nullability of a field is different and less expressive than nullability of > a type. > > *Union types*: tagged disjoint unions and oneof are the most useful form > of union. Embedding them into a relational model you get something like > proto oneof. Not too hard to add later. > > *Multimap*: what does it add over an array-valued map or > large-iterable-valued map? (honest question, not rhetorical) > > *id* is a loaded term in other places in the model. I would call it > something else. > > *URN/enum for type names*: I see the case for both. The core types are > fundamental enough they should never really change - after all, proto, > thrift, avro, arrow, have addressed this (not to mention most programming > languages). Maybe additions once every few years. I prefer the smallest > intersection of these schema languages. A oneof is more clear, while URN > emphasizes the similarity of built-in and logical types. > > *Multiple encodings of a value*: I actually think this is a benefit. > There's a lot to unpack here. > > *Language specifics*: the design doc should describe the domain of values, > and this should go in the core docs. Then for each SDK it should explicitly > say what language type (or types?) the values are embedded in. Just like > protos language guides. > > Kenn > > *From: *Udi Meiri <[email protected]> > *Date: *Wed, May 8, 2019, 18:48 > *To: * <[email protected]> > > From a Python type hints perspective, how do schemas fit? Type hints are >> currently used to determine which coder to use. >> It seems that given a schema field, it would be useful to be able to >> convert it to a coder (using URNs?), and to convert the coder into a typing >> type. >> This would allow for pipeline-construction-time type compatibility checks. >> >> Some questions: >> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to >> maintain type fidelity when writing back? If so, what happens in languages >> that only have "int"? >> 2. What is encoding_position? How does it differ from id (which is also a >> position)? >> 3. When are schema protos constructed? Are they available during pipeline >> construction or afterwards? >> 4. 
Once data is read into a Beam pipeline and a schema inferred, do we >> maintain the schema types throughout the pipeline or use language-local >> types? >> >> >> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]> >> wrote: >> >>> From: Reuven Lax <[email protected]> >>> Date: Wed, May 8, 2019 at 10:36 PM >>> To: dev >>> >>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >> >>> >> Very excited to see this. In particular, I think this will be very >>> >> useful for cross-language pipelines (not just SQL, but also for >>> >> describing non-trivial data (e.g. for source and sink reuse). >>> >> >>> >> The proto specification makes sense to me. The only thing that looks >>> >> like it's missing (other than possibly iterable, for arbitrarily-large >>> >> support) is multimap. Another basic type, should we want to support >>> >> it, is union (though this of course can get messy). >>> > >>> > multimap is an interesting suggestion. Do you have a use case in mind? >>> > >>> > union (or oneof) is also a good suggestion. There are good use cases >>> for this, but this is a more fundamental change. >>> >>> No specific usecase, they just seemed to round out the options. >>> >>> >> I'm curious what the rational was for going with a oneof for type_info >>> >> rather than an repeated components like we do with coders. >>> > >>> > No strong reason. Do you think repeated components is better than >>> oneof? >>> >>> It's more consistent with how we currently do coders (which has pros and >>> cons). >>> >>> >> Removing DATETIME as a logical coder on top of INT64 may cause issues >>> >> of insufficient resolution and/or timespan. Similarly with DECIMAL (or >>> >> would it be backed by string?) >>> > >>> > There could be multiple TIMESTAMP types for different resolutions, and >>> they don't all need the same backing field type. E.g. the backing type for >>> nanoseconds could by Row(INT64, INT64), or it could just be a byte array. >>> >>> Hmm.... What would the value be in supporting different types of >>> timestamps? Would all SDKs have to support all of them? Can one >>> compare, take differences, etc. across timestamp types? (As Luke >>> points out, the other conversation on timestamps is likely relevant >>> here as well.) >>> >>> >> The biggest question, as far as portability is concerned at least, is >>> >> the notion of logical types. serialized_class is clearly not portable, >>> >> and I also think we'll want a way to share semantic meaning across >>> >> SDKs (especially if things like dates become logical types). Perhaps >>> >> URNs (+payloads) would be a better fit here? >>> > >>> > Yes, URN + payload is probably the better fit for portability. >>> > >>> >> Taking a step back, I think it's worth asking why we have different >>> >> types, rather than simply making everything a LogicalType of bytes >>> >> (aka coder). Other than encoding format, the answer I can come up with >>> >> is that the type decides the kinds of operations that can be done on >>> >> it, e.g. does it support comparison? Arithmetic? Containment? >>> >> Higher-level date operations? Perhaps this should be used to guide the >>> >> set of types we provide. >>> > >>> > Also even though we could make everything a LogicalType (though at >>> least byte array would have to stay primitive), I think it's useful to >>> have a slightly larger set of primitive types. 
It makes things easier to >>> understand and debug, and it makes it simpler for the various SDKs to map >>> them to their types (e.g. mapping to POJOs). >>> >>> This would be the case if one didn't have LogicalType at all, but >>> once one introduces that one now has this more complicated two-level >>> hierarchy of types which doesn't seem simpler to me. >>> >>> I'm trying to understand what information Schema encodes that a >>> NamedTupleCoder (or RowCoder) would/could not. (Coders have the >>> disadvantage that there are multiple encodings of a single value, e.g. >>> BigEndian vs. VarInt, but if we have multiple resolutions of timestamp >>> that would still seem to be an issue. Possibly another advantage is >>> encoding into non-record-oriented formats, e.g. Parquet or Arrow, that >>> have a set of primitives.) >>> >>
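P.S. On the serialized_class vs. URN point in the quoted thread: a minimal sketch of the shape a portable logical-type reference could take (again, all names here are hypothetical, just to illustrate the idea):

    // Minimal sketch: a logical-type reference that carries a URN plus an
    // opaque payload instead of a serialized Java class, so non-Java SDKs
    // can either interpret it or pass it through. Hypothetical names only.
    enum RepresentationType { BYTE, INT16, INT32, INT64, STRING, BYTES, ROW }

    final class LogicalTypeRef {
      final String urn;                         // e.g. "beam:logical_type:decimal:v1" (made up)
      final byte[] payload;                     // type arguments, e.g. precision and scale
      final RepresentationType representation;  // primitive type used on the wire

      LogicalTypeRef(String urn, byte[] payload, RepresentationType representation) {
        this.urn = urn;
        this.payload = payload;
        this.representation = representation;
      }
    }

An SDK that doesn't recognize the URN can still decode and re-encode the value through the primitive representation; an SDK that does recognize it can surface a richer language type (BigDecimal, Instant, ...). That would keep behaviour consistent across the portability implementations without every SDK having to know every type.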
