OK, fair. This parallels how timestamps are implemented in protobuf. Then it's important (and I'll add this to the design doc) that we have a list of standard logical types.
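For example, a standard TIMESTAMP logical type could mirror proto's well-known Timestamp message: a stable identifier plus a seconds/nanos base representation. A minimal sketch (the URN string and all class/method names here are invented for illustration, not an existing Beam API):

```java
import java.time.Instant;

/**
 * Hypothetical sketch of a standard TIMESTAMP logical type, mirroring
 * proto's well-known Timestamp message (seconds + nanos). The URN string
 * and all names here are illustrative, not an existing Beam API.
 */
public final class TimestampLogicalType {

  // A stable URN identifies the logical type across SDKs and runners.
  public static final String URN = "beam:logical_type:timestamp_nanos:v1";

  // Base representation: two integer fields, like proto's Timestamp.
  public static final class Base {
    public final long seconds; // INT64: seconds since the epoch
    public final int nanos;    // INT32: sub-second nanoseconds

    public Base(long seconds, int nanos) {
      this.seconds = seconds;
      this.nanos = nanos;
    }
  }

  // Convert from the language type to the base representation...
  public static Base toBase(Instant t) {
    return new Base(t.getEpochSecond(), t.getNano());
  }

  // ...and back. SDKs agree on the base; each maps to its own type.
  public static Instant fromBase(Base b) {
    return Instant.ofEpochSecond(b.seconds, b.nanos);
  }
}
```

If every SDK agrees on the URN and the base representation, the timestamp stays consistent across portability implementations without being a primitive.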
_/
_/ Alex Van Boxel


On Thu, May 9, 2019 at 4:11 PM Reuven Lax <[email protected]> wrote:

> On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel <[email protected]> wrote:
>
>> My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is
>> a better name for DATETIME) a first-class citizen, we'll get
>> *inconsistencies* between the different portability implementations. The
>> same holds true for DECIMAL and DURATION. If we don't give pipeline
>> developers a consistent way of working with timestamps, we're going to
>> generate a lot of frustration.
>
> This is a fair concern. However, logical types have unique ids/urns, so
> we can still make TIMESTAMP a first-class citizen. The only difference is
> that it will not be considered a primitive type.
>
>> I always said "TIMESTAMPs are the nail in the coffin of data
>> engineers"...
>>
>> For the rest, it's a bit too early for me to give much informed input
>> here, as I just started working with schemas for my protobuf
>> implementation.
>>
>> _/
>> _/ Alex Van Boxel
>>
>>
>> On Thu, May 9, 2019 at 10:05 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> This is a huge development. Top-posting because I can be more compact.
>>>
>>> I really think that after the initial idea converges this needs a
>>> design doc with goals and alternatives. It is an extraordinarily
>>> consequential model change. So in the spirit of doing the work / bias
>>> towards action, I created a quick draft at
>>> https://s.apache.org/beam-schemas and added everyone on this thread as
>>> editors. I am still in the process of writing this to match the thread.
>>>
>>> *Multiple timestamp resolutions*: you can use logical types to
>>> represent nanos the same way Java and proto do.
>>>
>>> *Why multiple int types?* The domains of values for these types are
>>> different. For a language with one "int" or "number" type, that's
>>> another domain of values.
>>>
>>> *Columnar/Arrow*: making sure we unlock the ability to take this path
>>> is paramount. So tying it directly to a row-oriented coder seems
>>> counterproductive.
>>>
>>> *Nullable/optional*: optional as it exists in Java, Haskell, Scala,
>>> OCaml, etc. is strictly more expressive than the billion-dollar
>>> mistake. Nullability of a field is different from, and less expressive
>>> than, nullability of a type.
>>>
>>> *Union types*: tagged disjoint unions and oneof are the most useful
>>> form of union. Embedding them into a relational model, you get
>>> something like proto's oneof. Not too hard to add later.
>>>
>>> *Multimap*: what does it add over an array-valued map or
>>> large-iterable-valued map? (honest question, not rhetorical)
>>>
>>> *id* is a loaded term in other places in the model. I would call it
>>> something else.
>>>
>>> *URN/enum for type names*: I see the case for both. The core types are
>>> fundamental enough that they should never really change - after all,
>>> proto, thrift, avro, and arrow have addressed this (not to mention most
>>> programming languages). Maybe additions once every few years. I prefer
>>> the smallest intersection of these schema languages. A oneof is
>>> clearer, while a URN emphasizes the similarity of built-in and logical
>>> types.
>>>
>>> *Multiple encodings of a value*: I actually think this is a benefit.
>>> There's a lot to unpack here.
>>>
>>> *Language specifics*: the design doc should describe the domain of
>>> values, and this should go in the core docs. Then for each SDK it
>>> should explicitly say which language type (or types?) the values are
>>> embedded in. Just like proto's language guides.
>>>
>>> Kenn
>>>
>>> *From: *Udi Meiri <[email protected]>
>>> *Date: *Wed, May 8, 2019, 18:48
>>> *To: * <[email protected]>
>>>
>>>> From a Python type hints perspective, how do schemas fit? Type hints
>>>> are currently used to determine which coder to use. It seems that,
>>>> given a schema field, it would be useful to be able to convert it to
>>>> a coder (using URNs?), and to convert the coder into a typing type.
>>>> This would allow for pipeline-construction-time type compatibility
>>>> checks.
>>>>
>>>> Some questions:
>>>> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to
>>>> maintain type fidelity when writing back? If so, what happens in
>>>> languages that only have "int"?
>>>> 2. What is encoding_position? How does it differ from id (which is
>>>> also a position)?
>>>> 3. When are schema protos constructed? Are they available during
>>>> pipeline construction or afterwards?
>>>> 4. Once data is read into a Beam pipeline and a schema inferred, do
>>>> we maintain the schema types throughout the pipeline or use
>>>> language-local types?
>>>>
>>>>
>>>> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> From: Reuven Lax <[email protected]>
>>>>> Date: Wed, May 8, 2019 at 10:36 PM
>>>>> To: dev
>>>>>
>>>>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> Very excited to see this. In particular, I think this will be
>>>>> >> very useful for cross-language pipelines (not just SQL, but also
>>>>> >> for describing non-trivial data, e.g. for source and sink reuse).
>>>>> >>
>>>>> >> The proto specification makes sense to me. The only thing that
>>>>> >> looks like it's missing (other than possibly iterable, for
>>>>> >> arbitrarily-large support) is multimap. Another basic type,
>>>>> >> should we want to support it, is union (though this of course can
>>>>> >> get messy).
>>>>> >
>>>>> > multimap is an interesting suggestion. Do you have a use case in
>>>>> > mind?
>>>>> >
>>>>> > union (or oneof) is also a good suggestion. There are good use
>>>>> > cases for this, but this is a more fundamental change.
>>>>>
>>>>> No specific use case; they just seemed to round out the options.
>>>>>
>>>>> >> I'm curious what the rationale was for going with a oneof for
>>>>> >> type_info rather than repeated components like we do with coders.
>>>>> >
>>>>> > No strong reason. Do you think repeated components is better than
>>>>> > oneof?
>>>>>
>>>>> It's more consistent with how we currently do coders (which has pros
>>>>> and cons).
>>>>>
>>>>> >> Removing DATETIME as a logical coder on top of INT64 may cause
>>>>> >> issues of insufficient resolution and/or timespan. Similarly with
>>>>> >> DECIMAL (or would it be backed by string?)
>>>>> >
>>>>> > There could be multiple TIMESTAMP types for different resolutions,
>>>>> > and they don't all need the same backing field type. E.g. the
>>>>> > backing type for nanoseconds could be Row(INT64, INT64), or it
>>>>> > could just be a byte array.
>>>>>
>>>>> Hmm... What would the value be in supporting different types of
>>>>> timestamps? Would all SDKs have to support all of them? Can one
>>>>> compare, take differences, etc. across timestamp types? (As Luke
>>>>> points out, the other conversation on timestamps is likely relevant
>>>>> here as well.)
>>>>>
>>>>> >> The biggest question, as far as portability is concerned at
>>>>> >> least, is the notion of logical types. serialized_class is
>>>>> >> clearly not portable, and I also think we'll want a way to share
>>>>> >> semantic meaning across SDKs (especially if things like dates
>>>>> >> become logical types). Perhaps URNs (+payloads) would be a better
>>>>> >> fit here?
>>>>> >
>>>>> > Yes, URN + payload is probably the better fit for portability.
>>>>> >
>>>>> >> Taking a step back, I think it's worth asking why we have
>>>>> >> different types, rather than simply making everything a
>>>>> >> LogicalType of bytes (aka coder). Other than encoding format, the
>>>>> >> answer I can come up with is that the type decides the kinds of
>>>>> >> operations that can be done on it, e.g. does it support
>>>>> >> comparison? Arithmetic? Containment? Higher-level date
>>>>> >> operations? Perhaps this should be used to guide the set of types
>>>>> >> we provide.
>>>>> >
>>>>> > Also, even though we could make everything a LogicalType (though
>>>>> > at least byte array would have to stay primitive), I think it's
>>>>> > useful to have a slightly larger set of primitive types. It makes
>>>>> > things easier to understand and debug, and it makes it simpler for
>>>>> > the various SDKs to map them to their types (e.g. mapping to
>>>>> > POJOs).
>>>>>
>>>>> This would be the case if one didn't have LogicalType at all, but
>>>>> once one introduces that, one now has this more complicated
>>>>> two-level hierarchy of types, which doesn't seem simpler to me.
>>>>>
>>>>> I'm trying to understand what information a Schema encodes that a
>>>>> NamedTupleCoder (or RowCoder) would/could not. (Coders have the
>>>>> disadvantage that there are multiple encodings of a single value,
>>>>> e.g. BigEndian vs. VarInt, but if we have multiple resolutions of
>>>>> timestamp that would still seem to be an issue. Possibly another
>>>>> advantage is encoding into non-record-oriented formats, e.g. Parquet
>>>>> or Arrow, that have a set of primitives.)
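A few sketches to make points from the thread above concrete. First, Kenn's nullable/optional distinction: field-level nullability gives one extra state per field, while Optional is a type, so it composes and nests. A minimal Java sketch (class and names invented for illustration):

```java
import java.util.Optional;

// Sketch of the nullable-vs-optional distinction (names invented for
// illustration). Field-level nullability cannot nest; Optional can.
public final class NullableVsOptional {

  // Nullability of a *field*: null means "absent". There is no way to
  // express "present, but holding an absent value".
  static String nullableField = null;

  // Nullability (optionality) of a *type*: it nests, so it is strictly
  // more expressive.
  static Optional<Optional<String>> optionalType =
      Optional.of(Optional.empty()); // a present value that is itself absent

  public static void main(String[] args) {
    System.out.println(nullableField == null);           // true: absent
    System.out.println(optionalType.isPresent());        // true: outer present
    System.out.println(optionalType.get().isPresent());  // false: inner absent
  }
}
```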
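Second, Udi's question about the four int types: each one is a distinct domain of values, and each SDK maps them onto its own types. A sketch of one possible mapping (the enum and its names are hypothetical):

```java
// Sketch (names invented) of why four integer types exist: each is a
// distinct domain of values, and each SDK maps them to its own types.
public enum SchemaIntType {
  BYTE(Byte.class),     // values in [-2^7,  2^7)
  INT16(Short.class),   // values in [-2^15, 2^15)
  INT32(Integer.class), // values in [-2^31, 2^31)
  INT64(Long.class);    // values in [-2^63, 2^63)

  public final Class<?> javaType;

  SchemaIntType(Class<?> javaType) {
    this.javaType = javaType;
  }
}
```

A language with a single arbitrary-precision int (e.g. Python) would presumably map all four onto it and range-check when writing back, which is what preserves type fidelity.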
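Third, the URN + payload idea that Robert and Reuven converge on for portable logical types: instead of a serialized Java class, a logical type is referenced by a stable URN plus an opaque, type-specific payload that any SDK can interpret. A sketch (the URN strings and field names are invented):

```java
import java.nio.charset.StandardCharsets;

// Sketch of URN + payload for portable logical types (URN strings and
// field names invented). Any SDK that recognizes the URN can interpret
// the payload; no Java class serialization is involved.
public final class LogicalTypeRef {
  public final String urn;     // identifies the logical type across SDKs
  public final byte[] payload; // type-specific parameters, if any

  public LogicalTypeRef(String urn, byte[] payload) {
    this.urn = urn;
    this.payload = payload;
  }

  public static void main(String[] args) {
    // A parameterized DECIMAL logical type, for example:
    LogicalTypeRef decimal = new LogicalTypeRef(
        "beam:logical_type:decimal:v1",
        "precision=38;scale=9".getBytes(StandardCharsets.UTF_8));
    System.out.println(decimal.urn);
  }
}
```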
