On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel <[email protected]> wrote:
> My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is
> a better name for DATETIME) a first-class citizen, we will get
> *inconsistencies* between the different portability implementations. The
> same holds true for DECIMAL and DURATION. If we aren't giving pipeline
> developers a consistent way of working with timestamps, we're going to
> generate a lot of frustration.

This is a fair concern. However, logical types have unique IDs/URNs, so we
can still make TIMESTAMP a first-class citizen. The only difference is that
it will not be considered a primitive type.

> I always said "TIMESTAMPs are the nail in the coffin of data
> engineers"...
>
> For the rest, it's a bit too early to give a lot of informed input here,
> as I just started working with schemas for my protobuf implementation.
>
> _/
> _/ Alex Van Boxel
>
>
> On Thu, May 9, 2019 at 10:05 AM Kenneth Knowles <[email protected]> wrote:
>
>> This is a huge development. Top-posting because I can be more compact.
>>
>> I really think that after the initial idea converges, this needs a
>> design doc with goals and alternatives. It is an extraordinarily
>> consequential model change. So in the spirit of doing the work / bias
>> towards action, I created a quick draft at
>> https://s.apache.org/beam-schemas and added everyone on this thread as
>> editors. I am still in the process of writing this to match the thread.
>>
>> *Multiple timestamp resolutions*: you can use logical types to
>> represent nanos the same way Java and proto do.
>>
>> *Why multiple int types?* The domains of values for these types are
>> different. For a language with one "int" or "number" type, that's just
>> another domain of values.
>>
>> *Columnar/Arrow*: making sure we unlock the ability to take this path
>> is paramount. So tying it directly to a row-oriented coder seems
>> counterproductive.
>>
>> *Nullable/optional*: optional as it exists in Java, Haskell, Scala,
>> OCaml, etc. is strictly more expressive than the billion-dollar
>> mistake. Nullability of a field is different from, and less expressive
>> than, nullability of a type.
>>
>> *Union types*: tagged disjoint unions and oneof are the most useful
>> form of union. Embedding them into a relational model, you get
>> something like proto's oneof. Not too hard to add later.
>>
>> *Multimap*: what does it add over an array-valued map or a
>> large-iterable-valued map? (Honest question, not rhetorical.)
>>
>> *id* is a loaded term in other places in the model. I would call it
>> something else.
>>
>> *URN/enum for type names*: I see the case for both. The core types are
>> fundamental enough that they should never really change - after all,
>> proto, thrift, avro, and arrow have addressed this (not to mention most
>> programming languages). Maybe additions once every few years. I prefer
>> the smallest intersection of these schema languages. A oneof is
>> clearer, while a URN emphasizes the similarity of built-in and logical
>> types.
>>
>> *Multiple encodings of a value*: I actually think this is a benefit.
>> There's a lot to unpack here.
>>
>> *Language specifics*: the design doc should describe the domain of
>> values, and this should go in the core docs. Then for each SDK it
>> should explicitly say which language type (or types?) the values are
>> embedded in, just like proto's language guides.
>>
>> Kenn
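To make the logical-type approach concrete, here is a rough Python sketch
of what a URN-identified nanosecond timestamp could look like in an SDK.
Every name in it (the URN string, the classes, the conversion methods) is
illustrative only, not part of the actual proposal:

from typing import NamedTuple, Tuple

# Hypothetical URN; the real identifier would be fixed by the design doc.
NANOS_INSTANT_URN = "beam:logical_type:nanos_instant:v1"

class NanosInstant(NamedTuple):
    """Nanosecond-resolution timestamp, as Java and proto represent it."""
    seconds: int  # backed by an INT64 schema field
    nanos: int    # backed by an INT64 (or INT32) schema field

class NanosInstantLogicalType:
    """A logical type = a URN plus a base type plus the two conversions."""
    urn = NANOS_INSTANT_URN

    def to_base_type(self, value: NanosInstant) -> Tuple[int, int]:
        # Base representation: conceptually Row(INT64, INT64).
        return (value.seconds, value.nanos)

    def to_language_type(self, base: Tuple[int, int]) -> NanosInstant:
        return NanosInstant(*base)

Because every SDK keys off the same URN, TIMESTAMP stays first-class for
pipeline authors even though the proto doesn't treat it as primitive.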
>>
>> *From:* Udi Meiri <[email protected]>
>> *Date:* Wed, May 8, 2019, 18:48
>> *To:* <[email protected]>
>>
>>> From a Python type hints perspective, how do schemas fit? Type hints
>>> are currently used to determine which coder to use.
>>> It seems that given a schema field, it would be useful to be able to
>>> convert it to a coder (using URNs?), and to convert the coder into a
>>> typing type.
>>> This would allow for pipeline-construction-time type compatibility
>>> checks.
>>>
>>> Some questions:
>>> 1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to
>>> maintain type fidelity when writing back? If so, what happens in
>>> languages that only have "int"?
>>> 2. What is encoding_position? How does it differ from id (which is
>>> also a position)?
>>> 3. When are schema protos constructed? Are they available during
>>> pipeline construction, or afterwards?
>>> 4. Once data is read into a Beam pipeline and a schema is inferred, do
>>> we maintain the schema types throughout the pipeline, or use
>>> language-local types?
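On question 1: as Kenn notes above, the four int types have different
domains of values. In a language with a single "int" they would presumably
all surface as that one type hint, with the width kept in the schema rather
than in the hint. A sketch of the kind of mapping this implies (the dict
and helper are hypothetical; the type names follow the proto draft):

# Illustrative mapping from schema atomic types to Python typing types.
# All four integer widths collapse to Python's single "int"; the width
# lives in the schema, not in the language-level type hint.
_ATOMIC_TO_TYPING = {
    "BYTE": int,
    "INT16": int,
    "INT32": int,
    "INT64": int,
    "FLOAT": float,
    "DOUBLE": float,
    "STRING": str,
    "BOOLEAN": bool,
    "BYTES": bytes,
}

def typing_type_for(atomic_type: str) -> type:
    # Hypothetical helper: schema field type name -> Python typing type.
    return _ATOMIC_TO_TYPING[atomic_type]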
>>>
>>> On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> From: Reuven Lax <[email protected]>
>>>> Date: Wed, May 8, 2019 at 10:36 PM
>>>> To: dev
>>>>
>>>> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> Very excited to see this. In particular, I think this will be very
>>>> >> useful for cross-language pipelines: not just SQL, but also for
>>>> >> describing non-trivial data (e.g. for source and sink reuse).
>>>> >>
>>>> >> The proto specification makes sense to me. The only thing that
>>>> >> looks like it's missing (other than possibly iterable, for
>>>> >> arbitrarily-large support) is multimap. Another basic type, should
>>>> >> we want to support it, is union (though this of course can get
>>>> >> messy).
>>>> >
>>>> > multimap is an interesting suggestion. Do you have a use case in
>>>> > mind?
>>>> >
>>>> > union (or oneof) is also a good suggestion. There are good use
>>>> > cases for this, but it is a more fundamental change.
>>>>
>>>> No specific use case; they just seemed to round out the options.
>>>>
>>>> >> I'm curious what the rationale was for going with a oneof for
>>>> >> type_info rather than a repeated components field like we do with
>>>> >> coders.
>>>> >
>>>> > No strong reason. Do you think repeated components is better than a
>>>> > oneof?
>>>>
>>>> It's more consistent with how we currently do coders (which has pros
>>>> and cons).
>>>>
>>>> >> Removing DATETIME as a logical coder on top of INT64 may cause
>>>> >> issues of insufficient resolution and/or timespan. Similarly with
>>>> >> DECIMAL (or would it be backed by string?)
>>>> >
>>>> > There could be multiple TIMESTAMP types for different resolutions,
>>>> > and they don't all need the same backing field type. E.g. the
>>>> > backing type for nanoseconds could be Row(INT64, INT64), or it
>>>> > could just be a byte array.
>>>>
>>>> Hmm... what would the value be in supporting different types of
>>>> timestamps? Would all SDKs have to support all of them? Can one
>>>> compare, take differences, etc. across timestamp types? (As Luke
>>>> points out, the other conversation on timestamps is likely relevant
>>>> here as well.)
>>>>
>>>> >> The biggest question, as far as portability is concerned at least,
>>>> >> is the notion of logical types. serialized_class is clearly not
>>>> >> portable, and I also think we'll want a way to share semantic
>>>> >> meaning across SDKs (especially if things like dates become
>>>> >> logical types). Perhaps URNs (+ payloads) would be a better fit
>>>> >> here?
>>>> >
>>>> > Yes, URN + payload is probably the better fit for portability.
>>>> >
>>>> >> Taking a step back, I think it's worth asking why we have
>>>> >> different types, rather than simply making everything a
>>>> >> LogicalType of bytes (aka a coder). Other than encoding format,
>>>> >> the answer I can come up with is that the type decides the kinds
>>>> >> of operations that can be done on it, e.g. does it support
>>>> >> comparison? Arithmetic? Containment? Higher-level date operations?
>>>> >> Perhaps this should be used to guide the set of types we provide.
>>>> >
>>>> > Also, even though we could make everything a LogicalType (though at
>>>> > least byte array would have to stay primitive), I think it's useful
>>>> > to have a slightly larger set of primitive types. It makes things
>>>> > easier to understand and debug, and it makes it simpler for the
>>>> > various SDKs to map them to their types (e.g. mapping to POJOs).
>>>>
>>>> That would be the case if one didn't have LogicalType at all, but
>>>> once one introduces it, one now has this more complicated two-level
>>>> hierarchy of types, which doesn't seem simpler to me.
>>>>
>>>> I'm trying to understand what information a Schema encodes that a
>>>> NamedTupleCoder (or RowCoder) would or could not. (Coders have the
>>>> disadvantage that there are multiple encodings of a single value,
>>>> e.g. BigEndian vs. VarInt, but if we have multiple resolutions of
>>>> timestamp that would still seem to be an issue. Possibly another
>>>> advantage is encoding into non-record-oriented formats, e.g. Parquet
>>>> or Arrow, that have their own sets of primitives.)
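On the "multiple encodings of a single value" point in that last
paragraph, a small sketch of why the schema/coder distinction matters: the
same INT64 value is one element of the schema's domain of values but has
many possible byte encodings. The two encoders below are standalone
illustrations, not the actual Beam coders:

import struct

def encode_big_endian_int64(n: int) -> bytes:
    # Fixed-width big-endian, in the style of Beam's BigEndianLongCoder.
    return struct.pack(">q", n)

def encode_varint(n: int) -> bytes:
    # LEB128-style varint as used by proto; sketch for non-negative
    # values only.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# One schema-level value, two wire encodings:
assert encode_big_endian_int64(300) == b"\x00\x00\x00\x00\x00\x00\x01\x2c"
assert encode_varint(300) == b"\xac\x02"

A schema only pins down the domain of values; which of these byte
representations crosses the wire is a separate, coder-level choice.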
