From a Python type hints perspective, how do schemas fit? Type hints are
currently used to determine which coder to use.
It seems that given a schema field, it would be useful to be able to
convert it to a coder (using URNs?), and to convert the coder into a typing
type.
This would allow for pipeline-construction-time type compatibility checks.
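To make that concrete, here is a hypothetical sketch (not the actual Beam API) of what a schema-field-to-typing-type conversion could look like; the type names and function are my own invention for illustration:

```python
import typing

# Hypothetical sketch (not actual Beam code): map schema atomic type names
# to Python typing types, so pipeline construction could check a
# schema-described PCollection against a transform's type hints.
_ATOMIC_TO_TYPING = {
    "BYTE": int,
    "INT16": int,
    "INT32": int,
    "INT64": int,   # all four widths collapse to Python's unbounded int
    "FLOAT": float,
    "DOUBLE": float,
    "STRING": str,
    "BOOLEAN": bool,
    "BYTES": bytes,
}

def schema_field_to_typing(field_type: str, nullable: bool = False):
    """Convert a (hypothetical) schema field type name to a typing type."""
    t = _ATOMIC_TO_TYPING[field_type]
    return typing.Optional[t] if nullable else t
```

A construction-time check would then reduce to comparing these typing types against the hints declared on a transform.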

Some questions:
1. Why are there 4 types of int (byte, int16, int32, int64)? Is it to
maintain type fidelity when writing back? If so, what happens in languages
that only have "int"?
2. What is encoding_position? How does it differ from id (which is also a
position)?
3. When are schema protos constructed? Are they available during pipeline
construction or afterwards?
4. Once data is read into a Beam pipeline and a schema inferred, do we
maintain the schema types throughout the pipeline or use language-local
types?
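To make question 1 concrete: a language with a single "int" (like Python) decodes all four widths into the same in-memory type, while the width still matters on the wire. A standalone illustration using the standard struct module (my own example, not Beam code):

```python
import struct

# The same logical value occupies different encoded widths; a language
# with only one "int" type maps them all to the same in-memory value.
value = 42
encoded = {
    "INT16": struct.pack(">h", value),  # 2 bytes
    "INT32": struct.pack(">i", value),  # 4 bytes
    "INT64": struct.pack(">q", value),  # 8 bytes
}
decoded = {
    name: struct.unpack(fmt, data)[0]
    for (name, data), fmt in zip(encoded.items(), (">h", ">i", ">q"))
}
# All decode back to the same Python int, but writing back through a
# narrower type than was read would lose width (type fidelity).
```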


On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]> wrote:

> From: Reuven Lax <[email protected]>
> Date: Wed, May 8, 2019 at 10:36 PM
> To: dev
>
> > On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]>
> wrote:
> >>
> >> Very excited to see this. In particular, I think this will be very
> >> useful for cross-language pipelines (not just SQL, but also for
> >> describing non-trivial data, e.g. for source and sink reuse).
> >>
> >> The proto specification makes sense to me. The only thing that looks
> >> like it's missing (other than possibly iterable, to support arbitrarily
> >> large collections) is multimap. Another basic type, should we want to support
> >> it, is union (though this of course can get messy).
> >
> > multimap is an interesting suggestion. Do you have a use case in mind?
> >
> > union (or oneof) is also a good suggestion. There are good use cases for
> it, but it is a more fundamental change.
>
> No specific use case; they just seemed to round out the options.
>
> >> I'm curious what the rationale was for going with a oneof for type_info
> >> rather than a repeated components field, as we do with coders.
> >
> > No strong reason. Do you think repeated components is better than oneof?
>
> It's more consistent with how we currently do coders (which has pros and
> cons).
>
> >> Removing DATETIME as a logical coder on top of INT64 may cause issues
> >> of insufficient resolution and/or timespan. Similarly with DECIMAL (or
> >> would it be backed by string?)
> >
> > There could be multiple TIMESTAMP types for different resolutions, and
> they don't all need the same backing field type. E.g. the backing type for
> nanoseconds could be Row(INT64, INT64), or it could just be a byte array.
>
> Hmm.... What would the value be in supporting different types of
> timestamps? Would all SDKs have to support all of them? Can one
> compare, take differences, etc. across timestamp types? (As Luke
> points out, the other conversation on timestamps is likely relevant
> here as well.)
>
> >> The biggest question, as far as portability is concerned at least, is
> >> the notion of logical types. serialized_class is clearly not portable,
> >> and I also think we'll want a way to share semantic meaning across
> >> SDKs (especially if things like dates become logical types). Perhaps
> >> URNs (+payloads) would be a better fit here?
> >
> > Yes, URN + payload is probably the better fit for portability.
> >
> >> Taking a step back, I think it's worth asking why we have different
> >> types, rather than simply making everything a LogicalType of bytes
> >> (aka coder). Other than encoding format, the answer I can come up with
> >> is that the type decides the kinds of operations that can be done on
> >> it, e.g. does it support comparison? Arithmetic? Containment?
> >> Higher-level date operations? Perhaps this should be used to guide the
> >> set of types we provide.
> >
> > Also even though we could make everything a LogicalType (though at least
> byte array would have to stay primitive), I think it's useful to have a
> slightly larger set of primitive types.  It makes things easier to
> understand and debug, and it makes it simpler for the various SDKs to map
> them to their types (e.g. mapping to POJOs).
>
> This would be the case if one didn't have LogicalType at all, but
> once one introduces that one now has this more complicated two-level
> hierarchy of types which doesn't seem simpler to me.
>
> I'm trying to understand what information Schema encodes that a
> NamedTupleCoder (or RowCoder) would/could not. (Coders have the
> disadvantage that there are multiple encodings of a single value, e.g.
> BigEndian vs. VarInt, but if we have multiple resolutions of timestamp
> that would still seem to be an issue. Possibly another advantage is
> encoding into non-record-oriented formats, e.g. Parquet or Arrow, that
> have a set of primitives.)
>
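To illustrate the multiple-encodings point above, here is a standalone sketch (not Beam code) of the same integer under a fixed-width big-endian encoding versus a protobuf-style varint: two different byte strings for one value, which is exactly the ambiguity a schema avoids by pinning down only the logical type.

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # set continuation bit
        else:
            out.append(b)
            return bytes(out)

value = 300
big_endian = value.to_bytes(8, "big")  # fixed 8-byte encoding
varint = varint_encode(value)          # variable-length encoding
# Same value, different bytes: a coder fixes one particular byte
# encoding, while a schema only fixes the logical type.
```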
