From a Python type hints perspective, how do schemas fit? Type hints are currently used to determine which coder to use. It seems that, given a schema field, it would be useful to be able to convert it to a coder (using URNs?), and to convert that coder into a typing type. This would allow for pipeline-construction-time type compatibility checks.
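To make that concrete, here is a rough sketch of the kind of mapping I'm imagining (all names here are hypothetical, not existing Beam APIs):

    # Purely illustrative sketch: map schema atomic types to Python typing
    # types so that type compatibility can be checked at construction time.
    from typing import Any, NamedTuple, Optional

    # Hypothetical mapping from schema atomic type names to Python types.
    # Note that all four integer widths collapse onto Python's single int.
    ATOMIC_TYPE_TO_PY_TYPE = {
        'BYTE': int,
        'INT16': int,
        'INT32': int,
        'INT64': int,
        'FLOAT': float,
        'DOUBLE': float,
        'STRING': str,
        'BOOLEAN': bool,
        'BYTES': bytes,
    }

    def schema_field_to_typing(type_name: str, nullable: bool = False) -> Any:
        """Convert a (hypothetical) schema field type name to a typing type."""
        py_type = ATOMIC_TYPE_TO_PY_TYPE[type_name]
        return Optional[py_type] if nullable else py_type

    # A whole row could then surface as a NamedTuple hint for checks:
    class ExampleRow(NamedTuple):
        user_id: int          # INT64 in the schema
        name: Optional[str]   # nullable STRING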
Some questions:

1. Why are there four types of int (byte, int16, int32, int64)? Is it to maintain type fidelity when writing back? If so, what happens in languages that only have "int"?
2. What is encoding_position? How does it differ from id (which is also a position)?
3. When are schema protos constructed? Are they available during pipeline construction, or only afterwards?
4. Once data is read into a Beam pipeline and a schema is inferred, do we maintain the schema types throughout the pipeline, or use language-local types?

On Wed, May 8, 2019 at 6:39 PM Robert Bradshaw <[email protected]> wrote:
> From: Reuven Lax <[email protected]>
> Date: Wed, May 8, 2019 at 10:36 PM
> To: dev
>
> On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> wrote:
> >> Very excited to see this. In particular, I think this will be very useful for cross-language pipelines (not just SQL, but also for describing non-trivial data, e.g. for source and sink reuse).
> >>
> >> The proto specification makes sense to me. The only thing that looks like it's missing (other than possibly iterable, for arbitrarily-large support) is multimap. Another basic type, should we want to support it, is union (though this of course can get messy).
> >
> > multimap is an interesting suggestion. Do you have a use case in mind?
> >
> > union (or oneof) is also a good suggestion. There are good use cases for this, but this is a more fundamental change.
>
> No specific use case; they just seemed to round out the options.
>
> >> I'm curious what the rationale was for going with a oneof for type_info rather than a repeated components like we do with coders.
> >
> > No strong reason. Do you think repeated components is better than oneof?
>
> It's more consistent with how we currently do coders (which has pros and cons).
>
> >> Removing DATETIME as a logical coder on top of INT64 may cause issues of insufficient resolution and/or timespan. Similarly with DECIMAL (or would it be backed by string?)
> >
> > There could be multiple TIMESTAMP types for different resolutions, and they don't all need the same backing field type. E.g. the backing type for nanoseconds could be Row(INT64, INT64), or it could just be a byte array.
>
> Hmm... What would the value be in supporting different types of timestamps? Would all SDKs have to support all of them? Can one compare, take differences, etc. across timestamp types? (As Luke points out, the other conversation on timestamps is likely relevant here as well.)
>
> >> The biggest question, as far as portability is concerned at least, is the notion of logical types. serialized_class is clearly not portable, and I also think we'll want a way to share semantic meaning across SDKs (especially if things like dates become logical types). Perhaps URNs (+payloads) would be a better fit here?
> >
> > Yes, URN + payload is probably the better fit for portability.
> >
> >> Taking a step back, I think it's worth asking why we have different types, rather than simply making everything a LogicalType of bytes (aka coder). Other than encoding format, the answer I can come up with is that the type decides the kinds of operations that can be done on it, e.g. does it support comparison? Arithmetic? Containment? Higher-level date operations? Perhaps this should be used to guide the set of types we provide.
> > Also, even though we could make everything a LogicalType (though at least byte array would have to stay primitive), I think it's useful to have a slightly larger set of primitive types. It makes things easier to understand and debug, and it makes it simpler for the various SDKs to map them to their types (e.g. mapping to POJOs).
>
> This would be the case if one didn't have LogicalType at all, but once one introduces that, one now has this more complicated two-level hierarchy of types, which doesn't seem simpler to me.
>
> I'm trying to understand what information Schema encodes that a NamedTupleCoder (or RowCoder) would/could not. (Coders have the disadvantage that there are multiple encodings of a single value, e.g. BigEndian vs. VarInt, but if we have multiple resolutions of timestamp that would still seem to be an issue. Possibly another advantage is encoding into non-record-oriented formats, e.g. Parquet or Arrow, that have a set of primitives.)
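To make concrete the Schema-vs-RowCoder question above, here's a purely illustrative sketch (none of this is existing Beam code, and the URN is made up) of the extra semantic information a schema field with a URN + payload logical type could carry beyond what a coder alone describes:

    from typing import NamedTuple, Optional

    class LogicalType(NamedTuple):
        urn: str               # shared semantic meaning across SDKs
        payload: bytes         # parameters, e.g. a timestamp resolution
        representation: str    # backing primitive type, e.g. 'INT64'

    class Field(NamedTuple):
        name: str
        type_name: str                              # primitive type, or 'LOGICAL'
        logical_type: Optional[LogicalType] = None  # set only for logical types

    # A coder-only description would just say "8 bytes, big-endian"; the
    # schema view can additionally say what the field means.
    event_time = Field(
        name='event_time',
        type_name='LOGICAL',
        logical_type=LogicalType(
            urn='beam:logical_type:millis_instant:v1',  # hypothetical URN
            payload=b'',
            representation='INT64',
        ),
    )

The URN is what would let another SDK (or a runner) recognize the type without a serialized_class, which seems like the portable analogue of what the Java class gives you today.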
