Re: [DISCUSS] Portability representation of schemas

Brian Hulette Thu, 09 May 2019 14:03:09 -0700

We briefly discussed using arrow schemas in place of beam schemas entirely
in an arrow thread [1]. The biggest reason not to do this was that we wanted
to have a type for large iterables in beam schemas. But given that large
iterables aren't currently implemented, beam schemas look very similar to
arrow schemas.


I think it makes sense to take inspiration from arrow schemas where
possible, and maybe even copy them outright. Arrow already has a portable
(flatbuffers) schema representation [2], and implementations for it in many
languages that we may be able to re-use as we bring schemas to more SDKs
(the project has Python and Go implementations). There are a couple of
concepts in Arrow schemas that are specific to the format and wouldn't
make sense for us (fields can indicate whether or not they are dictionary
encoded, and the schema has an endianness field), but if you drop those
concepts the arrow spec looks pretty similar to the beam proto spec.
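
To make "drop those concepts" concrete, here is a rough sketch in plain
Python. The dict layout is a hypothetical stand-in for an Arrow schema, not
the real flatbuffers or pyarrow API; only the two key names ("dictionary" on
a field, "endianness" on the schema) are meant to mirror the concepts
mentioned above.

```python
from typing import Any, Dict

# Format-specific concepts we would not carry over (assumed key names).
ARROW_ONLY_FIELD_KEYS = {"dictionary"}   # dictionary-encoding marker on a field
ARROW_ONLY_SCHEMA_KEYS = {"endianness"}  # byte-order marker on the schema


def strip_format_specific(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an Arrow-like schema dict without format-only keys."""
    stripped = {
        key: value
        for key, value in schema.items()
        if key not in ARROW_ONLY_SCHEMA_KEYS
    }
    stripped["fields"] = [
        {k: v for k, v in field.items() if k not in ARROW_ONLY_FIELD_KEYS}
        for field in schema["fields"]
    ]
    return stripped


# Hypothetical example input: two fields, one of them dictionary encoded.
arrow_like = {
    "endianness": "Little",
    "fields": [
        {"name": "id", "type": "int64", "nullable": False, "dictionary": None},
        {"name": "tag", "type": "utf8", "nullable": True,
         "dictionary": {"id": 0}},
    ],
}
beam_like = strip_format_specific(arrow_like)
```

What remains (field name, type, nullability) is roughly the part that
overlaps with the beam proto spec.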

Brian

[1]
https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
[2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194

*From: *Robert Bradshaw <[email protected]>
*Date: *Thu, May 9, 2019 at 1:38 PM
*To: *dev

> From: Reuven Lax <[email protected]>
> Date: Thu, May 9, 2019 at 7:29 PM
> To: dev
>
> > Also in the future we might be able to do optimizations at the runner
> > level if at the portability layer we understood schemas instead of just
> > raw coders. This could be things like only parsing a subset of a row (if
> > we know only a few fields are accessed) or using a columnar data
> > structure like Arrow to encode batches of rows across portability. This
> > doesn't affect data semantics of course, but having a richer,
> > more-expressive type system opens up other opportunities.
>
> But we could do all of that with a RowCoder we understood to designate
> the type(s), right?
>
> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <[email protected]> wrote:
> >>
> >> On the flip side, Schemas are equivalent to the space of Coders with
> >> the addition of a RowCoder and the ability to materialize to something
> >> other than bytes, right? (Perhaps I'm missing something big here...)
> >> This may make a backwards-compatible transition easier. (SDK-side, the
> >> ability to reason about and operate on such types is of course much
> >> richer than anything Coders offer right now.)
> >>
> >> From: Reuven Lax <[email protected]>
> >> Date: Thu, May 9, 2019 at 4:52 PM
> >> To: dev
> >>
> >> > FYI I can imagine a world in which we have no coders. We could define
> >> > the entire model on top of schemas. Today's "Coder" is completely
> >> > equivalent to a single-field schema with a logical-type field (actually
> >> > the latter is slightly more expressive as you aren't forced to serialize
> >> > into bytes).
> >> >
> >> > Due to compatibility constraints and the effort that would be
> >> > involved in such a change, I think the practical decision should be for
> >> > schemas and coders to coexist for the time being. However when we start
> >> > planning Beam 3.0, deprecating coders is something I would like to
> >> > suggest.
> >> >
> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <[email protected]> wrote:
> >> >>
> >> >> From: Kenneth Knowles <[email protected]>
> >> >> Date: Thu, May 9, 2019 at 10:05 AM
> >> >> To: dev
> >> >>
> >> >> > This is a huge development. Top posting because I can be more
> >> >> > compact.
> >> >> >
> >> >> > I really think after the initial idea converges this needs a
> >> >> > design doc with goals and alternatives. It is an extraordinarily
> >> >> > consequential model change. So in the spirit of doing the work / bias
> >> >> > towards action, I created a quick draft at
> >> >> > https://s.apache.org/beam-schemas and added everyone on this thread as
> >> >> > editors. I am still in the process of writing this to match the thread.
> >> >>
> >> >> Thanks! Added some comments there.
> >> >>
> >> >> > *Multiple timestamp resolutions*: you can use logical types to
> >> >> > represent nanos the same way Java and proto do.
> >> >>
> >> >> As per the other discussion, I'm unsure the value in supporting
> >> >> multiple timestamp resolutions is high enough to outweigh the cost.
> >> >>
> >> >> > *Why multiple int types?* The domains of values for these types are
> >> >> > different. For a language with one "int" or "number" type, that's
> >> >> > another domain of values.
> >> >>
> >> >> What is the value in having different domains? If your data has a
> >> >> natural domain, chances are it doesn't line up exactly with one of
> >> >> these. I guess it's for languages whose types have specific domains?
> >> >> (There's also compactness in representation, encoded and in-memory,
> >> >> though I'm not sure that's high.)
> >> >>
> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take this
> >> >> > path is paramount. So tying it directly to a row-oriented coder seems
> >> >> > counterproductive.
> >> >>
> >> >> I don't think Coders are necessarily row-oriented. They are, however,
> >> >> bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
> >> >> overlap between what Coders express in terms of element typing
> >> >> information and what Schemas express, and I'd rather have one concept
> >> >> if possible. Or have a clear division of responsibilities.
> >> >>
> >> >> > *Multimap*: what does it add over an array-valued map or
> >> >> > large-iterable-valued map? (honest question, not rhetorical)
> >> >>
> >> >> Multimap has a different notion of what it means to contain a value,
> >> >> can handle (unordered) unions of non-disjoint keys, etc. Maybe this
> >> >> isn't worth a new primitive type.
> >> >>
> >> >> > *URN/enum for type names*: I see the case for both. The core types
> >> >> > are fundamental enough they should never really change - after all,
> >> >> > proto, thrift, avro, arrow, have addressed this (not to mention most
> >> >> > programming languages). Maybe additions once every few years. I prefer
> >> >> > the smallest intersection of these schema languages. A oneof is more
> >> >> > clear, while URN emphasizes the similarity of built-in and logical
> >> >> > types.
> >> >>
> >> >> Hmm... Do we have any examples of the multi-level primitive/logical
> >> >> type in any of these other systems? I have a bias towards all types
> >> >> being on the same footing unless there is compelling reason to divide
> >> >> things into primitive/user-defined ones.
> >> >>
> >> >> Here it seems like the most essential value of the primitive type set
> >> >> is to describe the underlying representation, for encoding elements in
> >> >> a variety of ways (notably columnar, but also interfacing with other
> >> >> external systems like IOs). Perhaps, rather than the previous
> >> >> suggestion of making everything a logical of bytes, this could be made
> >> >> clear by still making everything a logical type, but renaming
> >> >> "TypeName" to Representation. There would be URNs (typically with
> >> >> empty payloads) for the various primitive types (whose mapping to
> >> >> their representations would be the identity).
> >> >>
> >> >> - Robert
>
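
For reference, the "everything is a logical type with a Representation"
idea at the end of the quoted thread could look roughly like the sketch
below. All class names, field names and URN strings here are made up for
illustration; none of them come from the Beam protos.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def identity(value: Any) -> Any:
    """Mapping used by primitive types: the value is its own representation."""
    return value


@dataclass(frozen=True)
class LogicalType:
    urn: str                 # hypothetical URN identifying the type
    representation: str      # what "TypeName" would be renamed to
    payload: bytes = b""     # typically empty for primitive types
    to_representation: Callable[[Any], Any] = identity
    from_representation: Callable[[Any], Any] = identity


# A primitive type: URN with an empty payload and an identity mapping.
INT64 = LogicalType(urn="example:type:int64:v1", representation="INT64")

# A non-primitive type layered on the same representation.
MILLIS_INSTANT = LogicalType(
    urn="example:type:millis_instant:v1",
    representation="INT64",
    to_representation=lambda dt: (dt - EPOCH) // timedelta(milliseconds=1),
    from_representation=lambda ms: EPOCH + timedelta(milliseconds=ms),
)
```

Under this sketch a runner only needs to understand the `representation`
to encode values (row-wise or columnar), while SDKs that know the URN can
apply the mapping functions on top.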
