Ah thanks! I added some language there.

*From:* Kenneth Knowles <[email protected]>
*Date:* Thu, May 9, 2019 at 5:31 PM
*To:* dev

> *From:* Brian Hulette <[email protected]>
> *Date:* Thu, May 9, 2019 at 2:02 PM
> *To:* <[email protected]>
>
>> We briefly discussed using Arrow schemas in place of Beam schemas entirely in an Arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in Beam schemas. But given that large iterables aren't currently implemented, Beam schemas look very similar to Arrow schemas.
>>
>> I think it makes sense to take inspiration from Arrow schemas where possible, and maybe even copy them outright. Arrow already has a portable (flatbuffers) schema representation [2], and implementations for it in many languages that we may be able to re-use as we bring schemas to more SDKs (the project has Python and Go implementations). There are a couple of concepts in Arrow schemas that are specific to the format and wouldn't make sense for us (fields can indicate whether or not they are dictionary encoded, and the schema has an endianness field), but if you drop those concepts the Arrow spec looks pretty similar to the Beam proto spec.
>
> FWIW I left a blank section in the doc for filling out what the differences are and why, and conversely what the interop opportunities may be. Such sections are some of my favorite sections of design docs.
>
> Kenn
>
>> Brian
>>
>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>
>> *From:* Robert Bradshaw <[email protected]>
>> *Date:* Thu, May 9, 2019 at 1:38 PM
>> *To:* dev
>>
>>> From: Reuven Lax <[email protected]>
>>> Date: Thu, May 9, 2019 at 7:29 PM
>>> To: dev
>>>
>>>> Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like Arrow to encode batches of rows across portability. This doesn't affect data semantics of course, but having a richer, more-expressive type system opens up other opportunities.
>>>
>>> But we could do all of that with a RowCoder we understood to designate the type(s), right?
>>>
>>>> On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>>
>>>>> From: Reuven Lax <[email protected]>
>>>>> Date: Thu, May 9, 2019 at 4:52 PM
>>>>> To: dev
>>>>>
>>>>>> FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive, as you aren't forced to serialize into bytes).
>>>>>>
>>>>>> Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However, when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
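
To make Reuven's equivalence concrete: any Coder<T> can be read as a logical type whose base representation is bytes, so a PCollection<T> amounts to a single-field schema containing that logical type. Below is a minimal Java sketch of the idea; the LogicalType interface and the URN are hypothetical shorthand for illustration (only org.apache.beam.sdk.coders.Coder and its encode/decode methods are actual Beam API).

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;

    // Hypothetical shorthand for a schema logical type: a URN plus
    // conversions to and from a base representation. Not the Beam API.
    interface LogicalType<T, BaseT> {
      String urn();
      BaseT toBase(T value);
      T fromBase(BaseT base);
    }

    // Any Coder<T> induces a logical type whose base representation is
    // bytes, so a PCollection<T> with this coder is equivalent to the
    // single-field schema { value: logical(CODER_URN, base = BYTES) }.
    class CoderLogicalType<T> implements LogicalType<T, byte[]> {
      private static final String CODER_URN = "beam:logical_type:coder:v1"; // hypothetical

      private final Coder<T> coder;

      CoderLogicalType(Coder<T> coder) {
        this.coder = coder;
      }

      @Override
      public String urn() {
        return CODER_URN;
      }

      @Override
      public byte[] toBase(T value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
          coder.encode(value, out); // real Coder API: serialize to a stream
        } catch (IOException e) {
          throw new IllegalStateException(e);
        }
        return out.toByteArray();
      }

      @Override
      public T fromBase(byte[] base) {
        try {
          return coder.decode(new ByteArrayInputStream(base)); // real Coder API
        } catch (IOException e) {
          throw new IllegalStateException(e);
        }
      }
    }

    class CoderLogicalTypeDemo {
      public static void main(String[] args) {
        CoderLogicalType<String> t = new CoderLogicalType<>(StringUtf8Coder.of());
        System.out.println(t.fromBase(t.toBase("hello"))); // prints "hello"
      }
    }

The "slightly more expressive" point falls out of the types: a Coder's base is always byte[], while a logical type may pick any base representation.
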
>>>>>>
>>>>>> On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>
>>>>>>> From: Kenneth Knowles <[email protected]>
>>>>>>> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>> To: dev
>>>>>>>
>>>>>>>> This is a huge development. Top posting because I can be more compact.
>>>>>>>>
>>>>>>>> I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>>>>
>>>>>>> Thanks! Added some comments there.
>>>>>>>
>>>>>>>> *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
>>>>>>>
>>>>>>> As per the other discussion, I'm unsure the value of supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>>>>
>>>>>>>> *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
>>>>>>>
>>>>>>> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains? (There's also compactness in representation, encoded and in-memory, though I'm not sure that value is high.)
>>>>>>>
>>>>>>>> *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>>>>
>>>>>>> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>>>>
>>>>>>>> *Multimap*: what does it add over an array-valued map or a large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>
>>>>>>> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
>>>>>>>
>>>>>>>> *URN/enum for type names*: I see the case for both. The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is clearer, while a URN emphasizes the similarity of built-in and logical types.
>>>>>>>
>>>>>>> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
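
Circling back to Kenn's *Multiple timestamp resolutions* bullet above: a sketch of what nanos-as-a-logical-type could look like, mirroring how proto's google.protobuf.Timestamp pairs int64 seconds with int32 nanos. The NanosInstant class and its URN are hypothetical, for illustration only.

    import java.time.Instant;

    // Hypothetical sketch: nanosecond timestamps as a logical type over
    // existing types rather than a new primitive.
    final class NanosInstant {
      static final String URN = "beam:logical_type:nanos_instant:v1"; // hypothetical

      // Base representation mirroring proto's Timestamp:
      // an INT64 seconds field plus an INT32 nanos field.
      static final class Base {
        final long seconds; // seconds since the epoch
        final int nanos;    // sub-second nanoseconds, 0..999,999,999

        Base(long seconds, int nanos) {
          this.seconds = seconds;
          this.nanos = nanos;
        }
      }

      static Base toBase(Instant t) {
        return new Base(t.getEpochSecond(), t.getNano());
      }

      static Instant fromBase(Base b) {
        return Instant.ofEpochSecond(b.seconds, b.nanos);
      }

      public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(fromBase(toBase(now)).equals(now)); // true: round-trips at nanos
      }
    }

The appeal of this shape is that a component that doesn't recognize the URN can still pass the value through via its base representation; only code that opts into the logical type pays attention to the extra resolution.
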
>>>>>>>
>>>>>>> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also for interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to "Representation". There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>>>>
>>>>>>> - Robert
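
A sketch of the "everything is a logical type" framing Robert describes, where "TypeName" becomes a Representation: every type is a URN (plus optional payload) layered over a base type, and the primitives are simply the URNs that are their own representation. All names and URNs below are hypothetical illustrations, not the Beam proto spec.

    // Hypothetical model: every field type is a URN over a base type;
    // primitives map to their representation by the identity.
    final class FieldType {
      final String urn;     // e.g. "beam:type:int64:v1" (hypothetical)
      final byte[] payload; // typically empty for primitive types
      final FieldType base; // null: this type is its own representation

      FieldType(String urn, byte[] payload, FieldType base) {
        this.urn = urn;
        this.payload = payload;
        this.base = base;
      }

      // A primitive: a URN with an empty payload representing itself.
      static FieldType primitive(String urn) {
        return new FieldType(urn, new byte[0], null);
      }

      // A logical type: a URN layered over another type's representation.
      static FieldType logical(String urn, byte[] payload, FieldType base) {
        return new FieldType(urn, payload, base);
      }

      // Walk down to the underlying representation (identity for primitives).
      FieldType representation() {
        return base == null ? this : base.representation();
      }
    }

    class RepresentationDemo {
      public static void main(String[] args) {
        FieldType int64 = FieldType.primitive("beam:type:int64:v1");
        FieldType micros =
            FieldType.logical("beam:logical:timestamp_micros:v1", new byte[0], int64);
        // Both bottom out at the same representation, so a columnar encoder
        // (e.g. Arrow) only needs to understand the small primitive set.
        System.out.println(micros.representation().urn); // beam:type:int64:v1
      }
    }

Under this model the primitive/logical split stops being two tiers of types and becomes one mechanism, which seems to be the equal footing Robert is after.
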
