Ah thanks! I added some language there.

*From:* Kenneth Knowles <[email protected]>
*Date:* Thu, May 9, 2019 at 5:31 PM
*To:* dev

> *From:* Brian Hulette <[email protected]>
> *Date:* Thu, May 9, 2019 at 2:02 PM
> *To:* <[email protected]>
>
>> We briefly discussed using Arrow schemas in place of Beam schemas entirely in an Arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in Beam schemas. But given that large iterables aren't currently implemented, Beam schemas look very similar to Arrow schemas.
>>
>> I think it makes sense to take inspiration from Arrow schemas where possible, and maybe even copy them outright. Arrow already has a portable (flatbuffers) schema representation [2], and implementations for it in many languages that we may be able to re-use as we bring schemas to more SDKs (the project has Python and Go implementations). There are a couple of concepts in Arrow schemas that are specific to the format and wouldn't make sense for us (fields can indicate whether or not they are dictionary encoded, and the schema has an endianness field), but if you drop those concepts the Arrow spec looks pretty similar to the Beam proto spec.
>
> FWIW I left a blank section in the doc for filling out what the differences are and why, and conversely what the interop opportunities may be. Such sections are some of my favorite sections of design docs.
>
> Kenn
>
>> Brian
>>
>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>
>> *From:* Robert Bradshaw <[email protected]>
>> *Date:* Thu, May 9, 2019 at 1:38 PM
>> *To:* dev
>>
>>> From: Reuven Lax <[email protected]>
>>> Date: Thu, May 9, 2019 at 7:29 PM
>>> To: dev
>>>
>>>> Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like Arrow to encode batches of rows across portability. This doesn't affect data semantics of course, but having a richer, more-expressive type system opens up other opportunities.
>>>
>>> But we could do all of that with a RowCoder we understood to designate the type(s), right?
>>>
>>>> On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>>
>>>>> From: Reuven Lax <[email protected]>
>>>>> Date: Thu, May 9, 2019 at 4:52 PM
>>>>> To: dev
>>>>>
>>>>>> FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive, as you aren't forced to serialize into bytes).
>>>>>>
>>>>>> Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However, when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
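
To make Reuven's equivalence concrete: any Coder<T> can be read as a logical type whose base representation is bytes, so a PCollection<T> amounts to a single-field schema containing that logical type. Below is a minimal Java sketch of the idea; the LogicalType interface and the URN are hypothetical shorthand for illustration (only org.apache.beam.sdk.coders.Coder and its encode/decode methods are actual Beam API).

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;

    // Hypothetical shorthand for a schema logical type: a URN plus
    // conversions to and from a base representation. Not the Beam API.
    interface LogicalType<T, BaseT> {
      String urn();
      BaseT toBase(T value);
      T fromBase(BaseT base);
    }

    // Any Coder<T> induces a logical type whose base representation is
    // bytes, so a PCollection<T> with this coder is equivalent to the
    // single-field schema { value: logical(CODER_URN, base = BYTES) }.
    class CoderLogicalType<T> implements LogicalType<T, byte[]> {
      private static final String CODER_URN = "beam:logical_type:coder:v1"; // hypothetical

      private final Coder<T> coder;

      CoderLogicalType(Coder<T> coder) {
        this.coder = coder;
      }

      @Override
      public String urn() {
        return CODER_URN;
      }

      @Override
      public byte[] toBase(T value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
          coder.encode(value, out); // real Coder API: serialize to a stream
        } catch (IOException e) {
          throw new IllegalStateException(e);
        }
        return out.toByteArray();
      }

      @Override
      public T fromBase(byte[] base) {
        try {
          return coder.decode(new ByteArrayInputStream(base)); // real Coder API
        } catch (IOException e) {
          throw new IllegalStateException(e);
        }
      }
    }

    class CoderLogicalTypeDemo {
      public static void main(String[] args) {
        CoderLogicalType<String> t = new CoderLogicalType<>(StringUtf8Coder.of());
        System.out.println(t.fromBase(t.toBase("hello"))); // prints "hello"
      }
    }

The "slightly more expressive" point falls out of the types: a Coder's base is always byte[], while a logical type may pick any base representation.
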
>>>>>>
>>>>>> On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>
>>>>>>> From: Kenneth Knowles <[email protected]>
>>>>>>> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>> To: dev
>>>>>>>
>>>>>>>> This is a huge development. Top posting because I can be more compact.
>>>>>>>>
>>>>>>>> I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>>>>
>>>>>>> Thanks! Added some comments there.
>>>>>>>
>>>>>>>> *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
>>>>>>>
>>>>>>> As per the other discussion, I'm unsure the value of supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>>>>
>>>>>>>> *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
>>>>>>>
>>>>>>> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains? (There's also compactness in representation, encoded and in-memory, though I'm not sure that value is high.)
>>>>>>>
>>>>>>>> *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>>>>
>>>>>>> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>>>>
>>>>>>>> *Multimap*: what does it add over an array-valued map or a large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>
>>>>>>> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
>>>>>>>
>>>>>>>> *URN/enum for type names*: I see the case for both. The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is clearer, while a URN emphasizes the similarity of built-in and logical types.
>>>>>>>
>>>>>>> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
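
Circling back to Kenn's *Multiple timestamp resolutions* bullet above: a sketch of what nanos-as-a-logical-type could look like, mirroring how proto's google.protobuf.Timestamp pairs int64 seconds with int32 nanos. The NanosInstant class and its URN are hypothetical, for illustration only.

    import java.time.Instant;

    // Hypothetical sketch: nanosecond timestamps as a logical type over
    // existing types rather than a new primitive.
    final class NanosInstant {
      static final String URN = "beam:logical_type:nanos_instant:v1"; // hypothetical

      // Base representation mirroring proto's Timestamp:
      // an INT64 seconds field plus an INT32 nanos field.
      static final class Base {
        final long seconds; // seconds since the epoch
        final int nanos;    // sub-second nanoseconds, 0..999,999,999

        Base(long seconds, int nanos) {
          this.seconds = seconds;
          this.nanos = nanos;
        }
      }

      static Base toBase(Instant t) {
        return new Base(t.getEpochSecond(), t.getNano());
      }

      static Instant fromBase(Base b) {
        return Instant.ofEpochSecond(b.seconds, b.nanos);
      }

      public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(fromBase(toBase(now)).equals(now)); // true: round-trips at nanos
      }
    }

The appeal of this shape is that a component that doesn't recognize the URN can still pass the value through via its base representation; only code that opts into the logical type pays attention to the extra resolution.
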
>>>>>>>
>>>>>>> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also for interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to "Representation". There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>>>>
>>>>>>> - Robert
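
A sketch of the "everything is a logical type" framing Robert describes, where "TypeName" becomes a Representation: every type is a URN (plus optional payload) layered over a base type, and the primitives are simply the URNs that are their own representation. All names and URNs below are hypothetical illustrations, not the Beam proto spec.

    // Hypothetical model: every field type is a URN over a base type;
    // primitives map to their representation by the identity.
    final class FieldType {
      final String urn;     // e.g. "beam:type:int64:v1" (hypothetical)
      final byte[] payload; // typically empty for primitive types
      final FieldType base; // null: this type is its own representation

      FieldType(String urn, byte[] payload, FieldType base) {
        this.urn = urn;
        this.payload = payload;
        this.base = base;
      }

      // A primitive: a URN with an empty payload representing itself.
      static FieldType primitive(String urn) {
        return new FieldType(urn, new byte[0], null);
      }

      // A logical type: a URN layered over another type's representation.
      static FieldType logical(String urn, byte[] payload, FieldType base) {
        return new FieldType(urn, payload, base);
      }

      // Walk down to the underlying representation (identity for primitives).
      FieldType representation() {
        return base == null ? this : base.representation();
      }
    }

    class RepresentationDemo {
      public static void main(String[] args) {
        FieldType int64 = FieldType.primitive("beam:type:int64:v1");
        FieldType micros =
            FieldType.logical("beam:logical:timestamp_micros:v1", new byte[0], int64);
        // Both bottom out at the same representation, so a columnar encoder
        // (e.g. Arrow) only needs to understand the small primitive set.
        System.out.println(micros.representation().urn); // beam:type:int64:v1
      }
    }

Under this model the primitive/logical split stops being two tiers of types and becomes one mechanism, which seems to be the equal footing Robert is after.
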
