Rocking, I'll start leaving some comments on this. I'm excited to see work being done in this area as well :)
On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau <[email protected]> wrote:

> On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax <[email protected]> wrote:
>
>> There has been a lot of conversation about schemas on PCollections
>> recently. There are a number of reasons for this. Schemas as first-class
>> objects in Beam provide a nice base for building BeamSQL. Spark has
>> provided schema support via DataFrames for over two years, and it has
>> proved to be very popular among Spark users; it turns out that FlumeJava -
>> the original inspiration for the Beam API - has had schema support for even
>> longer, though this feature was not included in the Beam (at that time
>> Dataflow) API. It turns out that most records have structure, and allowing
>> the system to understand record structure can both simplify usage of the
>> system and allow for new performance optimizations.
>>
>> After discussion with JB, Eugene, Kenn, Robert, and a number of others on
>> the list, I've started a proposal document here
>> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing>
>> describing how schemas can be added to Beam in a manner that integrates
>> with the existing Beam API. The goal is not to blindly copy existing systems
>> that have schemas, but rather to ensure that we get the best fit for Beam.
>> Please comment on this proposal - as much feedback as possible is valuable.
>>
>> In addition, you may notice this document is incomplete. While it does
>> sketch out how schemas can fit into Beam semantically, many portions of
>> this design remain to be fleshed out. In particular, the API signatures are
>> only sketched at a high level; exactly what all these APIs will look
>> like has not yet been defined. I would welcome help from interested members
>> of the community to define these APIs, and to make sure we're covering all
>> relevant use cases.
>>
>
> Thanks for sharing this, Reuven, I'm excited to see this being discussed.
> One global comment: all of the existing examples are in Java. It would be
> great if we could design this with Python in mind (and how it could
> interact cleanly with Pandas) at the same time. +Robert Bradshaw
> <[email protected]> , +Holden Karau <[email protected]> , and +Ahmet
> Altay <[email protected]> , all of whom I've spoken with regarding this and
> other Python things recently, just to be sure they see it. But of course
> it'd be great if anyone working on Python could jump in.
>
> -Tyler
>
>>
>> Thanks all,
>>
>> Reuven
>>

--
Twitter: https://twitter.com/holdenkarau
