Rocking, I'll start leaving some comments on this. I'm excited to see work
being done in this area as well :)

On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau <[email protected]> wrote:

> On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax <[email protected]> wrote:
>
>> There has been a lot of conversation about schemas on PCollections
>> recently. There are a number of reasons for this. Schemas as first-class
>> objects in Beam provide a nice base for building BeamSQL. Spark has
>> provided schema support via DataFrames for over two years, and it has
>> proved to be very popular among Spark users; it turns out that FlumeJava -
>> the original inspiration for the Beam API - has had schema support for even
>> longer, though this feature was not included in the Beam (at that time
>> Dataflow) API. It turns out that most records have structure, and allowing
>> the system to understand record structure can both simplify usage of the
>> system and allow for new performance optimizations.
>>
>> After discussion with JB, Eugene, Kenn, Robert, and a number of others on
>> the list, I've started a proposal document here
>> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing>
>> describing how schemas can be added to Beam in a manner that integrates
>> with the existing Beam API. The goal is not to blindly copy existing systems
>> that have schemas, but rather to ensure that we get the best fit for Beam.
>> Please comment on this proposal - as much feedback as possible is valuable.
>>
>> In addition, you may notice this document is incomplete. While it does
>> sketch out how schemas can fit into Beam semantically, many portions of
>> this design remain to be fleshed out. In particular, the API signatures are
>> only sketched at a high level; exactly what all these APIs will look
>> like has not yet been defined. I would welcome help from interested members
>> of the community to define these APIs, and to make sure we're covering all
>> relevant use cases.
>>
>
> Thanks for sharing this, Reuven. I'm excited to see this being discussed.
> One global comment: all of the existing examples are in Java. It would be
> great if we could design this with Python in mind (and how it could
> interact cleanly with Pandas) at the same time. +Robert Bradshaw
> <[email protected]>, +Holden Karau <[email protected]>, and +Ahmet
> Altay <[email protected]>, all of whom I've spoken with regarding this and
> other Python things recently, just to be sure they see it. But of course
> it'd be great if anyone working on Python could jump in.
>
> -Tyler
>
>
>
>>
>> Thanks all,
>>
>> Reuven
>>
>>
>>


-- 
Twitter: https://twitter.com/holdenkarau
