There has been a lot of conversation about schemas on PCollections
recently. There are a number of reasons for this. Schemas as first-class
objects in Beam provide a nice base for building BeamSQL. Spark has
provided schema support via DataFrames for over two years, and it has
proved to be very popular among Spark users; it turns out that FlumeJava -
the original inspiration for the Beam API - has had schema support for even
longer, though this feature was not included in the Beam (at that time
Dataflow) API. In practice, most records have structure, and allowing the
system to understand that structure can both simplify usage of the system
and allow for new performance optimizations.
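To make that concrete, here is a rough sketch (in Java) of what declaring a
record's structure up front might look like, so the system can reason about
individual fields rather than opaque encoded blobs. The class and builder
names below are illustrative assumptions for the sake of the example, not
the API the proposal defines:

    import org.apache.beam.sdk.schemas.Schema;

    public class SchemaSketch {
      public static void main(String[] args) {
        // Illustrative only: declare the shape of a record once, so that
        // a runner or BeamSQL can see named, typed fields instead of an
        // opaque byte blob. Names here are assumptions, not proposed API.
        Schema purchaseSchema =
            Schema.builder()
                .addStringField("userId")
                .addInt32Field("itemCount")
                .addDoubleField("amount")
                .build();
        System.out.println(purchaseSchema);
      }
    }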

After discussion with JB, Eugene, Kenn, Robert, and a number of others on
the list, I've started a proposal document here
<https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing>
describing how schemas can be added to Beam in a manner that integrates
with the existing Beam API. The goal is not to blindly copy existing
systems that have schemas, but rather to ensure that we get the best fit
for Beam.
Please comment on this proposal - as much feedback as possible is valuable.

In addition, you may notice this document is incomplete. While it does
sketch out how schemas can fit into Beam semantically, many portions of
this design remain to be fleshed out. In particular, the API signatures are
only sketched at a high level; exactly what these APIs will look like has
not yet been defined. I would welcome help from interested members
of the community to define these APIs, and to make sure we're covering all
relevant use cases.

Thanks all,

Reuven
