There has been a lot of conversation about schemas on PCollections recently, for a number of reasons. Schemas as first-class objects in Beam provide a nice base for building BeamSQL. Spark has provided schema support via DataFrames for over two years, and it has proved very popular among Spark users; FlumeJava - the original inspiration for the Beam API - has had schema support for even longer, though that feature was not included in the Beam (at that time Dataflow) API. Most records have structure, and allowing the system to understand record structure can both simplify usage of the system and enable new performance optimizations.
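
To make that motivation concrete, here is a minimal sketch of the kind of hand-written field extraction that schema awareness could make declarative. This is not the proposed API; the Purchase POJO is a hypothetical example type, and only existing core Beam transforms are used:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class FieldExtractionExample {
  // Hypothetical record type; in practice this is any user-defined POJO.
  public static class Purchase implements java.io.Serializable {
    public String userId;
    public double amount;

    public Purchase(String userId, double amount) {
      this.userId = userId;
      this.amount = amount;
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<Purchase> purchases =
        p.apply(
            Create.of(new Purchase("alice", 12.50), new Purchase("bob", 3.99))
                .withCoder(SerializableCoder.of(Purchase.class)));

    // Today the runner sees Purchase as an opaque blob, so even a simple
    // projection of one field must be written (and maintained) by the user.
    // If the PCollection carried a schema, the system would know the field
    // names and types and could express and optimize projections like this
    // declaratively.
    PCollection<String> userIds =
        purchases.apply(
            MapElements.via(
                new SimpleFunction<Purchase, String>() {
                  @Override
                  public String apply(Purchase purchase) {
                    return purchase.userId;
                  }
                }));

    p.run().waitUntilFinish();
  }
}

Today all of that structure lives only in user code; with schemas, the system itself would know that a Purchase has a userId and an amount.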
After discussion with JB, Eugene, Kenn, Robert, and a number of others on the list, I've started a proposal document here <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing> describing how schemas can be added to Beam in a manner that integrates with the existing Beam API. The goal is not to blindly copy existing systems that have schemas, but rather to ensure that we get the best fit for Beam. Please comment on this proposal - as much feedback as possible is valuable. You may also notice that the document is incomplete: while it sketches out how schemas can fit into Beam semantically, many portions of the design remain to be fleshed out. In particular, the API signatures are only sketched at a high level; exactly what all of these APIs will look like has not yet been defined. I would welcome help from interested members of the community in defining these APIs and making sure we cover all relevant use cases.

Thanks all,
Reuven