tl;dr: I have a PR at [1] that defines an initial Schema API in Python based on the typing module, and uses typing.NamedTuple to represent a Schema. There are some risks with that approach, but I propose we move forward with it as a first draft and iterate.
I've opened up a PR [1] that implements RowCoder in the Python SDK and verifies its compatibility with the Java implementation via tests in standard_coders.yaml. A lot of miscellaneous changes are required to get to that point, including a pretty significant one: providing some native Python representation for schemas.

As discussed in the PR description, I opted to fully embrace the typing module for the native representation of schema types:
- Primitive types all map to numpy types (e.g. np.int16, np.unicode).
- Arrays map to typing.List. In https://s.apache.org/beam-schemas we settled on typing.Collection, but unfortunately this doesn't seem to be supported in Python 2. I'm open to other suggestions here.
- Maps map to typing.Mapping.
- Rows map to typing.NamedTuple.
- Nullability is indicated with typing.Optional. Note there's no distinction between Optional[Optional[T]] and Optional[T] in typing (both map to Union[T, None]), so this is actually a good analog for the nullable flag on FieldType in schema.proto.

With this approach a schema in Python might look like:
```
from typing import NamedTuple, Optional

import numpy as np
from apache_beam import coders

class Movie(NamedTuple):
  name: np.unicode
  year: Optional[np.int16]

# The class/type annotation syntax doesn't work in Python 2. Instead you can use:
# Movie = NamedTuple('Movie', [('name', np.unicode), ('year', Optional[np.int16])])

# DoFns annotated with_output_types(Movie) will use RowCoder
coders.registry.register_coder(Movie, coders.RowCoder)
```

I think the choice to use typing.NamedTuple as a row type is potentially controversial. Udi, Robert Bradshaw and I were already discussing it a bit in a comment on the portable schemas doc [2], but I wanted to bring that discussion to the ML.

On the pro side:
+ NamedTuple is a pretty great analog for Java's Row type [3]. Both store attributes internally as an ordered collection (List<Object> in Row, a tuple in NamedTuple) and provide shortcuts for accessing those attributes by field name based on the schema.
+ NamedTuple is a native type, and we're trying to get out of the business of defining our own type hints (I think).

On the con side:
- When using the class-based version of NamedTuple in Python 3, a user might be tempted to add more functionality to their class (for example, define a method) rather than just defining a schema, but I'm not sure we're prepared to guarantee that we will always produce an instance of their class, just something that has the defined attributes. This concern can potentially be alleviated once we have support for logical types.

Unless there are any objections, I think it would make sense to start with this implementation (documenting the limitations), and then iterate on it. Please take a look at the PR [1] and let me know what you think about this proposal.

Thanks,
Brian

[1] https://github.com/apache/beam/pull/9188
[2] https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
[3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
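
P.S. For anyone who wants to poke at the two typing behaviors this proposal leans on (nested Optionals collapsing into one, and NamedTuple being a plain tuple with name-based accessors), here is a minimal sketch. It is Python 3 only and doesn't touch Beam or the PR's code at all. To keep it free of a NumPy dependency I've assumed built-in str and int as stand-ins for np.unicode and np.int16, and the sample values are made up.

```python
from typing import NamedTuple, Optional, Union, get_type_hints


class Movie(NamedTuple):
    # Assumption: built-in str/int stand in for np.unicode/np.int16
    # so the sketch runs without NumPy installed.
    name: str
    year: Optional[int]


# Nested Optionals collapse, so nullability behaves like a single flag.
assert Optional[Optional[int]] == Optional[int]
assert Optional[int] == Union[int, None]

movie = Movie(name="Alien", year=None)

# A NamedTuple instance is an ordinary tuple with name-based accessors.
assert isinstance(movie, tuple)
assert movie[0] == movie.name == "Alien"
assert movie.year is None

# Field names (in declaration order) and field types can be read
# back from the class itself.
field_types = get_type_hints(Movie)
print(Movie._fields)
print(field_types["year"] == Optional[int])
```

The last few lines are the part that matters for schemas: the ordered field names and their types are recoverable from the class alone, which is the kind of information a schema would need to be derived from it.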