[PROPOSAL] An initial Schema API in Python

Brian Hulette Wed, 31 Jul 2019 14:51:41 -0700

tl;dr: I have a PR at [1] that defines an initial Schema API in Python
based on the typing module, and uses typing.NamedTuple to represent a
Schema. There are some risks with that approach but I propose we move
forward with it as a first draft and iterate.



I've opened up a PR [1] that implements RowCoder in the Python SDK and
verifies its compatibility with the Java implementation via tests in
standard_coders.yaml. A lot of miscellaneous changes are required to get
to that point, including a pretty significant one: providing some native
Python representation for schemas.

As discussed in the PR description I opted to fully embrace the typing
module for the native representation of schema types:
- Primitive types all map to numpy types (e.g. np.int16, np.unicode).
- Arrays map to typing.List. In https://s.apache.org/beam-schemas we
settled on typing.Collection, but unfortunately this doesn't seem to be
supported in Python 2. I'm open to other suggestions here.
- Map maps to typing.Mapping.
- Rows map to typing.NamedTuple.
- Nullability is indicated with typing.Optional. Note there's no
distinction between Optional[Optional[T]] and Optional[T] in typing; both
map to Union[T, None], so this is actually a good analog for the nullable
flag on FieldType in schema.proto.
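
The Optional flattening described in the last bullet can be checked directly
with the typing module. This is only an illustrative sketch, not code from the
PR: the built-in int stands in for a numpy type, and is_nullable is a
hypothetical helper name.

```python
from typing import Optional, Union, get_args, get_origin

# Nested Optionals collapse, so a type hint carries a single "nullable" level.
assert Optional[Optional[int]] == Optional[int]
assert Optional[int] == Union[int, None]


def is_nullable(type_hint):
    """Hypothetical helper: True if the hint is a Union that includes None."""
    return get_origin(type_hint) is Union and type(None) in get_args(type_hint)


nullable = is_nullable(Optional[int])
not_nullable = is_nullable(int)
```

Because the nesting collapses, a single boolean per field is enough to
round-trip the hint, which is how the nullable flag behaves.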

With this approach a schema in Python might look like:
```
import numpy as np
from typing import NamedTuple, Optional
from apache_beam import coders

class Movie(NamedTuple):
  name: np.unicode
  year: Optional[np.int16]

# The class/type annotation syntax doesn't work in Python 2. Instead you can use:
# Movie = NamedTuple('Movie', [('name', np.unicode), ('year', Optional[np.int16])])

# DoFns annotated with_output_types(Movie) will use RowCoder
coders.registry.register_coder(Movie, coders.RowCoder)
```

I think the choice to use typing.NamedTuple as a row type is potentially
controversial - Udi, Robert Bradshaw and I were already discussing it a bit
in a comment on the portable schemas doc [2], but I wanted to bring that
discussion to the ML.

On the pro side:
+ NamedTuple is a pretty great analog for Java's Row type [3]. Both store
attributes internally as an ordered collection (List<Object> in Row, a
tuple in NamedTuple) and provide shortcuts for accessing those attributes
by field name based on the schema.
+ NamedTuple is a native type, and we're trying to get out of the business
of defining our own type hints (I think).
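
The first point can be seen with plain Python. A minimal sketch, assuming
built-in str and int in place of the numpy types used in the proposal:

```python
from typing import NamedTuple, Optional


class Movie(NamedTuple):
    name: str            # stands in for np.unicode
    year: Optional[int]  # stands in for Optional[np.int16]


movie = Movie(name='Alien', year=1979)

# Values are stored positionally, like the List<Object> inside Java's Row...
by_index = movie[0]
# ...and are also reachable by field name.
by_name = movie.name
# The ordered field names are available on the class itself.
field_names = Movie._fields
```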

On the con side:
- When using the class-based version of NamedTuple in Python 3, a user might
be tempted to add more functionality to their class (for example, define a
method) rather than just defining a schema. But I'm not sure we're
prepared to guarantee that we will always produce an instance of their
class, rather than just something that has the defined attributes. This
concern can potentially be alleviated once we have support for logical types.
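
To make the concern concrete, here is a hypothetical sketch (not how the PR
behaves, just one way the situation could arise): a row type rebuilt from field
names and types alone keeps the attributes but loses any user-defined method.
The decade method and the Rebuilt name are made up for illustration, and
built-in types stand in for numpy types.

```python
from typing import NamedTuple, Optional, get_type_hints


class Movie(NamedTuple):
    name: str
    year: Optional[int]

    def decade(self):
        # Extra behavior a user might be tempted to add to their schema class.
        return (self.year // 10) * 10


original = Movie('Alien', 1979)

# Assumption: the SDK only knows the schema (ordered field names and types),
# so it reconstructs an equivalent NamedTuple rather than the user's class.
hints = get_type_hints(Movie)
Rebuilt = NamedTuple('Movie', [(f, hints[f]) for f in Movie._fields])
row = Rebuilt(*original)

same_values = row == original          # plain tuple equality still holds
has_method = hasattr(row, 'decade')    # the user's method is gone
```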

Unless there are any objections I think it would make sense to start with
this implementation (documenting the limitations), and then iterate on it.
Please take a look at the PR [1] and let me know what you think about this
proposal.

Thanks,
Brian

[1] https://github.com/apache/beam/pull/9188
[2] https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
[3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
