PTransform Annotations Proposal

Mirac Vuslat Basaran Thu, 12 Nov 2020 09:42:10 -0800

Hi all,

We would like to propose adding support for annotations on Beam
transforms. These annotations would be readable by the runner, which
could then act on them, for example by allocating special resources.
There have been discussions around annotations (or hints, as they are
sometimes called) in the past (
https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
This proposal aims to come up with an accepted lightweight solution with a
follow-up Pull Request to implement it in Go.

By annotations, we refer to optional information / hints provided to the
runner. This proposal explicitly excludes “required” annotations that could
cause incorrect output. A runner that does not understand the annotations
and ignores them must still produce correct output, with perhaps a
degradation in performance or other nonfunctional requirements. Supporting
only “optional” annotations allows for compatibility with runners that do
not recognize those annotations.

A good example of an optional annotation is marking a transform to run on
a GPU or TPU, or marking that it needs a certain amount of RAM. A runner
that knows about this annotation can allocate the requested resources to
that transform only, which improves its performance and avoids spending
these scarce resources on other transforms.

Other examples of optional annotations are marking a transform to run on
secure hardware, or giving hints to profiling and dynamic analysis tools.

In all these cases, the runner can run the pipeline with or without the
annotation, and in both cases the same output would be produced. Only
nonfunctional properties (performance, security, ease of profiling) would
differ, which is why these annotations are optional.

A counter-example that this proposal explicitly excludes is marking a
transform as requiring sorted input, for example a transform that expects
time-sorted input in order to produce the correct output. A runner that
ignores this requirement would risk producing incorrect output. To avoid
this, we exclude such required annotations.

Implementation-wise, we propose to add a field:
- map<string, bytes> annotations = 8;
to the PTransform proto (
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
The key would be a URN that uniquely identifies the type of annotation. The
value would be an opaque byte array (e.g., a serialized protocol buffer),
which gives the implementation of each annotation type maximum flexibility
in how it encodes its data.
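
To illustrate the URN-to-bytes shape described above, here is a minimal Go
sketch. The URN string and the 8-byte big-endian encoding are hypothetical
and chosen only to keep the example free of dependencies; the proposal does
not define any concrete annotation type, and a real one might use a
serialized protocol buffer instead.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// minRAMURN is a hypothetical URN used only for illustration.
const minRAMURN = "beam:annotation:example:min_ram_bytes:v1"

// encodeMinRAM serializes a RAM requirement (in bytes) as 8 big-endian bytes.
// The encoding is an assumption; the payload is opaque to the proto.
func encodeMinRAM(n uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, n)
	return b
}

// decodeMinRAM is the inverse of encodeMinRAM, as a runner might use it.
func decodeMinRAM(b []byte) (uint64, error) {
	if len(b) != 8 {
		return 0, fmt.Errorf("min RAM payload: got %d bytes, want 8", len(b))
	}
	return binary.BigEndian.Uint64(b), nil
}

func main() {
	// Same shape as the proposed proto field: map<string, bytes>.
	annotations := map[string][]byte{
		minRAMURN: encodeMinRAM(16 << 30), // 16 GiB
	}

	n, err := decodeMinRAM(annotations[minRAMURN])
	if err != nil {
		panic(err)
	}
	fmt.Println("requested RAM in bytes:", n)
}
```

Because the value is just bytes, each annotation type is free to pick its
own encoding, as long as the SDK and the runners that support that URN agree
on it.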

We have a specific interest in adding this to the Go SDK. In Go, the user
would specify the annotations for a ParDo by defining a field on the
structural DoFn:
- Annotations map[string][]byte
and filling it out. For simplicity, we will only support structural DoFns
in Go for the time being.
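
As a rough sketch of what this could look like from the user's side, the
following defines a struct-based DoFn with the proposed field and shows one
way an SDK might discover it. The names upperFn and annotationsOf, the URN,
and the use of reflection are all assumptions made for illustration; this is
not the Beam Go SDK's actual mechanism, and the example does not import Beam.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// upperFn is a hypothetical structural DoFn carrying the proposed field.
type upperFn struct {
	Annotations map[string][]byte
}

// ProcessElement upper-cases its input; annotations do not affect the output.
func (f *upperFn) ProcessElement(s string) string {
	return strings.ToUpper(s)
}

// annotationsOf returns the Annotations field of a struct (or pointer to
// struct) if it exists with type map[string][]byte, and nil otherwise.
// This lookup is an assumed approach, not the SDK's real implementation.
func annotationsOf(fn interface{}) map[string][]byte {
	v := reflect.ValueOf(fn)
	if v.Kind() == reflect.Ptr {
		if v.IsNil() {
			return nil
		}
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return nil
	}
	f := v.FieldByName("Annotations")
	if !f.IsValid() || !f.CanInterface() {
		return nil
	}
	m, _ := f.Interface().(map[string][]byte)
	return m
}

func main() {
	fn := &upperFn{
		Annotations: map[string][]byte{
			// Hypothetical URN and payload.
			"beam:annotation:example:accelerator:v1": []byte("gpu"),
		},
	}
	for urn, payload := range annotationsOf(fn) {
		fmt.Printf("%s = %q\n", urn, payload)
	}
}
```

A DoFn without the field simply yields no annotations, so existing
pipelines would be unaffected.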

Runners could then read the annotations from the PTransform proto and
support whichever ones they choose, in whatever way suits them.
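
On the runner side, the "optional" contract could be sketched as a lookup
table keyed by URN, where anything unrecognized is skipped rather than
treated as an error. The handler type, applyAnnotations, and the URNs below
are hypothetical, and skipping malformed payloads is an assumption of this
sketch rather than something the proposal specifies.

```go
package main

import (
	"fmt"
	"sort"
)

// acceleratorURN is a hypothetical URN used only for illustration.
const acceleratorURN = "beam:annotation:example:accelerator:v1"

// handler acts on one annotation payload, e.g. by reserving resources.
type handler func(payload []byte) error

// applyAnnotations runs the handler for each URN the runner recognizes.
// Unknown URNs and payloads a handler rejects are ignored, since optional
// annotations must never change the pipeline's output.
func applyAnnotations(annotations map[string][]byte, handlers map[string]handler) (applied, ignored []string) {
	urns := make([]string, 0, len(annotations))
	for urn := range annotations {
		urns = append(urns, urn)
	}
	sort.Strings(urns) // deterministic order for reporting

	for _, urn := range urns {
		h, ok := handlers[urn]
		if !ok {
			ignored = append(ignored, urn)
			continue
		}
		if err := h(annotations[urn]); err != nil {
			ignored = append(ignored, urn)
			continue
		}
		applied = append(applied, urn)
	}
	return applied, ignored
}

func main() {
	handlers := map[string]handler{
		acceleratorURN: func(p []byte) error {
			if len(p) == 0 {
				return fmt.Errorf("empty accelerator payload")
			}
			return nil
		},
	}
	annotations := map[string][]byte{
		acceleratorURN:                       []byte("gpu"),
		"beam:annotation:example:unknown:v1": []byte("x"),
	}
	applied, ignored := applyAnnotations(annotations, handlers)
	fmt.Println("applied:", applied)
	fmt.Println("ignored:", ignored)
}
```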

Please let me know what you think, and what would be the best way to
proceed, e.g., we can share a small design doc or, in case there are no
major objections, directly create a pull request for Go where we can
discuss the implementation details.

Best,
Mirac and team
