Hi all,

We would like to propose adding support for annotations on Beam transforms. These annotations would be readable by the runner, which could then act on them, for example by doing some special resource allocation. Annotations (or hints, as they are sometimes called) have been discussed in the past ( https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E, https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E). This proposal aims to arrive at an accepted, lightweight solution, followed by a pull request that implements it in Go.

By annotations, we mean optional information (hints) provided to the runner. This proposal explicitly excludes “required” annotations, i.e., ones that would cause incorrect output if ignored. A runner that does not understand an annotation and ignores it must still produce correct output, perhaps with degraded performance or other nonfunctional properties. Supporting only “optional” annotations keeps pipelines compatible with runners that do not recognize them.

A good example of an optional annotation is marking a transform to run on a GPU or TPU, or declaring that it needs a certain amount of RAM. A runner that knows about this annotation can allocate the requested resources for that transform only, which improves performance and avoids using these scarce resources for other transforms. Other examples are marking a transform to run on secure hardware, or giving hints to profiling/dynamic analysis tools. In all these cases, the runner can run the pipeline with or without the annotation and produce the same output. Only nonfunctional properties (performance, security, ease of profiling) would differ, which is what makes the annotation optional.

A counter-example that this proposal explicitly excludes is marking a transform as requiring sorted input, for example a transform that expects time-sorted input in order to produce the correct output. If the runner ignored this requirement, it would risk producing incorrect output. To avoid this, we exclude such required annotations.

Implementation-wise, we propose to add a field:

- map<string, bytes> annotations = 8;

to the PTransform proto ( https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127). The key would be a URN that uniquely identifies the type of annotation.
The value is an opaque byte array (e.g., a serialized protocol buffer), which gives the implementation of each annotation type maximum flexibility.

We have a specific interest in adding this to the Go SDK. In Go, the user would specify the annotations for a ParDo with a structural DoFn by defining a field:

- Annotations map[string][]byte

and filling it out. For simplicity, we will only support structural DoFns in Go for the time being. Runners could then read the annotations from the PTransform proto and support whichever annotations they choose, in whatever way they want.

Please let me know what you think and what the best way to proceed would be: we can share a small design doc or, if there are no major objections, directly create a pull request for Go where we can discuss the implementation details.

Best,
Mirac and team
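
P.S. To make the Go side a bit more concrete, here is a rough, self-contained sketch of how a structural DoFn could carry annotations and how a runner might read one of them. It deliberately does not use the Beam SDK. The URN, the scaleFn type, the 4-byte big-endian encoding, and the encodeMB/minRAMMB helpers are all made up for illustration, and the actual design may well differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ramHintURN is a hypothetical URN identifying a "minimum RAM in MB" hint.
const ramHintURN = "beam:annotation:example:min_ram_mb:v1"

// scaleFn is an illustrative structural DoFn. The Annotations field is the
// one proposed above; everything else about this type is an assumption.
type scaleFn struct {
	Factor      int
	Annotations map[string][]byte
}

// ProcessElement does the actual work. Its result does not depend on the
// annotations, which is what makes them optional.
func (f *scaleFn) ProcessElement(x int) int { return x * f.Factor }

// encodeMB serializes the hint value. A real annotation would more likely
// use a serialized protocol buffer; 4 big-endian bytes keep the sketch short.
func encodeMB(mb uint32) []byte {
	b := make([]byte, 4)
	binary.BigEndian.PutUint32(b, mb)
	return b
}

// minRAMMB is what a runner that understands ramHintURN might do. It reports
// false when the hint is absent or malformed, so the caller can fall back to
// its default behavior. Unknown URNs in the map are simply never looked at.
func minRAMMB(annotations map[string][]byte) (uint32, bool) {
	v, ok := annotations[ramHintURN]
	if !ok || len(v) != 4 {
		return 0, false
	}
	return binary.BigEndian.Uint32(v), true
}

func main() {
	fn := &scaleFn{
		Factor: 2,
		Annotations: map[string][]byte{
			ramHintURN:                           encodeMB(4096),
			"beam:annotation:example:unknown:v1": []byte("ignored"),
		},
	}
	if mb, ok := minRAMMB(fn.Annotations); ok {
		fmt.Printf("runner would reserve %d MB for this transform\n", mb)
	}
	fmt.Println(fn.ProcessElement(21))
}
```

A runner that has never heard of either URN would skip the lookup entirely and still call ProcessElement with the same result, which is the compatibility property the proposal relies on.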