Re: PTransform Annotations Proposal

Robert Burke Thu, 12 Nov 2020 13:58:10 -0800

(Disclaimer: Mirac and their team approached me about this beforehand, as
their interest is in the Go SDK.)


+1 I think it's a good idea. As you've pointed out, there are many
opportunities for optional pipeline analysis here as well.

A strawman counterpoint would be to reuse the static DisplayData for
this kind of thing, but I think that's not necessarily the same thing. It's
very hard to get something that's purely intended for human consumption to
also be suitable for machine consumption without various adapters and
such, and it would be an awful hack. Having something specifically for
machines to understand is valuable in and of itself.

I appreciate the versatility of simply using known URNs and their defined
formats, and especially keeping the proposal to optional annotations that
don't affect correctness. This will work well with most DoFns that need
specialized hardware. They can usually be emulated on ordinary CPUs, which
is good for testing, but can perform much better if the hardware is
available. This also allows the runners to move execution of specific DoFns
to the machines with the specialized hardware, for better scheduling of
resources.
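
To make the scheduling point concrete, here is a rough sketch of how a
runner might consult such annotations. Everything in it is an assumption
for illustration: the URN string, the transform struct (a stand-in for the
PTransform proto with the proposed map field), and the workerPool helper
are all hypothetical and not part of Beam.

```go
package main

import "fmt"

// gpuURN is a made-up URN used only for this sketch; the proposal does not
// define any concrete annotation URNs.
const gpuURN = "example:annotation:gpu_required:v1"

// transform is a hypothetical stand-in for the PTransform proto, carrying
// the proposed map<string, bytes> annotations field.
type transform struct {
	Name        string
	Annotations map[string][]byte
}

// workerPool chooses where to run a transform. Annotations are optional
// hints: an unknown URN, or a known one that cannot be honoured, falls back
// to ordinary CPU workers, so the output stays correct either way.
func workerPool(t transform, gpuAvailable bool) string {
	if _, ok := t.Annotations[gpuURN]; ok && gpuAvailable {
		return "gpu"
	}
	return "cpu"
}

func main() {
	infer := transform{
		Name:        "RunInference",
		Annotations: map[string][]byte{gpuURN: nil},
	}
	parse := transform{Name: "ParseInput"}

	fmt.Println(infer.Name, "->", workerPool(infer, true))
	fmt.Println(infer.Name, "->", workerPool(infer, false))
	fmt.Println(parse.Name, "->", workerPool(parse, true))
}
```

The fallback to "cpu" is the important part: a runner that does not know
the URN behaves exactly like one without the hardware.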

I look forward to the PR, and before then, all the discussion the community
has about this new field in the model proto.





On Thu, 12 Nov 2020 at 09:41, Mirac Vuslat Basaran <mir...@google.com>
wrote:

> Hi all,
>
> We would like to propose adding functionality to add annotations to Beam
> transforms. These annotations would be readable by the runner, and the
> runner could then act on this information; for example by doing some
> special resource allocation. There have been discussions around annotations
> (or hints as they are sometimes called) in the past (
> https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
>
> https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
> This proposal aims to come up with an accepted lightweight solution with a
> follow-up Pull Request to implement it in Go.
>
> By annotations, we refer to optional information / hints provided to the
> runner. This proposal explicitly excludes “required” annotations that could
> cause incorrect output. A runner that does not understand the annotations
> and ignores them must still produce correct output, with perhaps a
> degradation in performance or other nonfunctional requirements. Supporting
> only “optional” annotations allows for compatibility with runners that do
> not recognize those annotations.
>
> A good example of an optional annotation is marking a transform to be run
> on GPU or TPU or that it needs a certain amount of RAM. If the runner knows
> about this annotation, it can then allocate the requested resources for
> that transform only to improve performance and avoid using these scarce
> resources for other transforms.
>
> Another example of an optional annotation is marking a transform to run on
> secure hardware, or to give hints to profiling/dynamic analysis tools.
>
> In all these cases, the runner can run the pipeline with or without the
> annotation, and in both cases the same output would be produced. There
> would be differences in nonfunctional requirements (performance, security,
> ease of profiling), hence the optional part.
>
> A counter-example that this proposal explicitly excludes is marking a
> transform as requiring sorted input, for example a transform that
> expects time-sorted input in order to produce the correct output. If the
> runner ignores this requirement, it would risk producing an incorrect
> output. In order to avoid this, we exclude these required annotations.
>
> Implementation-wise, we propose to add a field:
>  - map<string, bytes> annotations = 8;
> to PTransform proto (
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
> The key would be a URN that uniquely identifies the type of annotation. The
> value is an opaque byte array (e.g., a serialized protocol buffer) to allow
> for maximum flexibility to the implementation of that specific type of
> annotation.
>
> We have a specific interest in adding this to the Go SDK. In Go, the user
> would specify the annotations to a structural ParDo as follows, by defining
> a field:
>  - Annotations map[string][]byte
> and filling it out. For simplicity, we will only support structural DoFns
> in Go for the time being.
>
> The runners could then read the annotations from the PTransform proto and
> support the annotations that they would like to in the way they want.
>
> Please let me know what you think, and what would be the best way to
> proceed, e.g., we can share a small design doc or, in case there are no
> major objections, directly create a pull request for Go where we can
> discuss the implementation details.
>
> Best,
> Mirac and team
>
