Re: PTransform Annotations Proposal

Reza Ardeshir Rokni Mon, 16 Nov 2020 03:26:22 -0800

+1 having a NeedsRam(x) annotation would be incredibly helpful.

On Fri, 13 Nov 2020 at 05:57, Robert Burke <[email protected]> wrote:


> (Disclaimer, Mirac and their team did approach me about this beforehand as
> their interest is in the Go SDK.)
>
> +1 I think it's a good idea. As you've pointed out, there are many
> opportunities for optional pipeline analysis here as well.
>
> A strawman counterpoint would be to reuse the static DisplayData for
> this kind of thing, but I think that's not necessarily the same thing. It's
> very hard to get something that's purely intended for human consumption to
> also be suitable for machine consumption without various adapters and
> such, and it would be an awful hack. Having something specifically for
> machines to understand is valuable in and of itself.
>
> I appreciate the versatility of simply using known URNs and their defined
> formats, and especially keeping the proposal to optional annotations that
> don't affect correctness. This will work well with most DoFns that need
> specialized hardware. They can usually be emulated on ordinary CPUs, which
> is good for testing, but can perform much better if the hardware is
> available. This also allows the runners to move execution of specific DoFns
> to the machines with the specialized hardware, for better scheduling of
> resources.
>
> I look forward to the PR, and before then, all the discussion the
> community has about this new field in the model proto.
>
> On Thu, 12 Nov 2020 at 09:41, Mirac Vuslat Basaran <[email protected]>
> wrote:
>
>> Hi all,
>>
>> We would like to propose adding functionality to add annotations to Beam
>> transforms. These annotations would be readable by the runner, and the
>> runner could then act on this information; for example by doing some
>> special resource allocation. There have been discussions around annotations
>> (or hints as they are sometimes called) in the past (
>> https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
>>
>> https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
>> This proposal aims to come up with an accepted lightweight solution with a
>> follow-up Pull Request to implement it in Go.
>>
>> By annotations, we refer to optional information / hints provided to the
>> runner. This proposal explicitly excludes “required” annotations that could
>> cause incorrect output. A runner that does not understand the annotations
>> and ignores them must still produce correct output, with perhaps a
>> degradation in performance or other nonfunctional requirements. Supporting
>> only “optional” annotations allows for compatibility with runners that do
>> not recognize those annotations.
>>
>> A good example of an optional annotation is marking a transform to be run
>> on a GPU or TPU, or as needing a certain amount of RAM. If the runner knows
>> about this annotation, it can then allocate the requested resources for
>> that transform only, improving performance and avoiding the use of these
>> scarce resources for other transforms.
>>
>> Another example of an optional annotation is marking a transform to run
>> on secure hardware, or to give hints to profiling/dynamic analysis tools.
>>
>> In all these cases, the runner can run the pipeline with or without the
>> annotation, and in both cases the same output would be produced. There
>> would be differences in nonfunctional requirements (performance, security,
>> ease of profiling), hence the optional part.
>>
>> A counter-example that this proposal explicitly excludes is marking a
>> transform as requiring sorted input, for example a transform that
>> expects time-sorted input in order to produce the correct output. If the
>> runner ignored this requirement, it would risk producing incorrect
>> output. To avoid this, we exclude such required annotations.
>>
>> Implementation-wise, we propose to add a field:
>>  - map<string, bytes> annotations = 8;
>> to the PTransform proto (
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
>> The key would be a URN that uniquely identifies the type of annotation. The
>> value is an opaque byte array (e.g., a serialized protocol buffer) to allow
>> for maximum flexibility to the implementation of that specific type of
>> annotation.
>>
>> We have a specific interest in adding this to the Go SDK. In Go, the user
>> would specify the annotations to a structural ParDo as follows, by defining
>> a field:
>>  - Annotations map[string][]byte
>> and filling it out. For simplicity, we will only support structural DoFns
>> in Go for the time being.
>>
>> The runners could then read the annotations from the PTransform proto and
>> support whichever annotations they choose, in whatever way they want.
>>
>> Please let me know what you think, and what would be the best way to
>> proceed, e.g., we can share a small design doc or, in case there are no
>> major objections, directly create a pull request for Go where we can
>> discuss the implementation details.
>>
>> Best,
>> Mirac and team
>>
>
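
For illustration only, here is a minimal Go sketch of the Annotations map[string][]byte field proposed in the quoted message, applied to the NeedsRam(x) idea. It is not the proposed implementation and does not use the Beam SDK. The URN string, the resizeFn type, the helpers encodeRAM and requestedRAM, and the 8-byte big-endian value encoding are all assumptions made up for this example. The proposal only says the key is a URN and the value is an opaque byte array.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// needsRAMURN is a hypothetical URN invented for this sketch; neither Beam
// nor the proposal defines it.
const needsRAMURN = "example:annotation:needs_ram_bytes:v1"

// resizeFn stands in for a structural DoFn that carries the proposed field.
// The type name is hypothetical.
type resizeFn struct {
	Annotations map[string][]byte
}

// encodeRAM encodes a byte count as 8 big-endian bytes. The encoding is an
// assumption; the proposal leaves the value format to each annotation type.
func encodeRAM(n uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, n)
	return b
}

// requestedRAM shows how a runner that recognizes the URN might read it.
// It returns (0, false) when the annotation is absent or malformed, so the
// caller can fall back to default behavior and still produce correct output.
func requestedRAM(annotations map[string][]byte) (uint64, bool) {
	v, ok := annotations[needsRAMURN]
	if !ok || len(v) != 8 {
		return 0, false
	}
	return binary.BigEndian.Uint64(v), true
}

func main() {
	fn := &resizeFn{Annotations: map[string][]byte{
		needsRAMURN: encodeRAM(16 << 30),
		// An annotation this runner does not understand is simply ignored.
		"example:annotation:unknown:v1": []byte("ignored"),
	}}

	if n, ok := requestedRAM(fn.Annotations); ok {
		fmt.Printf("schedule on a worker with at least %d bytes of RAM\n", n)
	} else {
		fmt.Println("no recognized annotation; use a default worker")
	}
}
```

A runner that does not know the URN never looks it up, which matches the optional-only rule in the proposal: ignoring an annotation may cost performance, but it cannot change the output.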
