PTransform Annotations Proposal
Hi all, We would like to propose adding functionality to add annotations to Beam transforms. These annotations would be readable by the runner, and the runner could then act on this information; for example by doing some special resource allocation. There have been discussions around annotations (or hints as they are sometimes called) in the past ( https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E, https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E). This proposal aims to come up with an accepted lightweight solution with a follow-up Pull Request to implement it in Go. By annotations, we refer to optional information / hints provided to the runner. This proposal explicitly excludes “required” annotations that could cause incorrect output. A runner that does not understand the annotations and ignores them must still produce correct output, with perhaps a degradation in performance or other nonfunctional requirements. Supporting only “optional” annotations allows for compatibility with runners that do not recognize those annotations. A good example of an optional annotation is marking a transform to be run on GPU or TPU or that it needs a certain amount of RAM. If the runner knows about this annotation, it can then allocate the requested resources for that transform only to improve performance and avoid using these scarce resources for other transforms. Another example of an optional annotation is marking a transform to run on secure hardware, or to give hints to profiling/dynamic analysis tools. In all these cases, the runner can run the pipeline with or without the annotation, and in both cases the same output would be produced. There would be differences in nonfunctional requirements (performance, security, ease of profiling), hence the optional part. A counter-example that this proposal explicitly excludes is marking a transform as requiring sorted input. For example, on a transform that expects time-sorted input in order to produce the correct output. If the runner ignores this requirement, it would risk producing an incorrect output. In order to avoid this, we exclude these required annotations. Implementation-wise, we propose to add a field: - map annotations = 8; to PTransform proto ( https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127). The key would be a URN that uniquely identifies the type of annotation. The value is an opaque byte array (e.g., a serialized protocol buffer) to allow for maximum flexibility to the implementation of that specific type of annotation. We have a specific interest in adding this to the Go SDK. In Go, the user would specify the annotations to a structural ParDo as follows, by defining a field: - Annotations map[string][]byte and filling it out. For simplicity, we will only support structural doFns in Go for the time being. The runners could then read the annotations from the PTransform proto and support the annotations that they would like to in the way they want. Please let me know what you think, and what would be the best way to proceed, e.g., we can share a small design doc or, in case there are no major objections, directly create a pull request for Go where we can discuss the implementation details. Best, Mirac and team
Re: PTransform Annotations Proposal
Thanks everyone so much for their input and for the insightful discussion. Not being knowledgeable about Beam's internals, I have to say I am a bit lost on the PTransform vs. environment discussion. I do agree with Burke's notion that merge rules are very annotation dependent, I don't think we can find a one-size-fits-all solution for that. So this might be actually be an argument in favour of having annotations on PTransforms, since it avoids the conflation with environments. Also in general, I feel that having annotations per single transform (rather than composite) and on PTransforms could lead to a simpler design. Seeing as there are valuable arguments in favour of both (PTransform and environments) with no clear(?) "best solution", I would propose moving forward with the initial (PTransform) design to ship the feature and unblock teams asking for it. If it turns out that there was indeed a need to have annotations in environments, we could always refactor it. On 2020/11/17 19:07:22, Robert Bradshaw wrote: > So far we have two distinct usecases for annotations: resource hints > and privacy directives, and I've been trying to figure out how to > reconcile them, but they seem to have very different characteristics. > (It would be nice to come up with other uses as well to see if we're > really coming up with a generally useful mode--I think display data > could fit into this as a new kind of annotation rather than being a > top-level property, and it could make sense on both leaf and composite > transforms.) > > To me, resource hints like GPU are inextricably tied to the > environment. A transform tagged with GPU should reference a Fn that > invokes GPU-accelerated code that lives in a particular environment. > Something like high-mem is a bit squishier. Some DoFns take a lot of > memory, but on the other hand one could imagine labeling a CoGBK as > high-mem due to knowing that, in this particular usage, there will be > lots of values with the same key. Ideally runners would be intelligent > enough to automatically learn memory usage, but even in this case it > may be a good hint to try and learn the requirements for DoFn A and > DoFn B separately (which is difficult if they are always colocated, > but valuable if, e.g. A takes a huge amount of memory and B takes a > huge amount of wall time). > > Note that tying things to the environment does not preclude using them > in non-portable runners as they'll still have an SDK-level > representation (though I don't think we should have an explicit goal > of feature parity for non-portable runners, e.g. multi-language isn't > happening, and hope that non-portable runners go away soon anyway). > > Now let's consider privacy annotations. To make things very concrete, > imagine a transform AverageSpendPerZipCode which takes as input (user, > zip, spend), all users unique, and returns (zip, avg(spend)). In > Python, this is GroupBy('zip').aggregate_field('spend', > MeanCombineFn()). This is not very privacy preserving to those users > who are the only (or one of a few) in a zip code. So we could define a > transform PrivacyPreservingAverageSpendPerZipCode as > > @ptransform_fn > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold) > counts_per_zip = spend_per_user | > GroupBy('zip').aggregate_field('user', CountCombineFn()) > spend_per_zip = spend_per_user | > GroupBy('zip').aggregate_field('spend', MeanCombineFn()) > filtered = spend_per_zip | beam.Filter( > lambda x, counts: counts[x.zip] > threshold, > counts=AsMap(counts_per_zip)) > return filtered > > We now have a composite that has privacy preserving properties (i.e. > the input may be quite sensitive, but the output is not, depending on > the value of threshold). What is interesting here is that it is only > the composite that has this property--no individual sub-transform is > itself privacy preserving. Furthermore, an optimizer may notice we're > doing aggregation on the same key twice and rewrite this using > (logically) > > GroupBy('zip').aggregate_field('user', > CountCombineFn()).aggregate_field('spend', MeanCombineFn()) > > and then applying the filter, which is semantically equivalent and > satisfies the privacy annotations (and notably that does not even > require the optimizer to interpret the annotations, just pass them > on). To me, this implies that these annotations belong on the > composites, and *not* on the leaf nodes (where they would be > incorrect). > > I'll leave aside most questions of API until we figure out the model > semantics, but wanted to throw one possible idea out (though I am > ambivalent about it). Instead of attaching things to transforms, we > can just wrap transforms in composites that have no role other than > declaring information about their contents. E.g. we could have a > composite transform whose payload is simply an assertion of the > privacy (or resource?) properties of its inner structure. This
Re: PTransform Annotations Proposal
Created a PR without unit tests at https://github.com/apache/beam/pull/13434. Please have a look. Thanks, Mirac On 2020/11/25 18:50:19, Robert Burke wrote: > Hmmm. Fair. I'm mostly concerned about the pathological case where we end > up with a distinct Environment per transform, but there are likely > practical cases where that's reasonable (High mem to GPU to TPU, to ARM) > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw wrote: > > > I'd like to continue the discussion *and* see an implementation for > > the part we've settled on. I was asking why not have "every distinct > > physical concern means a distinct environment?" > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke wrote: > > > > > > Mostly because perfect is the enemy of good enough. We have a proposal, > > we have clear boundaries for it. It's fine if the discussion continues, but > > I see no evidence of concerns that should prevent starting an > > implementation, because it seems we'll need both anyway. > > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw > > wrote: > > >> > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke > > wrote: > > >> > > > >> > It sounds like we've come to the position that non-correctness > > affecting Ptransform Annotations are valuable at both leaf and composite > > levels, and don't remove the potential need for a similar feature on > > Environments, to handle physical concerns equirements for worker processes > > to have (such as Ram, CPU, or GPU requirements.) > > >> > > > >> > Kenn, it's not clear what part of the solution (an annotation field > > on the Ptransform proto message) would need to change to satisfy your scope > > concern, beyond documenting unambiguously that these may not be used for > > physical concerns or things that affect correctness. > > >> > > >> I'll let Kenn answer as well, but from my point of view, explicitly > > >> having somewhere better to put these things would help. > > >> > > >> > I'm also unclear your scope concern not matching, given the above. > > Your first paragraph reads very supportive of logical annotations on > > Ptransforms, and that matches 1-1 with the current proposed solution. Can > > you clarify your concern? > > >> > > > >> > I don't wish to scope creep on the physical requirements issue at > > this time. It seems we are agreed they should end up on environments, but > > I'm not seeing proposals on the right way to execute them at this time.They > > seem to be a fruitful topic of discussion, in particular > > unifying/consolidating them for efficient use of resources. I don't think > > we want to end up in a state where every distinct physical concern means a > > distinct environment. > > >> > > >> Why not? Assuming, of course, that runners are free to merge > > >> environments (merging those resource hints they understand and are > > >> otherwise compatible, and discarding those they don't) for efficient > > >> execution. > > >> > > >> > I for one am ready to see a PR. > > >> > > >> +1 > > >> > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles > > wrote: > > >> >> > > >> >> > > >> >> > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw > > wrote: > > >> >>> > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran < > > mir...@google.com> wrote: > > >> >>> > > > >> >>> > Thanks everyone so much for their input and for the insightful > > discussion. > > >> >>> > > > >> >>> > Not being knowledgeable about Beam's internals, I have to say I > > am a bit lost on the PTransform vs. environment discussion. > > >> >>> > > > >> >>> > I do agree with Burke's notion that merge rules are very > > annotation dependent, I don't think we can find a one-size-fits-all > > solution for that. So this might be actually be an argument in favour of > > having annotations on PTransforms, since it avoids the conflation with > > environments. > > >> >>> > > > >> >>> > Also in general, I feel that having annotations per single > > transform (rather than composite)
Re: PTransform Annotations Proposal
Wrote a design draft for resource-related annotations. Please have a look: https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing Cheers, Mirac On 2020/11/26 20:20:09, Mirac Vuslat Basaran wrote: > Created a PR without unit tests at https://github.com/apache/beam/pull/13434. > Please have a look. > > Thanks, > Mirac > > > On 2020/11/25 18:50:19, Robert Burke wrote: > > Hmmm. Fair. I'm mostly concerned about the pathological case where we end > > up with a distinct Environment per transform, but there are likely > > practical cases where that's reasonable (High mem to GPU to TPU, to ARM) > > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw wrote: > > > > > I'd like to continue the discussion *and* see an implementation for > > > the part we've settled on. I was asking why not have "every distinct > > > physical concern means a distinct environment?" > > > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke wrote: > > > > > > > > Mostly because perfect is the enemy of good enough. We have a proposal, > > > we have clear boundaries for it. It's fine if the discussion continues, > > > but > > > I see no evidence of concerns that should prevent starting an > > > implementation, because it seems we'll need both anyway. > > > > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw > > > wrote: > > > >> > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke > > > wrote: > > > >> > > > > >> > It sounds like we've come to the position that non-correctness > > > affecting Ptransform Annotations are valuable at both leaf and composite > > > levels, and don't remove the potential need for a similar feature on > > > Environments, to handle physical concerns equirements for worker processes > > > to have (such as Ram, CPU, or GPU requirements.) > > > >> > > > > >> > Kenn, it's not clear what part of the solution (an annotation field > > > on the Ptransform proto message) would need to change to satisfy your > > > scope > > > concern, beyond documenting unambiguously that these may not be used for > > > physical concerns or things that affect correctness. > > > >> > > > >> I'll let Kenn answer as well, but from my point of view, explicitly > > > >> having somewhere better to put these things would help. > > > >> > > > >> > I'm also unclear your scope concern not matching, given the above. > > > Your first paragraph reads very supportive of logical annotations on > > > Ptransforms, and that matches 1-1 with the current proposed solution. Can > > > you clarify your concern? > > > >> > > > > >> > I don't wish to scope creep on the physical requirements issue at > > > this time. It seems we are agreed they should end up on environments, but > > > I'm not seeing proposals on the right way to execute them at this > > > time.They > > > seem to be a fruitful topic of discussion, in particular > > > unifying/consolidating them for efficient use of resources. I don't think > > > we want to end up in a state where every distinct physical concern means a > > > distinct environment. > > > >> > > > >> Why not? Assuming, of course, that runners are free to merge > > > >> environments (merging those resource hints they understand and are > > > >> otherwise compatible, and discarding those they don't) for efficient > > > >> execution. > > > >> > > > >> > I for one am ready to see a PR. > > > >> > > > >> +1 > > > >> > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles > > > wrote: > > > >> >> > > > >> >> > > > >> >> > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw > > > >> >> > > > wrote: > > > >> >>> > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran < > > > mir...@google.com> wrote: > > > >> >>> > > > > >> >>> > Thanks everyone so much for their input and for the insightful > > > discussion. > > > >> >>> > > > > >> >>&g