Thanks everyone so much for their input and for the insightful discussion. Not being knowledgeable about Beam's internals, I have to say I am a bit lost on the PTransform vs. environment discussion.
I do agree with Burke's notion that merge rules are very annotation dependent, I don't think we can find a one-size-fits-all solution for that. So this might be actually be an argument in favour of having annotations on PTransforms, since it avoids the conflation with environments. Also in general, I feel that having annotations per single transform (rather than composite) and on PTransforms could lead to a simpler design. Seeing as there are valuable arguments in favour of both (PTransform and environments) with no clear(?) "best solution", I would propose moving forward with the initial (PTransform) design to ship the feature and unblock teams asking for it. If it turns out that there was indeed a need to have annotations in environments, we could always refactor it. On 2020/11/17 19:07:22, Robert Bradshaw <[email protected]> wrote: > So far we have two distinct usecases for annotations: resource hints > and privacy directives, and I've been trying to figure out how to > reconcile them, but they seem to have very different characteristics. > (It would be nice to come up with other uses as well to see if we're > really coming up with a generally useful mode--I think display data > could fit into this as a new kind of annotation rather than being a > top-level property, and it could make sense on both leaf and composite > transforms.) > > To me, resource hints like GPU are inextricably tied to the > environment. A transform tagged with GPU should reference a Fn that > invokes GPU-accelerated code that lives in a particular environment. > Something like high-mem is a bit squishier. Some DoFns take a lot of > memory, but on the other hand one could imagine labeling a CoGBK as > high-mem due to knowing that, in this particular usage, there will be > lots of values with the same key. Ideally runners would be intelligent > enough to automatically learn memory usage, but even in this case it > may be a good hint to try and learn the requirements for DoFn A and > DoFn B separately (which is difficult if they are always colocated, > but valuable if, e.g. A takes a huge amount of memory and B takes a > huge amount of wall time). > > Note that tying things to the environment does not preclude using them > in non-portable runners as they'll still have an SDK-level > representation (though I don't think we should have an explicit goal > of feature parity for non-portable runners, e.g. multi-language isn't > happening, and hope that non-portable runners go away soon anyway). > > Now let's consider privacy annotations. To make things very concrete, > imagine a transform AverageSpendPerZipCode which takes as input (user, > zip, spend), all users unique, and returns (zip, avg(spend)). In > Python, this is GroupBy('zip').aggregate_field('spend', > MeanCombineFn()). This is not very privacy preserving to those users > who are the only (or one of a few) in a zip code. So we could define a > transform PrivacyPreservingAverageSpendPerZipCode as > > @ptransform_fn > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold) > counts_per_zip = spend_per_user | > GroupBy('zip').aggregate_field('user', CountCombineFn()) > spend_per_zip = spend_per_user | > GroupBy('zip').aggregate_field('spend', MeanCombineFn()) > filtered = spend_per_zip | beam.Filter( > lambda x, counts: counts[x.zip] > threshold, > counts=AsMap(counts_per_zip)) > return filtered > > We now have a composite that has privacy preserving properties (i.e. > the input may be quite sensitive, but the output is not, depending on > the value of threshold). What is interesting here is that it is only > the composite that has this property--no individual sub-transform is > itself privacy preserving. Furthermore, an optimizer may notice we're > doing aggregation on the same key twice and rewrite this using > (logically) > > GroupBy('zip').aggregate_field('user', > CountCombineFn()).aggregate_field('spend', MeanCombineFn()) > > and then applying the filter, which is semantically equivalent and > satisfies the privacy annotations (and notably that does not even > require the optimizer to interpret the annotations, just pass them > on). To me, this implies that these annotations belong on the > composites, and *not* on the leaf nodes (where they would be > incorrect). > > I'll leave aside most questions of API until we figure out the model > semantics, but wanted to throw one possible idea out (though I am > ambivalent about it). Instead of attaching things to transforms, we > can just wrap transforms in composites that have no role other than > declaring information about their contents. E.g. we could have a > composite transform whose payload is simply an assertion of the > privacy (or resource?) properties of its inner structure. This would > be just as expressive as adding new properties to transforms > themselves (but would add an extra level of nesting, and make > respecting the precice nesting more important). > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <[email protected]> wrote: > > > > +1 to discussing PCollection annotations on a separate thread. It would be > > confusing to mix them up. > > > > ----------- > > > > The question around conflicts is interesting, but confusing to me. I don't > > think they exist in general. I keep coming back around to that it depends > > on the annotation and the purpose of composites. Optionality saves us here > > too. > > > > Composites are nothing without their internal hypergraph structure. > > Eventually it comes down to executing the leaf nodes. The alternative to > > executing the leaf nodes is when the composite represents a known transform > > and is replaced by the runner on submission time. Lets look at each. > > > > If there's a property that only exists on the leaf nodes, then it's not > > possible to bubble up that property to the composite in all cases. > > Afterall, it's not necessarily the case that a privacy preserving transform > > maintains the property for all output edges as not all such edges pass > > through the preserving transform. > > > > On the other hand, with memory or gpu recommendations, that might set a low > > bar on the composite level. > > > > But, composites (any transform really) can be runner replaced. I think it's > > fair to say that a runner replaced composite is not beholden to the > > annotations of the original leaf transforms, especially around physical > > requirements. The implementations are different. If a known composite at > > the composite level requires GPUs and it's known replacement doesn't, I'd > > posit that replacement was a choice the runner made since it can't > > provision machines with GPUs. > > > > But, crucially around privacy annotated transforms, a runner likely > > shouldn't replace a given subgraph that contains a privacy annotationed > > transform unless the replacements provide the same level of privacy. > > However, such replacements only happens with well known transforms with > > known properties anyway, so this can serve as an additional layer of > > validation for a runner aware of the properties. > > > > This brings me back to my position: that the notion of conflicts is very > > annotation dependant, and that defining them as optional is the most > > important feature to avoid issues. Conflicts don't exist as an inherent > > property of annotations on ptransform of the hypergraph structure. Am i > > wrong? No one has come up with an actual example of a conflict as far as i > > understand the thread. > > > > Even Reuven's original question is more about whether the runner is forced > > to look at leaf bodes rather than only looking at the composite. Assuming > > the composite isn't replaced, the runner needs to look at the leaf nodes > > regardless. And as discussed above there's no generalized semantics that > > fit for all kinds of annotations, once replacements are also considered. > > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <[email protected]> wrote: > >> > >> +1 Nice to see there is finally interest on this. Annotations for > >> PTransforms make total sense! > >> > >> The semantics should be strictly optional for runners and correct > >> execution should not be affected by lack of support of any annotation. > >> We should however keep the set of annotations small. > >> > >> > PTransforms are hierarchical - namely a PTransform contains other > >> > PTransforms, and so on. Is the runner expected to resolve all > >> > annotations down to leaf nodes? What happens if that results in > >> > conflicting annotations? > >> > >> +1 to this question, This needs to be detailed. > >> > >> I am curious about how the end user APIs of this will look maybe in > >> Java or Python, just an extra method to inject a Map or via Java > >> annotations/Python decorators? > >> > >> We might prefer not to mix the concepts of annotations and > >> environments because this will limit the scope of annotations. > >> Annotations are different from environments because they serve a more > >> general idea: to express an intention and it is up to the runner to > >> choose the strategy to accomplish this, for example in the GPU > >> assignation case it could be to rewrite resource allocation via > >> Environments but it could also just delegate this to a resource > >> manager which is what we could do for example for GPU (or data > >> locality) cases on the Spark/Flink classic runners. If we tie this to > >> environments we will leave classic runners out of the loop for no > >> major reason and also not cover use cases not related to resource > >> allocation. > >> > >> I do not understand the use case to justify PCollection annotations > >> but to not mix this thread with them, would you be interested to > >> elaborate more about them in a different thread Jan? > >> > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <[email protected]> > >> wrote: > >> > > >> > I agree things like GPU, high-mem, etc. belong to the environment. If > >> > annotations are truly advisory, one can imagine merging environments > >> > by taking the union of annotations and still producing a correct > >> > pipeline. (This would mean that annotations would have to be a > >> > multi-map...) > >> > > >> > On the other hand, this doesn't seem to handle the case of privacy > >> > analysis, which could apply to composites without applying to any > >> > individual component, and don't really make sense as part of a > >> > fusion/execution story. > >> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <[email protected]> wrote: > >> > > > >> > > That's good historical context. > >> > > > >> > > But then we'd still need to codify the annotation would need to be > >> > > optional, and not affect correctness. > >> > > > >> > > Conflicts become easier to manage, (as environments with conflicting > >> > > annotations simply don't get merged, and stay as distinct > >> > > environments) but are still notionally annotation dependant. Do most > >> > > runners handle environments so individually though? > >> > > > >> > > Reuven's question is a good one though. For the Go SDK, and the > >> > > proposed implementation i saw, they only applied to leaf nodes. This > >> > > is an artifact of how the Go SDK handles composites. Nothing stops it > >> > > from being implemented on the composites Go has, but it didn't make > >> > > sense otherwise. AFAICT Composites are generally for organizational > >> > > convenience and not for functional aspects. Is this wrong? Afterall, > >> > > does it make sense for environments to be on leaves and composites > >> > > either? It's the same issue there. > >> > > > >> > > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <[email protected]> wrote: > >> > >> > >> > >> I am +1 to the proposal but believe it should be moved to the > >> > >> Environment. I could be convinced otherwise, but would want to really > >> > >> understand the details. > >> > >> > >> > >> I think we haven't done a great job communicating the purpose of the > >> > >> Environment proto. It was explicitly created for this purpose. > >> > >> > >> > >> 1. It tells the runner things it needs to know to interpret the DoFn > >> > >> (or other UDF). So these are the existing proto fields like docker > >> > >> image (in the payload) and required artifacts that were staged. > >> > >> 2. It is also the place for additional requirements or hints like > >> > >> "high mem" or "GPU" etc. > >> > >> > >> > >> Every user function has an associated environment, and environments > >> > >> exist only for the purpose of executing user functions. In fact, > >> > >> Environment originated as inline requirements/attributes for a user > >> > >> function proto message and was separated just to make the proto > >> > >> smaller. > >> > >> > >> > >> A PTransform is an abstract concept for organizing the graph, not an > >> > >> executable thing. So a hint/capability/requirement on a PTransform > >> > >> only really makes sense as a scoping mechanism for applying a hint to > >> > >> a bunch of functions within a subgraph. This seems like a user > >> > >> interface concern and the SDK should own propagating the hints. If > >> > >> the hint truly applies to the whole PTransform and *not* the parts, > >> > >> then I am interested in learning about that. > >> > >> > >> > >> Kenn > >> > >> > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <[email protected]> > >> > >> wrote: > >> > >>> > >> > >>> That's a good question. > >> > >>> > >> > >>> I think the main difference is a matter of scope. Annotations would > >> > >>> apply to a PTransform while an environment applies to sets of > >> > >>> transforms. A difference is the optional nature of the annotations > >> > >>> they don't affect correctness. Runners don't need to do anything > >> > >>> with them and still execute the pipeline correctly. > >> > >>> > >> > >>> Consider a privacy analysis on a pipeline graph. An annotation > >> > >>> indicating that a transform provides a certain level of > >> > >>> anonymization can be used in an analysis to determine if the > >> > >>> downstream transforms are encountering raw data or not. > >> > >>> > >> > >>> From my understanding (which can be wrong) environments are rigid. > >> > >>> Transforms in different environments can't be fused. "This is the > >> > >>> python env", "this is the java env" can't be merged together. It's > >> > >>> not clear to me that we have defined when environments are safely > >> > >>> fuseable outside of equality. There's value in that simplicity. > >> > >>> > >> > >>> AFIACT environment has less to do with the machines a pipeline is > >> > >>> executing on than it does about the kinds of SDK pipelines it > >> > >>> understands and can execute. > >> > >>> > >> > >>> > >> > >>> > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <[email protected]> > >> > >>> wrote: > >> > >>>>> > >> > >>>>> > >> > >>>>> Another example of an optional annotation is marking a transform > >> > >>>>> to run on secure hardware, or to give hints to profiling/dynamic > >> > >>>>> analysis tools. > >> > >>>> > >> > >>>> > >> > >>>> There seems to be a lot of overlap between this idea and > >> > >>>> Environments. Can you talk about how you feel they may be > >> > >>>> different or related? For example, I could see annotations as a > >> > >>>> way of tagging transforms with an Environment, or I could see > >> > >>>> Environments becoming a specialized form of annotation. > >> > >>>> > >> > >>>> -chad > >> > >>>> >
