On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <[email protected]> wrote: > > Thanks everyone so much for their input and for the insightful discussion. > > Not being knowledgeable about Beam's internals, I have to say I am a bit lost > on the PTransform vs. environment discussion. > > I do agree with Burke's notion that merge rules are very annotation > dependent, I don't think we can find a one-size-fits-all solution for that. > So this might be actually be an argument in favour of having annotations on > PTransforms, since it avoids the conflation with environments. > > Also in general, I feel that having annotations per single transform (rather > than composite) and on PTransforms could lead to a simpler design.
If we want to use these for privacy, I don't see how attaching them to leaf transforms alone could work. (Even CombinePerKey is a composite.) > Seeing as there are valuable arguments in favour of both (PTransform and > environments) with no clear(?) "best solution", I would propose moving > forward with the initial (PTransform) design to ship the feature and unblock > teams asking for it. If it turns out that there was indeed a need to have > annotations in environments, we could always refactor it. I have yet to see any arguments that resource-level hints, such as memory or GPU, don't better belong on the environment. But moving forward on PTransform-level ones for logical statements like privacy declarations makes sense. > On 2020/11/17 19:07:22, Robert Bradshaw <[email protected]> wrote: > > So far we have two distinct usecases for annotations: resource hints > > and privacy directives, and I've been trying to figure out how to > > reconcile them, but they seem to have very different characteristics. > > (It would be nice to come up with other uses as well to see if we're > > really coming up with a generally useful mode--I think display data > > could fit into this as a new kind of annotation rather than being a > > top-level property, and it could make sense on both leaf and composite > > transforms.) > > > > To me, resource hints like GPU are inextricably tied to the > > environment. A transform tagged with GPU should reference a Fn that > > invokes GPU-accelerated code that lives in a particular environment. > > Something like high-mem is a bit squishier. Some DoFns take a lot of > > memory, but on the other hand one could imagine labeling a CoGBK as > > high-mem due to knowing that, in this particular usage, there will be > > lots of values with the same key. Ideally runners would be intelligent > > enough to automatically learn memory usage, but even in this case it > > may be a good hint to try and learn the requirements for DoFn A and > > DoFn B separately (which is difficult if they are always colocated, > > but valuable if, e.g. A takes a huge amount of memory and B takes a > > huge amount of wall time). > > > > Note that tying things to the environment does not preclude using them > > in non-portable runners as they'll still have an SDK-level > > representation (though I don't think we should have an explicit goal > > of feature parity for non-portable runners, e.g. multi-language isn't > > happening, and hope that non-portable runners go away soon anyway). > > > > Now let's consider privacy annotations. To make things very concrete, > > imagine a transform AverageSpendPerZipCode which takes as input (user, > > zip, spend), all users unique, and returns (zip, avg(spend)). In > > Python, this is GroupBy('zip').aggregate_field('spend', > > MeanCombineFn()). This is not very privacy preserving to those users > > who are the only (or one of a few) in a zip code. So we could define a > > transform PrivacyPreservingAverageSpendPerZipCode as > > > > @ptransform_fn > > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold) > > counts_per_zip = spend_per_user | > > GroupBy('zip').aggregate_field('user', CountCombineFn()) > > spend_per_zip = spend_per_user | > > GroupBy('zip').aggregate_field('spend', MeanCombineFn()) > > filtered = spend_per_zip | beam.Filter( > > lambda x, counts: counts[x.zip] > threshold, > > counts=AsMap(counts_per_zip)) > > return filtered > > > > We now have a composite that has privacy preserving properties (i.e. > > the input may be quite sensitive, but the output is not, depending on > > the value of threshold). What is interesting here is that it is only > > the composite that has this property--no individual sub-transform is > > itself privacy preserving. Furthermore, an optimizer may notice we're > > doing aggregation on the same key twice and rewrite this using > > (logically) > > > > GroupBy('zip').aggregate_field('user', > > CountCombineFn()).aggregate_field('spend', MeanCombineFn()) > > > > and then applying the filter, which is semantically equivalent and > > satisfies the privacy annotations (and notably that does not even > > require the optimizer to interpret the annotations, just pass them > > on). To me, this implies that these annotations belong on the > > composites, and *not* on the leaf nodes (where they would be > > incorrect). > > > > I'll leave aside most questions of API until we figure out the model > > semantics, but wanted to throw one possible idea out (though I am > > ambivalent about it). Instead of attaching things to transforms, we > > can just wrap transforms in composites that have no role other than > > declaring information about their contents. E.g. we could have a > > composite transform whose payload is simply an assertion of the > > privacy (or resource?) properties of its inner structure. This would > > be just as expressive as adding new properties to transforms > > themselves (but would add an extra level of nesting, and make > > respecting the precice nesting more important). > > > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <[email protected]> wrote: > > > > > > +1 to discussing PCollection annotations on a separate thread. It would > > > be confusing to mix them up. > > > > > > ----------- > > > > > > The question around conflicts is interesting, but confusing to me. I > > > don't think they exist in general. I keep coming back around to that it > > > depends on the annotation and the purpose of composites. Optionality > > > saves us here too. > > > > > > Composites are nothing without their internal hypergraph structure. > > > Eventually it comes down to executing the leaf nodes. The alternative to > > > executing the leaf nodes is when the composite represents a known > > > transform and is replaced by the runner on submission time. Lets look at > > > each. > > > > > > If there's a property that only exists on the leaf nodes, then it's not > > > possible to bubble up that property to the composite in all cases. > > > Afterall, it's not necessarily the case that a privacy preserving > > > transform maintains the property for all output edges as not all such > > > edges pass through the preserving transform. > > > > > > On the other hand, with memory or gpu recommendations, that might set a > > > low bar on the composite level. > > > > > > But, composites (any transform really) can be runner replaced. I think > > > it's fair to say that a runner replaced composite is not beholden to the > > > annotations of the original leaf transforms, especially around physical > > > requirements. The implementations are different. If a known composite at > > > the composite level requires GPUs and it's known replacement doesn't, I'd > > > posit that replacement was a choice the runner made since it can't > > > provision machines with GPUs. > > > > > > But, crucially around privacy annotated transforms, a runner likely > > > shouldn't replace a given subgraph that contains a privacy annotationed > > > transform unless the replacements provide the same level of privacy. > > > However, such replacements only happens with well known transforms with > > > known properties anyway, so this can serve as an additional layer of > > > validation for a runner aware of the properties. > > > > > > This brings me back to my position: that the notion of conflicts is very > > > annotation dependant, and that defining them as optional is the most > > > important feature to avoid issues. Conflicts don't exist as an inherent > > > property of annotations on ptransform of the hypergraph structure. Am i > > > wrong? No one has come up with an actual example of a conflict as far as > > > i understand the thread. > > > > > > Even Reuven's original question is more about whether the runner is > > > forced to look at leaf bodes rather than only looking at the composite. > > > Assuming the composite isn't replaced, the runner needs to look at the > > > leaf nodes regardless. And as discussed above there's no generalized > > > semantics that fit for all kinds of annotations, once replacements are > > > also considered. > > > > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <[email protected]> wrote: > > >> > > >> +1 Nice to see there is finally interest on this. Annotations for > > >> PTransforms make total sense! > > >> > > >> The semantics should be strictly optional for runners and correct > > >> execution should not be affected by lack of support of any annotation. > > >> We should however keep the set of annotations small. > > >> > > >> > PTransforms are hierarchical - namely a PTransform contains other > > >> > PTransforms, and so on. Is the runner expected to resolve all > > >> > annotations down to leaf nodes? What happens if that results in > > >> > conflicting annotations? > > >> > > >> +1 to this question, This needs to be detailed. > > >> > > >> I am curious about how the end user APIs of this will look maybe in > > >> Java or Python, just an extra method to inject a Map or via Java > > >> annotations/Python decorators? > > >> > > >> We might prefer not to mix the concepts of annotations and > > >> environments because this will limit the scope of annotations. > > >> Annotations are different from environments because they serve a more > > >> general idea: to express an intention and it is up to the runner to > > >> choose the strategy to accomplish this, for example in the GPU > > >> assignation case it could be to rewrite resource allocation via > > >> Environments but it could also just delegate this to a resource > > >> manager which is what we could do for example for GPU (or data > > >> locality) cases on the Spark/Flink classic runners. If we tie this to > > >> environments we will leave classic runners out of the loop for no > > >> major reason and also not cover use cases not related to resource > > >> allocation. > > >> > > >> I do not understand the use case to justify PCollection annotations > > >> but to not mix this thread with them, would you be interested to > > >> elaborate more about them in a different thread Jan? > > >> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <[email protected]> > > >> wrote: > > >> > > > >> > I agree things like GPU, high-mem, etc. belong to the environment. If > > >> > annotations are truly advisory, one can imagine merging environments > > >> > by taking the union of annotations and still producing a correct > > >> > pipeline. (This would mean that annotations would have to be a > > >> > multi-map...) > > >> > > > >> > On the other hand, this doesn't seem to handle the case of privacy > > >> > analysis, which could apply to composites without applying to any > > >> > individual component, and don't really make sense as part of a > > >> > fusion/execution story. > > >> > > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <[email protected]> > > >> > wrote: > > >> > > > > >> > > That's good historical context. > > >> > > > > >> > > But then we'd still need to codify the annotation would need to be > > >> > > optional, and not affect correctness. > > >> > > > > >> > > Conflicts become easier to manage, (as environments with conflicting > > >> > > annotations simply don't get merged, and stay as distinct > > >> > > environments) but are still notionally annotation dependant. Do most > > >> > > runners handle environments so individually though? > > >> > > > > >> > > Reuven's question is a good one though. For the Go SDK, and the > > >> > > proposed implementation i saw, they only applied to leaf nodes. This > > >> > > is an artifact of how the Go SDK handles composites. Nothing stops > > >> > > it from being implemented on the composites Go has, but it didn't > > >> > > make sense otherwise. AFAICT Composites are generally for > > >> > > organizational convenience and not for functional aspects. Is this > > >> > > wrong? Afterall, does it make sense for environments to be on leaves > > >> > > and composites either? It's the same issue there. > > >> > > > > >> > > > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <[email protected]> > > >> > > wrote: > > >> > >> > > >> > >> I am +1 to the proposal but believe it should be moved to the > > >> > >> Environment. I could be convinced otherwise, but would want to > > >> > >> really understand the details. > > >> > >> > > >> > >> I think we haven't done a great job communicating the purpose of > > >> > >> the Environment proto. It was explicitly created for this purpose. > > >> > >> > > >> > >> 1. It tells the runner things it needs to know to interpret the > > >> > >> DoFn (or other UDF). So these are the existing proto fields like > > >> > >> docker image (in the payload) and required artifacts that were > > >> > >> staged. > > >> > >> 2. It is also the place for additional requirements or hints like > > >> > >> "high mem" or "GPU" etc. > > >> > >> > > >> > >> Every user function has an associated environment, and environments > > >> > >> exist only for the purpose of executing user functions. In fact, > > >> > >> Environment originated as inline requirements/attributes for a user > > >> > >> function proto message and was separated just to make the proto > > >> > >> smaller. > > >> > >> > > >> > >> A PTransform is an abstract concept for organizing the graph, not > > >> > >> an executable thing. So a hint/capability/requirement on a > > >> > >> PTransform only really makes sense as a scoping mechanism for > > >> > >> applying a hint to a bunch of functions within a subgraph. This > > >> > >> seems like a user interface concern and the SDK should own > > >> > >> propagating the hints. If the hint truly applies to the whole > > >> > >> PTransform and *not* the parts, then I am interested in learning > > >> > >> about that. > > >> > >> > > >> > >> Kenn > > >> > >> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <[email protected]> > > >> > >> wrote: > > >> > >>> > > >> > >>> That's a good question. > > >> > >>> > > >> > >>> I think the main difference is a matter of scope. Annotations > > >> > >>> would apply to a PTransform while an environment applies to sets > > >> > >>> of transforms. A difference is the optional nature of the > > >> > >>> annotations they don't affect correctness. Runners don't need to > > >> > >>> do anything with them and still execute the pipeline correctly. > > >> > >>> > > >> > >>> Consider a privacy analysis on a pipeline graph. An annotation > > >> > >>> indicating that a transform provides a certain level of > > >> > >>> anonymization can be used in an analysis to determine if the > > >> > >>> downstream transforms are encountering raw data or not. > > >> > >>> > > >> > >>> From my understanding (which can be wrong) environments are rigid. > > >> > >>> Transforms in different environments can't be fused. "This is the > > >> > >>> python env", "this is the java env" can't be merged together. It's > > >> > >>> not clear to me that we have defined when environments are safely > > >> > >>> fuseable outside of equality. There's value in that simplicity. > > >> > >>> > > >> > >>> AFIACT environment has less to do with the machines a pipeline is > > >> > >>> executing on than it does about the kinds of SDK pipelines it > > >> > >>> understands and can execute. > > >> > >>> > > >> > >>> > > >> > >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <[email protected]> > > >> > >>> wrote: > > >> > >>>>> > > >> > >>>>> > > >> > >>>>> Another example of an optional annotation is marking a transform > > >> > >>>>> to run on secure hardware, or to give hints to profiling/dynamic > > >> > >>>>> analysis tools. > > >> > >>>> > > >> > >>>> > > >> > >>>> There seems to be a lot of overlap between this idea and > > >> > >>>> Environments. Can you talk about how you feel they may be > > >> > >>>> different or related? For example, I could see annotations as a > > >> > >>>> way of tagging transforms with an Environment, or I could see > > >> > >>>> Environments becoming a specialized form of annotation. > > >> > >>>> > > >> > >>>> -chad > > >> > >>>> > >
