Re: PTransform Annotations Proposal

Robert Bradshaw Mon, 23 Nov 2020 15:04:36 -0800

On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <mir...@google.com> wrote:
>
> Thanks everyone so much for their input and for the insightful discussion.
>
> Not being knowledgeable about Beam's internals, I have to say I am a bit lost 
> on the PTransform vs. environment discussion.
>
> I do agree with Burke's notion that merge rules are very annotation 
> dependent, I don't think we can find a one-size-fits-all solution for that. 
> So this might be actually be an argument in favour of having annotations on 
> PTransforms, since it avoids the conflation with environments.
>
> Also in general, I feel that having annotations per single transform (rather 
> than composite) and on PTransforms could lead to a simpler design.


If we want to use these for privacy, I don't see how attaching them to
leaf transforms alone could work. (Even CombinePerKey is a composite.)

> Seeing as there are valuable arguments in favour of both (PTransform and 
> environments) with no clear(?) "best solution", I would propose moving 
> forward with the initial (PTransform) design to ship the feature and unblock 
> teams asking for it. If it turns out that there was indeed a need to have 
> annotations in environments, we could always refactor it.

I have yet to see any arguments that resource-level hints, such as
memory or GPU, don't better belong on the environment. But moving
forward on PTransform-level ones for logical statements like privacy
declarations makes sense.

> On 2020/11/17 19:07:22, Robert Bradshaw <rober...@google.com> wrote:
> > So far we have two distinct usecases for annotations: resource hints
> > and privacy directives, and I've been trying to figure out how to
> > reconcile them, but they seem to have very different characteristics.
> > (It would be nice to come up with other uses as well to see if we're
> > really coming up with a generally useful mode--I think display data
> > could fit into this as a new kind of annotation rather than being a
> > top-level property, and it could make sense on both leaf and composite
> > transforms.)
> >
> > To me, resource hints like GPU are inextricably tied to the
> > environment. A transform tagged with GPU should reference a Fn that
> > invokes GPU-accelerated code that lives in a particular environment.
> > Something like high-mem is a bit squishier. Some DoFns take a lot of
> > memory, but on the other hand one could imagine labeling a CoGBK as
> > high-mem due to knowing that, in this particular usage, there will be
> > lots of values with the same key. Ideally runners would be intelligent
> > enough to automatically learn memory usage, but even in this case it
> > may be a good hint to try and learn the requirements for DoFn A and
> > DoFn B separately (which is difficult if they are always colocated,
> > but valuable if, e.g. A takes a huge amount of memory and B takes a
> > huge amount of wall time).
> >
> > Note that tying things to the environment does not preclude using them
> > in non-portable runners as they'll still have an SDK-level
> > representation (though I don't think we should have an explicit goal
> > of feature parity for non-portable runners, e.g. multi-language isn't
> > happening, and hope that non-portable runners go away soon anyway).
> >
> > Now let's consider privacy annotations. To make things very concrete,
> > imagine a transform AverageSpendPerZipCode which takes as input (user,
> > zip, spend), all users unique, and returns (zip, avg(spend)). In
> > Python, this is GroupBy('zip').aggregate_field('spend',
> > MeanCombineFn()). This is not very privacy preserving to those users
> > who are the only (or one of a few) in a zip code. So we could define a
> > transform PrivacyPreservingAverageSpendPerZipCode as
> >
> > @ptransform_fn
> > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
> >     counts_per_zip = spend_per_user |
> > GroupBy('zip').aggregate_field('user', CountCombineFn())
> >     spend_per_zip = spend_per_user |
> > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> >     filtered = spend_per_zip | beam.Filter(
> >         lambda x, counts: counts[x.zip] > threshold,
> > counts=AsMap(counts_per_zip))
> >     return filtered
> >
> > We now have a composite that has privacy preserving properties (i.e.
> > the input may be quite sensitive, but the output is not, depending on
> > the value of threshold). What is interesting here is that it is only
> > the composite that has this property--no individual sub-transform is
> > itself privacy preserving. Furthermore, an optimizer may notice we're
> > doing aggregation on the same key twice and rewrite this using
> > (logically)
> >
> >     GroupBy('zip').aggregate_field('user',
> > CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> >
> > and then applying the filter, which is semantically equivalent and
> > satisfies the privacy annotations (and notably that does not even
> > require the optimizer to interpret the annotations, just pass them
> > on). To me, this implies that these annotations belong on the
> > composites, and *not* on the leaf nodes (where they would be
> > incorrect).
> >
> > I'll leave aside most questions of API until we figure out the model
> > semantics, but wanted to throw one possible idea out (though I am
> > ambivalent about it). Instead of attaching things to transforms, we
> > can just wrap transforms in composites that have no role other than
> > declaring information about their contents. E.g. we could have a
> > composite transform whose payload is simply an assertion of the
> > privacy (or resource?) properties of its inner structure. This would
> > be just as expressive as adding new properties to transforms
> > themselves (but would add an extra level of nesting, and make
> > respecting the precice nesting more important).
> >
> > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <rob...@frantil.com> wrote:
> > >
> > > +1 to discussing PCollection annotations on a separate thread. It would 
> > > be confusing to mix them up.
> > >
> > > -----------
> > >
> > > The question around conflicts is interesting, but confusing to me. I 
> > > don't think they exist in general. I keep coming back around to that it 
> > > depends on the annotation and the purpose of composites. Optionality 
> > > saves us here too.
> > >
> > > Composites are nothing without their internal hypergraph structure. 
> > > Eventually it comes down to executing the leaf nodes. The alternative to 
> > > executing the leaf nodes is when the composite represents a known 
> > > transform and is replaced by the runner on submission time.  Lets look at 
> > > each.
> > >
> > > If there's a property that only exists on the leaf nodes, then it's not 
> > > possible to bubble up that property to the composite in all cases. 
> > > Afterall, it's not necessarily the case that a privacy preserving 
> > > transform maintains the property for all output edges as not all such 
> > > edges pass through the preserving transform.
> > >
> > > On the other hand, with memory or gpu recommendations, that might set a 
> > > low bar on the composite level.
> > >
> > > But, composites (any transform really) can be runner replaced. I think 
> > > it's fair to say that a runner replaced composite is not beholden to the 
> > > annotations of the original leaf transforms, especially around physical 
> > > requirements. The implementations are different. If a known composite at 
> > > the composite level requires GPUs and it's known replacement doesn't, I'd 
> > > posit that replacement was a choice the runner made since it can't 
> > > provision machines with GPUs.
> > >
> > > But, crucially around privacy annotated transforms, a runner likely 
> > > shouldn't replace a given subgraph that contains a privacy annotationed 
> > > transform unless the replacements provide the same level of privacy. 
> > > However, such replacements only happens with well known transforms with 
> > > known properties anyway, so this can serve as an additional layer of 
> > > validation for a runner aware of the properties.
> > >
> > > This brings me back to my position: that the notion of conflicts is very 
> > > annotation dependant, and that defining them as optional is the most 
> > > important feature to avoid issues. Conflicts don't exist as an inherent 
> > > property of annotations on ptransform of the hypergraph structure. Am i 
> > > wrong? No one has come up with an actual example of a conflict as far as 
> > > i understand the thread.
> > >
> > > Even Reuven's original question is more about whether the runner is 
> > > forced to look at leaf bodes rather than only looking at the composite. 
> > > Assuming the composite isn't replaced, the runner needs to look at the 
> > > leaf nodes regardless. And as discussed above there's no generalized 
> > > semantics that fit for all kinds of annotations, once replacements are 
> > > also considered.
> > >
> > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> > >>
> > >> +1 Nice to see there is finally interest on this. Annotations for
> > >> PTransforms make total sense!
> > >>
> > >> The semantics should be strictly optional for runners and correct
> > >> execution should not be affected by lack of support of any annotation.
> > >> We should however keep the set of annotations small.
> > >>
> > >> > PTransforms are hierarchical - namely a PTransform contains other 
> > >> > PTransforms, and so on. Is the runner expected to resolve all 
> > >> > annotations down to leaf nodes? What happens if that results in 
> > >> > conflicting annotations?
> > >>
> > >> +1 to this question, This needs to be detailed.
> > >>
> > >> I am curious about how the end user APIs of this will look maybe in
> > >> Java or Python, just an extra method to inject a Map or via Java
> > >> annotations/Python decorators?
> > >>
> > >> We might prefer not to mix the concepts of annotations and
> > >> environments because this will limit the scope of annotations.
> > >> Annotations are different from environments because they serve a more
> > >> general idea: to express an intention and it is up to the runner to
> > >> choose the strategy to accomplish this, for example in the GPU
> > >> assignation case it could be to rewrite resource allocation via
> > >> Environments but it could also just delegate this to a resource
> > >> manager which is what we could do for example for GPU (or data
> > >> locality) cases on the Spark/Flink classic runners. If we tie this to
> > >> environments we will leave classic runners out of the loop for no
> > >> major reason and also not cover use cases not related to resource
> > >> allocation.
> > >>
> > >> I do not understand the use case to justify PCollection annotations
> > >> but to not mix this thread with them, would you be interested to
> > >> elaborate more about them in a different thread Jan?
> > >>
> > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <rober...@google.com> 
> > >> wrote:
> > >> >
> > >> > I agree things like GPU, high-mem, etc. belong to the environment. If
> > >> > annotations are truly advisory, one can imagine merging environments
> > >> > by taking the union of annotations and still producing a correct
> > >> > pipeline. (This would mean that annotations would have to be a
> > >> > multi-map...)
> > >> >
> > >> > On the other hand, this doesn't seem to handle the case of privacy
> > >> > analysis, which could apply to composites without applying to any
> > >> > individual component, and don't really make sense as part of a
> > >> > fusion/execution story.
> > >> >
> > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <rob...@frantil.com> 
> > >> > wrote:
> > >> > >
> > >> > > That's good historical context.
> > >> > >
> > >> > > But then we'd still need to codify the annotation would need to be 
> > >> > > optional, and not affect correctness.
> > >> > >
> > >> > > Conflicts become easier to manage, (as environments with conflicting 
> > >> > > annotations simply don't get merged, and stay as distinct 
> > >> > > environments) but are still notionally annotation dependant. Do most 
> > >> > > runners handle environments so individually though?
> > >> > >
> > >> > > Reuven's question is a good one though. For the Go SDK, and the 
> > >> > > proposed implementation i saw, they only applied to leaf nodes. This 
> > >> > > is an artifact of how the Go SDK handles composites. Nothing stops 
> > >> > > it from being implemented on the composites Go has, but it didn't 
> > >> > > make sense otherwise. AFAICT Composites are generally for 
> > >> > > organizational convenience and not for functional aspects. Is this 
> > >> > > wrong? Afterall, does it make sense for environments to be on leaves 
> > >> > > and composites either? It's the same issue there.
> > >> > >
> > >> > >
> > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <k...@apache.org> 
> > >> > > wrote:
> > >> > >>
> > >> > >> I am +1 to the proposal but believe it should be moved to the 
> > >> > >> Environment. I could be convinced otherwise, but would want to 
> > >> > >> really understand the details.
> > >> > >>
> > >> > >> I think we haven't done a great job communicating the purpose of 
> > >> > >> the Environment proto. It was explicitly created for this purpose.
> > >> > >>
> > >> > >> 1. It tells the runner things it needs to know to interpret the 
> > >> > >> DoFn (or other UDF). So these are the existing proto fields like 
> > >> > >> docker image (in the payload) and required artifacts that were 
> > >> > >> staged.
> > >> > >> 2. It is also the place for additional requirements or hints like 
> > >> > >> "high mem" or "GPU" etc.
> > >> > >>
> > >> > >> Every user function has an associated environment, and environments 
> > >> > >> exist only for the purpose of executing user functions. In fact, 
> > >> > >> Environment originated as inline requirements/attributes for a user 
> > >> > >> function proto message and was separated just to make the proto 
> > >> > >> smaller.
> > >> > >>
> > >> > >> A PTransform is an abstract concept for organizing the graph, not 
> > >> > >> an executable thing. So a hint/capability/requirement on a 
> > >> > >> PTransform only really makes sense as a scoping mechanism for 
> > >> > >> applying a hint to a bunch of functions within a subgraph. This 
> > >> > >> seems like a user interface concern and the SDK should own 
> > >> > >> propagating the hints. If the hint truly applies to the whole 
> > >> > >> PTransform and *not* the parts, then I am interested in learning 
> > >> > >> about that.
> > >> > >>
> > >> > >> Kenn
> > >> > >>
> > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com> 
> > >> > >> wrote:
> > >> > >>>
> > >> > >>> That's a good question.
> > >> > >>>
> > >> > >>> I think the main difference is a matter of scope. Annotations 
> > >> > >>> would apply to a PTransform while an environment applies to sets 
> > >> > >>> of transforms. A difference is the optional nature of the 
> > >> > >>> annotations they don't affect correctness. Runners don't need to 
> > >> > >>> do anything with them and still execute the pipeline correctly.
> > >> > >>>
> > >> > >>> Consider a privacy analysis on a pipeline graph. An annotation 
> > >> > >>> indicating that a transform provides a certain level of 
> > >> > >>> anonymization can be used in an analysis to determine if the 
> > >> > >>> downstream transforms are encountering raw data or not.
> > >> > >>>
> > >> > >>> From my understanding (which can be wrong) environments are rigid. 
> > >> > >>> Transforms in different environments can't be fused. "This is the 
> > >> > >>> python env", "this is the java env" can't be merged together. It's 
> > >> > >>> not clear to me that we have defined when environments are safely 
> > >> > >>> fuseable outside of equality. There's value in that simplicity.
> > >> > >>>
> > >> > >>> AFIACT environment has less to do with the machines a pipeline is 
> > >> > >>> executing on than it does about the kinds of SDK pipelines it 
> > >> > >>> understands and can execute.
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <chad...@gmail.com> 
> > >> > >>> wrote:
> > >> > >>>>>
> > >> > >>>>>
> > >> > >>>>> Another example of an optional annotation is marking a transform 
> > >> > >>>>> to run on secure hardware, or to give hints to profiling/dynamic 
> > >> > >>>>> analysis tools.
> > >> > >>>>
> > >> > >>>>
> > >> > >>>> There seems to be a lot of overlap between this idea and 
> > >> > >>>> Environments.  Can you talk about how you feel they may be 
> > >> > >>>> different or related?  For example, I could see annotations as a 
> > >> > >>>> way of tagging transforms with an Environment, or I could see 
> > >> > >>>> Environments becoming a specialized form of annotation.
> > >> > >>>>
> > >> > >>>> -chad
> > >> > >>>>
> >

Re: PTransform Annotations Proposal

Reply via email to