Re: PTransform Annotations Proposal

Mirac Vuslat Basaran Fri, 20 Nov 2020 15:09:05 -0800

Thanks everyone so much for their input and for the insightful discussion.

Not being knowledgeable about Beam's internals, I have to say I am a bit lost 
on the PTransform vs. environment discussion.


I do agree with Burke's notion that merge rules are very annotation dependent, 
I don't think we can find a one-size-fits-all solution for that. So this might 
be actually be an argument in favour of having annotations on PTransforms, 
since it avoids the conflation with environments.

Also in general, I feel that having annotations per single transform (rather 
than composite) and on PTransforms could lead to a simpler design.

Seeing as there are valuable arguments in favour of both (PTransform and 
environments) with no clear(?) "best solution", I would propose moving forward 
with the initial (PTransform) design to ship the feature and unblock teams 
asking for it. If it turns out that there was indeed a need to have annotations 
in environments, we could always refactor it.

On 2020/11/17 19:07:22, Robert Bradshaw <rober...@google.com> wrote: 
> So far we have two distinct usecases for annotations: resource hints
> and privacy directives, and I've been trying to figure out how to
> reconcile them, but they seem to have very different characteristics.
> (It would be nice to come up with other uses as well to see if we're
> really coming up with a generally useful mode--I think display data
> could fit into this as a new kind of annotation rather than being a
> top-level property, and it could make sense on both leaf and composite
> transforms.)
> 
> To me, resource hints like GPU are inextricably tied to the
> environment. A transform tagged with GPU should reference a Fn that
> invokes GPU-accelerated code that lives in a particular environment.
> Something like high-mem is a bit squishier. Some DoFns take a lot of
> memory, but on the other hand one could imagine labeling a CoGBK as
> high-mem due to knowing that, in this particular usage, there will be
> lots of values with the same key. Ideally runners would be intelligent
> enough to automatically learn memory usage, but even in this case it
> may be a good hint to try and learn the requirements for DoFn A and
> DoFn B separately (which is difficult if they are always colocated,
> but valuable if, e.g. A takes a huge amount of memory and B takes a
> huge amount of wall time).
> 
> Note that tying things to the environment does not preclude using them
> in non-portable runners as they'll still have an SDK-level
> representation (though I don't think we should have an explicit goal
> of feature parity for non-portable runners, e.g. multi-language isn't
> happening, and hope that non-portable runners go away soon anyway).
> 
> Now let's consider privacy annotations. To make things very concrete,
> imagine a transform AverageSpendPerZipCode which takes as input (user,
> zip, spend), all users unique, and returns (zip, avg(spend)). In
> Python, this is GroupBy('zip').aggregate_field('spend',
> MeanCombineFn()). This is not very privacy preserving to those users
> who are the only (or one of a few) in a zip code. So we could define a
> transform PrivacyPreservingAverageSpendPerZipCode as
> 
> @ptransform_fn
> def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
>     counts_per_zip = spend_per_user |
> GroupBy('zip').aggregate_field('user', CountCombineFn())
>     spend_per_zip = spend_per_user |
> GroupBy('zip').aggregate_field('spend', MeanCombineFn())
>     filtered = spend_per_zip | beam.Filter(
>         lambda x, counts: counts[x.zip] > threshold,
> counts=AsMap(counts_per_zip))
>     return filtered
> 
> We now have a composite that has privacy preserving properties (i.e.
> the input may be quite sensitive, but the output is not, depending on
> the value of threshold). What is interesting here is that it is only
> the composite that has this property--no individual sub-transform is
> itself privacy preserving. Furthermore, an optimizer may notice we're
> doing aggregation on the same key twice and rewrite this using
> (logically)
> 
>     GroupBy('zip').aggregate_field('user',
> CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> 
> and then applying the filter, which is semantically equivalent and
> satisfies the privacy annotations (and notably that does not even
> require the optimizer to interpret the annotations, just pass them
> on). To me, this implies that these annotations belong on the
> composites, and *not* on the leaf nodes (where they would be
> incorrect).
> 
> I'll leave aside most questions of API until we figure out the model
> semantics, but wanted to throw one possible idea out (though I am
> ambivalent about it). Instead of attaching things to transforms, we
> can just wrap transforms in composites that have no role other than
> declaring information about their contents. E.g. we could have a
> composite transform whose payload is simply an assertion of the
> privacy (or resource?) properties of its inner structure. This would
> be just as expressive as adding new properties to transforms
> themselves (but would add an extra level of nesting, and make
> respecting the precice nesting more important).
> 
> On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <rob...@frantil.com> wrote:
> >
> > +1 to discussing PCollection annotations on a separate thread. It would be 
> > confusing to mix them up.
> >
> > -----------
> >
> > The question around conflicts is interesting, but confusing to me. I don't 
> > think they exist in general. I keep coming back around to that it depends 
> > on the annotation and the purpose of composites. Optionality saves us here 
> > too.
> >
> > Composites are nothing without their internal hypergraph structure. 
> > Eventually it comes down to executing the leaf nodes. The alternative to 
> > executing the leaf nodes is when the composite represents a known transform 
> > and is replaced by the runner on submission time.  Lets look at each.
> >
> > If there's a property that only exists on the leaf nodes, then it's not 
> > possible to bubble up that property to the composite in all cases. 
> > Afterall, it's not necessarily the case that a privacy preserving transform 
> > maintains the property for all output edges as not all such edges pass 
> > through the preserving transform.
> >
> > On the other hand, with memory or gpu recommendations, that might set a low 
> > bar on the composite level.
> >
> > But, composites (any transform really) can be runner replaced. I think it's 
> > fair to say that a runner replaced composite is not beholden to the 
> > annotations of the original leaf transforms, especially around physical 
> > requirements. The implementations are different. If a known composite at 
> > the composite level requires GPUs and it's known replacement doesn't, I'd 
> > posit that replacement was a choice the runner made since it can't 
> > provision machines with GPUs.
> >
> > But, crucially around privacy annotated transforms, a runner likely 
> > shouldn't replace a given subgraph that contains a privacy annotationed 
> > transform unless the replacements provide the same level of privacy. 
> > However, such replacements only happens with well known transforms with 
> > known properties anyway, so this can serve as an additional layer of 
> > validation for a runner aware of the properties.
> >
> > This brings me back to my position: that the notion of conflicts is very 
> > annotation dependant, and that defining them as optional is the most 
> > important feature to avoid issues. Conflicts don't exist as an inherent 
> > property of annotations on ptransform of the hypergraph structure. Am i 
> > wrong? No one has come up with an actual example of a conflict as far as i 
> > understand the thread.
> >
> > Even Reuven's original question is more about whether the runner is forced 
> > to look at leaf bodes rather than only looking at the composite. Assuming 
> > the composite isn't replaced, the runner needs to look at the leaf nodes 
> > regardless. And as discussed above there's no generalized semantics that 
> > fit for all kinds of annotations, once replacements are also considered.
> >
> > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> >>
> >> +1 Nice to see there is finally interest on this. Annotations for
> >> PTransforms make total sense!
> >>
> >> The semantics should be strictly optional for runners and correct
> >> execution should not be affected by lack of support of any annotation.
> >> We should however keep the set of annotations small.
> >>
> >> > PTransforms are hierarchical - namely a PTransform contains other 
> >> > PTransforms, and so on. Is the runner expected to resolve all 
> >> > annotations down to leaf nodes? What happens if that results in 
> >> > conflicting annotations?
> >>
> >> +1 to this question, This needs to be detailed.
> >>
> >> I am curious about how the end user APIs of this will look maybe in
> >> Java or Python, just an extra method to inject a Map or via Java
> >> annotations/Python decorators?
> >>
> >> We might prefer not to mix the concepts of annotations and
> >> environments because this will limit the scope of annotations.
> >> Annotations are different from environments because they serve a more
> >> general idea: to express an intention and it is up to the runner to
> >> choose the strategy to accomplish this, for example in the GPU
> >> assignation case it could be to rewrite resource allocation via
> >> Environments but it could also just delegate this to a resource
> >> manager which is what we could do for example for GPU (or data
> >> locality) cases on the Spark/Flink classic runners. If we tie this to
> >> environments we will leave classic runners out of the loop for no
> >> major reason and also not cover use cases not related to resource
> >> allocation.
> >>
> >> I do not understand the use case to justify PCollection annotations
> >> but to not mix this thread with them, would you be interested to
> >> elaborate more about them in a different thread Jan?
> >>
> >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <rober...@google.com> 
> >> wrote:
> >> >
> >> > I agree things like GPU, high-mem, etc. belong to the environment. If
> >> > annotations are truly advisory, one can imagine merging environments
> >> > by taking the union of annotations and still producing a correct
> >> > pipeline. (This would mean that annotations would have to be a
> >> > multi-map...)
> >> >
> >> > On the other hand, this doesn't seem to handle the case of privacy
> >> > analysis, which could apply to composites without applying to any
> >> > individual component, and don't really make sense as part of a
> >> > fusion/execution story.
> >> >
> >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <rob...@frantil.com> wrote:
> >> > >
> >> > > That's good historical context.
> >> > >
> >> > > But then we'd still need to codify the annotation would need to be 
> >> > > optional, and not affect correctness.
> >> > >
> >> > > Conflicts become easier to manage, (as environments with conflicting 
> >> > > annotations simply don't get merged, and stay as distinct 
> >> > > environments) but are still notionally annotation dependant. Do most 
> >> > > runners handle environments so individually though?
> >> > >
> >> > > Reuven's question is a good one though. For the Go SDK, and the 
> >> > > proposed implementation i saw, they only applied to leaf nodes. This 
> >> > > is an artifact of how the Go SDK handles composites. Nothing stops it 
> >> > > from being implemented on the composites Go has, but it didn't make 
> >> > > sense otherwise. AFAICT Composites are generally for organizational 
> >> > > convenience and not for functional aspects. Is this wrong? Afterall, 
> >> > > does it make sense for environments to be on leaves and composites 
> >> > > either? It's the same issue there.
> >> > >
> >> > >
> >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <k...@apache.org> wrote:
> >> > >>
> >> > >> I am +1 to the proposal but believe it should be moved to the 
> >> > >> Environment. I could be convinced otherwise, but would want to really 
> >> > >> understand the details.
> >> > >>
> >> > >> I think we haven't done a great job communicating the purpose of the 
> >> > >> Environment proto. It was explicitly created for this purpose.
> >> > >>
> >> > >> 1. It tells the runner things it needs to know to interpret the DoFn 
> >> > >> (or other UDF). So these are the existing proto fields like docker 
> >> > >> image (in the payload) and required artifacts that were staged.
> >> > >> 2. It is also the place for additional requirements or hints like 
> >> > >> "high mem" or "GPU" etc.
> >> > >>
> >> > >> Every user function has an associated environment, and environments 
> >> > >> exist only for the purpose of executing user functions. In fact, 
> >> > >> Environment originated as inline requirements/attributes for a user 
> >> > >> function proto message and was separated just to make the proto 
> >> > >> smaller.
> >> > >>
> >> > >> A PTransform is an abstract concept for organizing the graph, not an 
> >> > >> executable thing. So a hint/capability/requirement on a PTransform 
> >> > >> only really makes sense as a scoping mechanism for applying a hint to 
> >> > >> a bunch of functions within a subgraph. This seems like a user 
> >> > >> interface concern and the SDK should own propagating the hints. If 
> >> > >> the hint truly applies to the whole PTransform and *not* the parts, 
> >> > >> then I am interested in learning about that.
> >> > >>
> >> > >> Kenn
> >> > >>
> >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com> 
> >> > >> wrote:
> >> > >>>
> >> > >>> That's a good question.
> >> > >>>
> >> > >>> I think the main difference is a matter of scope. Annotations would 
> >> > >>> apply to a PTransform while an environment applies to sets of 
> >> > >>> transforms. A difference is the optional nature of the annotations 
> >> > >>> they don't affect correctness. Runners don't need to do anything 
> >> > >>> with them and still execute the pipeline correctly.
> >> > >>>
> >> > >>> Consider a privacy analysis on a pipeline graph. An annotation 
> >> > >>> indicating that a transform provides a certain level of 
> >> > >>> anonymization can be used in an analysis to determine if the 
> >> > >>> downstream transforms are encountering raw data or not.
> >> > >>>
> >> > >>> From my understanding (which can be wrong) environments are rigid. 
> >> > >>> Transforms in different environments can't be fused. "This is the 
> >> > >>> python env", "this is the java env" can't be merged together. It's 
> >> > >>> not clear to me that we have defined when environments are safely 
> >> > >>> fuseable outside of equality. There's value in that simplicity.
> >> > >>>
> >> > >>> AFIACT environment has less to do with the machines a pipeline is 
> >> > >>> executing on than it does about the kinds of SDK pipelines it 
> >> > >>> understands and can execute.
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <chad...@gmail.com> 
> >> > >>> wrote:
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Another example of an optional annotation is marking a transform 
> >> > >>>>> to run on secure hardware, or to give hints to profiling/dynamic 
> >> > >>>>> analysis tools.
> >> > >>>>
> >> > >>>>
> >> > >>>> There seems to be a lot of overlap between this idea and 
> >> > >>>> Environments.  Can you talk about how you feel they may be 
> >> > >>>> different or related?  For example, I could see annotations as a 
> >> > >>>> way of tagging transforms with an Environment, or I could see 
> >> > >>>> Environments becoming a specialized form of annotation.
> >> > >>>>
> >> > >>>> -chad
> >> > >>>>
>

Re: PTransform Annotations Proposal

Reply via email to