Re: PTransform Annotations Proposal

Robert Bradshaw Wed, 25 Nov 2020 10:42:35 -0800

I'd like to continue the discussion *and* see an implementation for
the part we've settled on. I was asking why not have "every distinct
physical concern means a distinct environment?"


On Wed, Nov 25, 2020 at 10:38 AM Robert Burke <[email protected]> wrote:
>
> Mostly because perfect is the enemy of good enough. We have a proposal, we 
> have clear boundaries for it. It's fine if the discussion continues, but I 
> see no evidence of concerns that should prevent starting an implementation, 
> because it seems we'll need both anyway.
>
> On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <[email protected]> wrote:
>>
>> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <[email protected]> wrote:
>> >
>> > It sounds like we've come to the position that non-correctness affecting 
>> > Ptransform Annotations are valuable at both leaf and composite levels, and 
>> > don't remove the potential need for a similar feature on Environments, to 
>> > handle physical concerns equirements for worker processes to have (such as 
>> > Ram, CPU, or GPU requirements.)
>> >
>> > Kenn, it's not clear what part of the solution (an annotation field on the 
>> > Ptransform proto message) would need to change to satisfy your scope 
>> > concern, beyond documenting unambiguously that these may not be used for 
>> > physical concerns or things that affect correctness.
>>
>> I'll let Kenn answer as well, but from my point of view, explicitly
>> having somewhere better to put these things would help.
>>
>> > I'm also unclear your scope concern not matching, given the above. Your 
>> > first paragraph reads very supportive of logical annotations on 
>> > Ptransforms, and that matches 1-1 with the current proposed solution. Can 
>> > you clarify your concern?
>> >
>> > I don't wish to scope creep on the physical requirements issue at this 
>> > time. It seems we are agreed they should end up on environments, but I'm 
>> > not seeing proposals on the right way to execute them at this time.They 
>> > seem to be a fruitful topic of discussion, in particular 
>> > unifying/consolidating them for efficient use of resources. I don't think 
>> > we want to end up in a state where every distinct physical concern means a 
>> > distinct environment.
>>
>> Why not? Assuming, of course, that runners are free to merge
>> environments (merging those resource hints they understand and are
>> otherwise compatible, and discarding those they don't) for efficient
>> execution.
>>
>> > I for one am ready to see a PR.
>>
>> +1
>>
>> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <[email protected]> wrote:
>> >>
>> >>
>> >>
>> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <[email protected]> 
>> >> wrote:
>> >>>
>> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran 
>> >>> <[email protected]> wrote:
>> >>> >
>> >>> > Thanks everyone so much for their input and for the insightful 
>> >>> > discussion.
>> >>> >
>> >>> > Not being knowledgeable about Beam's internals, I have to say I am a 
>> >>> > bit lost on the PTransform vs. environment discussion.
>> >>> >
>> >>> > I do agree with Burke's notion that merge rules are very annotation 
>> >>> > dependent, I don't think we can find a one-size-fits-all solution for 
>> >>> > that. So this might be actually be an argument in favour of having 
>> >>> > annotations on PTransforms, since it avoids the conflation with 
>> >>> > environments.
>> >>> >
>> >>> > Also in general, I feel that having annotations per single transform 
>> >>> > (rather than composite) and on PTransforms could lead to a simpler 
>> >>> > design.
>> >>>
>> >>> If we want to use these for privacy, I don't see how attaching them to
>> >>> leaf transforms alone could work. (Even CombinePerKey is a composite.)
>> >>>
>> >>> > Seeing as there are valuable arguments in favour of both (PTransform 
>> >>> > and environments) with no clear(?) "best solution", I would propose 
>> >>> > moving forward with the initial (PTransform) design to ship the 
>> >>> > feature and unblock teams asking for it. If it turns out that there 
>> >>> > was indeed a need to have annotations in environments, we could always 
>> >>> > refactor it.
>> >>>
>> >>> I have yet to see any arguments that resource-level hints, such as
>> >>> memory or GPU, don't better belong on the environment. But moving
>> >>> forward on PTransform-level ones for logical statements like privacy
>> >>> declarations makes sense.
>> >>
>> >>
>> >> Exactly this. Properties of transforms make sense. The properties may 
>> >> hold only of the whole subgraph. Even something as simple as "preserves 
>> >> keys". This is analogous (but converse) to requirements like "requires 
>> >> sorted input" which were explicitly excluded from the scope, which was 
>> >> about hardware environment for execution.
>> >>
>> >> The proposed scope and the proposed solution do not match and need to be 
>> >> reconciled.
>> >>
>> >> Kenn
>> >>
>> >>> > On 2020/11/17 19:07:22, Robert Bradshaw <[email protected]> wrote:
>> >>> > > So far we have two distinct usecases for annotations: resource hints
>> >>> > > and privacy directives, and I've been trying to figure out how to
>> >>> > > reconcile them, but they seem to have very different characteristics.
>> >>> > > (It would be nice to come up with other uses as well to see if we're
>> >>> > > really coming up with a generally useful mode--I think display data
>> >>> > > could fit into this as a new kind of annotation rather than being a
>> >>> > > top-level property, and it could make sense on both leaf and 
>> >>> > > composite
>> >>> > > transforms.)
>> >>> > >
>> >>> > > To me, resource hints like GPU are inextricably tied to the
>> >>> > > environment. A transform tagged with GPU should reference a Fn that
>> >>> > > invokes GPU-accelerated code that lives in a particular environment.
>> >>> > > Something like high-mem is a bit squishier. Some DoFns take a lot of
>> >>> > > memory, but on the other hand one could imagine labeling a CoGBK as
>> >>> > > high-mem due to knowing that, in this particular usage, there will be
>> >>> > > lots of values with the same key. Ideally runners would be 
>> >>> > > intelligent
>> >>> > > enough to automatically learn memory usage, but even in this case it
>> >>> > > may be a good hint to try and learn the requirements for DoFn A and
>> >>> > > DoFn B separately (which is difficult if they are always colocated,
>> >>> > > but valuable if, e.g. A takes a huge amount of memory and B takes a
>> >>> > > huge amount of wall time).
>> >>> > >
>> >>> > > Note that tying things to the environment does not preclude using 
>> >>> > > them
>> >>> > > in non-portable runners as they'll still have an SDK-level
>> >>> > > representation (though I don't think we should have an explicit goal
>> >>> > > of feature parity for non-portable runners, e.g. multi-language isn't
>> >>> > > happening, and hope that non-portable runners go away soon anyway).
>> >>> > >
>> >>> > > Now let's consider privacy annotations. To make things very concrete,
>> >>> > > imagine a transform AverageSpendPerZipCode which takes as input 
>> >>> > > (user,
>> >>> > > zip, spend), all users unique, and returns (zip, avg(spend)). In
>> >>> > > Python, this is GroupBy('zip').aggregate_field('spend',
>> >>> > > MeanCombineFn()). This is not very privacy preserving to those users
>> >>> > > who are the only (or one of a few) in a zip code. So we could define 
>> >>> > > a
>> >>> > > transform PrivacyPreservingAverageSpendPerZipCode as
>> >>> > >
>> >>> > > @ptransform_fn
>> >>> > > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, 
>> >>> > > threshold)
>> >>> > >     counts_per_zip = spend_per_user |
>> >>> > > GroupBy('zip').aggregate_field('user', CountCombineFn())
>> >>> > >     spend_per_zip = spend_per_user |
>> >>> > > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
>> >>> > >     filtered = spend_per_zip | beam.Filter(
>> >>> > >         lambda x, counts: counts[x.zip] > threshold,
>> >>> > > counts=AsMap(counts_per_zip))
>> >>> > >     return filtered
>> >>> > >
>> >>> > > We now have a composite that has privacy preserving properties (i.e.
>> >>> > > the input may be quite sensitive, but the output is not, depending on
>> >>> > > the value of threshold). What is interesting here is that it is only
>> >>> > > the composite that has this property--no individual sub-transform is
>> >>> > > itself privacy preserving. Furthermore, an optimizer may notice we're
>> >>> > > doing aggregation on the same key twice and rewrite this using
>> >>> > > (logically)
>> >>> > >
>> >>> > >     GroupBy('zip').aggregate_field('user',
>> >>> > > CountCombineFn()).aggregate_field('spend', MeanCombineFn())
>> >>> > >
>> >>> > > and then applying the filter, which is semantically equivalent and
>> >>> > > satisfies the privacy annotations (and notably that does not even
>> >>> > > require the optimizer to interpret the annotations, just pass them
>> >>> > > on). To me, this implies that these annotations belong on the
>> >>> > > composites, and *not* on the leaf nodes (where they would be
>> >>> > > incorrect).
>> >>> > >
>> >>> > > I'll leave aside most questions of API until we figure out the model
>> >>> > > semantics, but wanted to throw one possible idea out (though I am
>> >>> > > ambivalent about it). Instead of attaching things to transforms, we
>> >>> > > can just wrap transforms in composites that have no role other than
>> >>> > > declaring information about their contents. E.g. we could have a
>> >>> > > composite transform whose payload is simply an assertion of the
>> >>> > > privacy (or resource?) properties of its inner structure. This would
>> >>> > > be just as expressive as adding new properties to transforms
>> >>> > > themselves (but would add an extra level of nesting, and make
>> >>> > > respecting the precice nesting more important).
>> >>> > >
>> >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <[email protected]> 
>> >>> > > wrote:
>> >>> > > >
>> >>> > > > +1 to discussing PCollection annotations on a separate thread. It 
>> >>> > > > would be confusing to mix them up.
>> >>> > > >
>> >>> > > > -----------
>> >>> > > >
>> >>> > > > The question around conflicts is interesting, but confusing to me. 
>> >>> > > > I don't think they exist in general. I keep coming back around to 
>> >>> > > > that it depends on the annotation and the purpose of composites. 
>> >>> > > > Optionality saves us here too.
>> >>> > > >
>> >>> > > > Composites are nothing without their internal hypergraph 
>> >>> > > > structure. Eventually it comes down to executing the leaf nodes. 
>> >>> > > > The alternative to executing the leaf nodes is when the composite 
>> >>> > > > represents a known transform and is replaced by the runner on 
>> >>> > > > submission time.  Lets look at each.
>> >>> > > >
>> >>> > > > If there's a property that only exists on the leaf nodes, then 
>> >>> > > > it's not possible to bubble up that property to the composite in 
>> >>> > > > all cases. Afterall, it's not necessarily the case that a privacy 
>> >>> > > > preserving transform maintains the property for all output edges 
>> >>> > > > as not all such edges pass through the preserving transform.
>> >>> > > >
>> >>> > > > On the other hand, with memory or gpu recommendations, that might 
>> >>> > > > set a low bar on the composite level.
>> >>> > > >
>> >>> > > > But, composites (any transform really) can be runner replaced. I 
>> >>> > > > think it's fair to say that a runner replaced composite is not 
>> >>> > > > beholden to the annotations of the original leaf transforms, 
>> >>> > > > especially around physical requirements. The implementations are 
>> >>> > > > different. If a known composite at the composite level requires 
>> >>> > > > GPUs and it's known replacement doesn't, I'd posit that 
>> >>> > > > replacement was a choice the runner made since it can't provision 
>> >>> > > > machines with GPUs.
>> >>> > > >
>> >>> > > > But, crucially around privacy annotated transforms, a runner 
>> >>> > > > likely shouldn't replace a given subgraph that contains a privacy 
>> >>> > > > annotationed transform unless the replacements provide the same 
>> >>> > > > level of privacy. However, such replacements only happens with 
>> >>> > > > well known transforms with known properties anyway, so this can 
>> >>> > > > serve as an additional layer of validation for a runner aware of 
>> >>> > > > the properties.
>> >>> > > >
>> >>> > > > This brings me back to my position: that the notion of conflicts 
>> >>> > > > is very annotation dependant, and that defining them as optional 
>> >>> > > > is the most important feature to avoid issues. Conflicts don't 
>> >>> > > > exist as an inherent property of annotations on ptransform of the 
>> >>> > > > hypergraph structure. Am i wrong? No one has come up with an 
>> >>> > > > actual example of a conflict as far as i understand the thread.
>> >>> > > >
>> >>> > > > Even Reuven's original question is more about whether the runner 
>> >>> > > > is forced to look at leaf bodes rather than only looking at the 
>> >>> > > > composite. Assuming the composite isn't replaced, the runner needs 
>> >>> > > > to look at the leaf nodes regardless. And as discussed above 
>> >>> > > > there's no generalized semantics that fit for all kinds of 
>> >>> > > > annotations, once replacements are also considered.
>> >>> > > >
>> >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <[email protected]> 
>> >>> > > > wrote:
>> >>> > > >>
>> >>> > > >> +1 Nice to see there is finally interest on this. Annotations for
>> >>> > > >> PTransforms make total sense!
>> >>> > > >>
>> >>> > > >> The semantics should be strictly optional for runners and correct
>> >>> > > >> execution should not be affected by lack of support of any 
>> >>> > > >> annotation.
>> >>> > > >> We should however keep the set of annotations small.
>> >>> > > >>
>> >>> > > >> > PTransforms are hierarchical - namely a PTransform contains 
>> >>> > > >> > other PTransforms, and so on. Is the runner expected to resolve 
>> >>> > > >> > all annotations down to leaf nodes? What happens if that 
>> >>> > > >> > results in conflicting annotations?
>> >>> > > >>
>> >>> > > >> +1 to this question, This needs to be detailed.
>> >>> > > >>
>> >>> > > >> I am curious about how the end user APIs of this will look maybe 
>> >>> > > >> in
>> >>> > > >> Java or Python, just an extra method to inject a Map or via Java
>> >>> > > >> annotations/Python decorators?
>> >>> > > >>
>> >>> > > >> We might prefer not to mix the concepts of annotations and
>> >>> > > >> environments because this will limit the scope of annotations.
>> >>> > > >> Annotations are different from environments because they serve a 
>> >>> > > >> more
>> >>> > > >> general idea: to express an intention and it is up to the runner 
>> >>> > > >> to
>> >>> > > >> choose the strategy to accomplish this, for example in the GPU
>> >>> > > >> assignation case it could be to rewrite resource allocation via
>> >>> > > >> Environments but it could also just delegate this to a resource
>> >>> > > >> manager which is what we could do for example for GPU (or data
>> >>> > > >> locality) cases on the Spark/Flink classic runners. If we tie 
>> >>> > > >> this to
>> >>> > > >> environments we will leave classic runners out of the loop for no
>> >>> > > >> major reason and also not cover use cases not related to resource
>> >>> > > >> allocation.
>> >>> > > >>
>> >>> > > >> I do not understand the use case to justify PCollection 
>> >>> > > >> annotations
>> >>> > > >> but to not mix this thread with them, would you be interested to
>> >>> > > >> elaborate more about them in a different thread Jan?
>> >>> > > >>
>> >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw 
>> >>> > > >> <[email protected]> wrote:
>> >>> > > >> >
>> >>> > > >> > I agree things like GPU, high-mem, etc. belong to the 
>> >>> > > >> > environment. If
>> >>> > > >> > annotations are truly advisory, one can imagine merging 
>> >>> > > >> > environments
>> >>> > > >> > by taking the union of annotations and still producing a correct
>> >>> > > >> > pipeline. (This would mean that annotations would have to be a
>> >>> > > >> > multi-map...)
>> >>> > > >> >
>> >>> > > >> > On the other hand, this doesn't seem to handle the case of 
>> >>> > > >> > privacy
>> >>> > > >> > analysis, which could apply to composites without applying to 
>> >>> > > >> > any
>> >>> > > >> > individual component, and don't really make sense as part of a
>> >>> > > >> > fusion/execution story.
>> >>> > > >> >
>> >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke 
>> >>> > > >> > <[email protected]> wrote:
>> >>> > > >> > >
>> >>> > > >> > > That's good historical context.
>> >>> > > >> > >
>> >>> > > >> > > But then we'd still need to codify the annotation would need 
>> >>> > > >> > > to be optional, and not affect correctness.
>> >>> > > >> > >
>> >>> > > >> > > Conflicts become easier to manage, (as environments with 
>> >>> > > >> > > conflicting annotations simply don't get merged, and stay as 
>> >>> > > >> > > distinct environments) but are still notionally annotation 
>> >>> > > >> > > dependant. Do most runners handle environments so 
>> >>> > > >> > > individually though?
>> >>> > > >> > >
>> >>> > > >> > > Reuven's question is a good one though. For the Go SDK, and 
>> >>> > > >> > > the proposed implementation i saw, they only applied to leaf 
>> >>> > > >> > > nodes. This is an artifact of how the Go SDK handles 
>> >>> > > >> > > composites. Nothing stops it from being implemented on the 
>> >>> > > >> > > composites Go has, but it didn't make sense otherwise. AFAICT 
>> >>> > > >> > > Composites are generally for organizational convenience and 
>> >>> > > >> > > not for functional aspects. Is this wrong? Afterall, does it 
>> >>> > > >> > > make sense for environments to be on leaves and composites 
>> >>> > > >> > > either? It's the same issue there.
>> >>> > > >> > >
>> >>> > > >> > >
>> >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles 
>> >>> > > >> > > <[email protected]> wrote:
>> >>> > > >> > >>
>> >>> > > >> > >> I am +1 to the proposal but believe it should be moved to 
>> >>> > > >> > >> the Environment. I could be convinced otherwise, but would 
>> >>> > > >> > >> want to really understand the details.
>> >>> > > >> > >>
>> >>> > > >> > >> I think we haven't done a great job communicating the 
>> >>> > > >> > >> purpose of the Environment proto. It was explicitly created 
>> >>> > > >> > >> for this purpose.
>> >>> > > >> > >>
>> >>> > > >> > >> 1. It tells the runner things it needs to know to interpret 
>> >>> > > >> > >> the DoFn (or other UDF). So these are the existing proto 
>> >>> > > >> > >> fields like docker image (in the payload) and required 
>> >>> > > >> > >> artifacts that were staged.
>> >>> > > >> > >> 2. It is also the place for additional requirements or hints 
>> >>> > > >> > >> like "high mem" or "GPU" etc.
>> >>> > > >> > >>
>> >>> > > >> > >> Every user function has an associated environment, and 
>> >>> > > >> > >> environments exist only for the purpose of executing user 
>> >>> > > >> > >> functions. In fact, Environment originated as inline 
>> >>> > > >> > >> requirements/attributes for a user function proto message 
>> >>> > > >> > >> and was separated just to make the proto smaller.
>> >>> > > >> > >>
>> >>> > > >> > >> A PTransform is an abstract concept for organizing the 
>> >>> > > >> > >> graph, not an executable thing. So a 
>> >>> > > >> > >> hint/capability/requirement on a PTransform only really 
>> >>> > > >> > >> makes sense as a scoping mechanism for applying a hint to a 
>> >>> > > >> > >> bunch of functions within a subgraph. This seems like a user 
>> >>> > > >> > >> interface concern and the SDK should own propagating the 
>> >>> > > >> > >> hints. If the hint truly applies to the whole PTransform and 
>> >>> > > >> > >> *not* the parts, then I am interested in learning about that.
>> >>> > > >> > >>
>> >>> > > >> > >> Kenn
>> >>> > > >> > >>
>> >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke 
>> >>> > > >> > >> <[email protected]> wrote:
>> >>> > > >> > >>>
>> >>> > > >> > >>> That's a good question.
>> >>> > > >> > >>>
>> >>> > > >> > >>> I think the main difference is a matter of scope. 
>> >>> > > >> > >>> Annotations would apply to a PTransform while an 
>> >>> > > >> > >>> environment applies to sets of transforms. A difference is 
>> >>> > > >> > >>> the optional nature of the annotations they don't affect 
>> >>> > > >> > >>> correctness. Runners don't need to do anything with them 
>> >>> > > >> > >>> and still execute the pipeline correctly.
>> >>> > > >> > >>>
>> >>> > > >> > >>> Consider a privacy analysis on a pipeline graph. An 
>> >>> > > >> > >>> annotation indicating that a transform provides a certain 
>> >>> > > >> > >>> level of anonymization can be used in an analysis to 
>> >>> > > >> > >>> determine if the downstream transforms are encountering raw 
>> >>> > > >> > >>> data or not.
>> >>> > > >> > >>>
>> >>> > > >> > >>> From my understanding (which can be wrong) environments are 
>> >>> > > >> > >>> rigid. Transforms in different environments can't be fused. 
>> >>> > > >> > >>> "This is the python env", "this is the java env" can't be 
>> >>> > > >> > >>> merged together. It's not clear to me that we have defined 
>> >>> > > >> > >>> when environments are safely fuseable outside of equality. 
>> >>> > > >> > >>> There's value in that simplicity.
>> >>> > > >> > >>>
>> >>> > > >> > >>> AFIACT environment has less to do with the machines a 
>> >>> > > >> > >>> pipeline is executing on than it does about the kinds of 
>> >>> > > >> > >>> SDK pipelines it understands and can execute.
>> >>> > > >> > >>>
>> >>> > > >> > >>>
>> >>> > > >> > >>>
>> >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova 
>> >>> > > >> > >>> <[email protected]> wrote:
>> >>> > > >> > >>>>>
>> >>> > > >> > >>>>>
>> >>> > > >> > >>>>> Another example of an optional annotation is marking a 
>> >>> > > >> > >>>>> transform to run on secure hardware, or to give hints to 
>> >>> > > >> > >>>>> profiling/dynamic analysis tools.
>> >>> > > >> > >>>>
>> >>> > > >> > >>>>
>> >>> > > >> > >>>> There seems to be a lot of overlap between this idea and 
>> >>> > > >> > >>>> Environments.  Can you talk about how you feel they may be 
>> >>> > > >> > >>>> different or related?  For example, I could see 
>> >>> > > >> > >>>> annotations as a way of tagging transforms with an 
>> >>> > > >> > >>>> Environment, or I could see Environments becoming a 
>> >>> > > >> > >>>> specialized form of annotation.
>> >>> > > >> > >>>>
>> >>> > > >> > >>>> -chad
>> >>> > > >> > >>>>
>> >>> > >

Re: PTransform Annotations Proposal

Reply via email to