Re: PTransform Annotations Proposal

Mirac Vuslat Basaran Wed, 13 Jan 2021 13:21:50 -0800

Wrote a design draft for resource-related annotations. Please have a look: 
https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing


Cheers,
Mirac
On 2020/11/26 20:20:09, Mirac Vuslat Basaran <[email protected]> wrote: 
> Created a PR without unit tests at https://github.com/apache/beam/pull/13434. 
> Please have a look.
> 
> Thanks,
> Mirac
> 
> 
> On 2020/11/25 18:50:19, Robert Burke <[email protected]> wrote: 
> > Hmmm. Fair. I'm mostly concerned about the pathological case where we end
> > up with a distinct Environment per transform, but there are likely
> > practical cases where that's reasonable (High mem to GPU to TPU, to ARM....)
> > 
> > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw <[email protected]> wrote:
> > 
> > > I'd like to continue the discussion *and* see an implementation for
> > > the part we've settled on. I was asking why not have "every distinct
> > > physical concern means a distinct environment?"
> > >
> > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke <[email protected]> wrote:
> > > >
> > > > Mostly because perfect is the enemy of good enough. We have a proposal,
> > > we have clear boundaries for it. It's fine if the discussion continues, 
> > > but
> > > I see no evidence of concerns that should prevent starting an
> > > implementation, because it seems we'll need both anyway.
> > > >
> > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <[email protected]>
> > > wrote:
> > > >>
> > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <[email protected]>
> > > wrote:
> > > >> >
> > > >> > It sounds like we've come to the position that non-correctness
> > > affecting Ptransform Annotations are valuable at both leaf and composite
> > > levels, and don't remove the potential need for a similar feature on
> > > Environments, to handle physical concerns/requirements for worker processes
> > > to have (such as RAM, CPU, or GPU requirements).
> > > >> >
> > > >> > Kenn, it's not clear what part of the solution (an annotation field
> > > on the Ptransform proto message) would need to change to satisfy your 
> > > scope
> > > concern, beyond documenting unambiguously that these may not be used for
> > > physical concerns or things that affect correctness.
> > > >>
> > > >> I'll let Kenn answer as well, but from my point of view, explicitly
> > > >> having somewhere better to put these things would help.
> > > >>
> > > >> > I'm also unclear on your concern about the scope not matching, given the above.
> > > Your first paragraph reads very supportive of logical annotations on
> > > Ptransforms, and that matches 1-1 with the current proposed solution. Can
> > > you clarify your concern?
> > > >> >
> > > >> > I don't wish to scope creep on the physical requirements issue at
> > > this time. It seems we are agreed they should end up on environments, but
> > > I'm not seeing proposals on the right way to execute them at this time. They
> > > seem to be a fruitful topic of discussion, in particular
> > > unifying/consolidating them for efficient use of resources. I don't think
> > > we want to end up in a state where every distinct physical concern means a
> > > distinct environment.
> > > >>
> > > >> Why not? Assuming, of course, that runners are free to merge
> > > >> environments (merging those resource hints they understand and are
> > > >> otherwise compatible, and discarding those they don't) for efficient
> > > >> execution.
> > > >>
> > > >> > I for one am ready to see a PR.
> > > >>
> > > >> +1
> > > >>
> > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <[email protected]>
> > > wrote:
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> > > >> >> <[email protected]>
> > > wrote:
> > > >> >>>
> > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > [email protected]> wrote:
> > > >> >>> >
> > > >> >>> > Thanks everyone so much for their input and for the insightful
> > > discussion.
> > > >> >>> >
> > > >> >>> > Not being knowledgeable about Beam's internals, I have to say I
> > > am a bit lost on the PTransform vs. environment discussion.
> > > >> >>> >
> > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > annotation dependent, I don't think we can find a one-size-fits-all
> > > solution for that. So this might actually be an argument in favour of
> > > having annotations on PTransforms, since it avoids the conflation with
> > > environments.
> > > >> >>> >
> > > >> >>> > Also in general, I feel that having annotations per single
> > > transform (rather than composite) and on PTransforms could lead to a
> > > simpler design.
> > > >> >>>
> > > >> >>> If we want to use these for privacy, I don't see how attaching them
> > > to
> > > >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> > > composite.)
> > > >> >>>
> > > >> >>> > Seeing as there are valuable arguments in favour of both
> > > (PTransform and environments) with no clear(?) "best solution", I would
> > > propose moving forward with the initial (PTransform) design to ship the
> > > feature and unblock teams asking for it. If it turns out that there was
> > > indeed a need to have annotations in environments, we could always 
> > > refactor
> > > it.
> > > >> >>>
> > > >> >>> I have yet to see any arguments that resource-level hints, such as
> > > >> >>> memory or GPU, don't better belong on the environment. But moving
> > > >> >>> forward on PTransform-level ones for logical statements like 
> > > >> >>> privacy
> > > >> >>> declarations makes sense.
> > > >> >>
> > > >> >>
> > > >> >> Exactly this. Properties of transforms make sense. The properties
> > > may hold only of the whole subgraph. Even something as simple as 
> > > "preserves
> > > keys". This is analogous (but converse) to requirements like "requires
> > > sorted input" which were explicitly excluded from the scope, which was
> > > about hardware environment for execution.
> > > >> >>
> > > >> >> The proposed scope and the proposed solution do not match and need
> > > to be reconciled.
> > > >> >>
> > > >> >> Kenn
> > > >> >>
> > > >> >>> > On 2020/11/17 19:07:22, Robert Bradshaw <[email protected]>
> > > wrote:
> > > >> >>> > > So far we have two distinct usecases for annotations: resource
> > > hints
> > > >> >>> > > and privacy directives, and I've been trying to figure out how
> > > to
> > > >> >>> > > reconcile them, but they seem to have very different
> > > characteristics.
> > > >> >>> > > (It would be nice to come up with other uses as well to see if
> > > we're
> > > >> >>> > > really coming up with a generally useful mode--I think display
> > > data
> > > >> >>> > > could fit into this as a new kind of annotation rather than
> > > being a
> > > >> >>> > > top-level property, and it could make sense on both leaf and
> > > composite
> > > >> >>> > > transforms.)
> > > >> >>> > >
> > > >> >>> > > To me, resource hints like GPU are inextricably tied to the
> > > >> >>> > > environment. A transform tagged with GPU should reference a Fn
> > > that
> > > >> >>> > > invokes GPU-accelerated code that lives in a particular
> > > environment.
> > > >> >>> > > Something like high-mem is a bit squishier. Some DoFns take a
> > > lot of
> > > >> >>> > > memory, but on the other hand one could imagine labeling a
> > > CoGBK as
> > > >> >>> > > high-mem due to knowing that, in this particular usage, there
> > > will be
> > > >> >>> > > lots of values with the same key. Ideally runners would be
> > > intelligent
> > > >> >>> > > enough to automatically learn memory usage, but even in this
> > > case it
> > > >> >>> > > may be a good hint to try and learn the requirements for DoFn A
> > > and
> > > >> >>> > > DoFn B separately (which is difficult if they are always
> > > colocated,
> > > >> >>> > > but valuable if, e.g. A takes a huge amount of memory and B
> > > takes a
> > > >> >>> > > huge amount of wall time).
> > > >> >>> > >
> > > >> >>> > > Note that tying things to the environment does not preclude
> > > using them
> > > >> >>> > > in non-portable runners as they'll still have an SDK-level
> > > >> >>> > > representation (though I don't think we should have an explicit
> > > goal
> > > >> >>> > > of feature parity for non-portable runners, e.g. multi-language
> > > isn't
> > > >> >>> > > happening, and hope that non-portable runners go away soon
> > > anyway).
> > > >> >>> > >
> > > >> >>> > > Now let's consider privacy annotations. To make things very
> > > concrete,
> > > >> >>> > > imagine a transform AverageSpendPerZipCode which takes as input
> > > (user,
> > > >> >>> > > zip, spend), all users unique, and returns (zip, avg(spend)). 
> > > >> >>> > > In
> > > >> >>> > > Python, this is GroupBy('zip').aggregate_field('spend',
> > > >> >>> > > MeanCombineFn()). This is not very privacy preserving to those
> > > users
> > > >> >>> > > who are the only (or one of a few) in a zip code. So we could
> > > define a
> > > >> >>> > > transform PrivacyPreservingAverageSpendPerZipCode as
> > > >> >>> > >
> > > >> >>> > > @ptransform_fn
> > > >> >>> > > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold):
> > > >> >>> > >     # Number of users per zip code (users are unique in the input).
> > > >> >>> > >     counts_per_zip = spend_per_user | GroupBy('zip').aggregate_field(
> > > >> >>> > >         'user', CountCombineFn())
> > > >> >>> > >     spend_per_zip = spend_per_user | GroupBy('zip').aggregate_field(
> > > >> >>> > >         'spend', MeanCombineFn())
> > > >> >>> > >     # Keep only zip codes with more than `threshold` users.
> > > >> >>> > >     filtered = spend_per_zip | beam.Filter(
> > > >> >>> > >         lambda x, counts: counts[x.zip] > threshold,
> > > >> >>> > >         counts=AsMap(counts_per_zip))
> > > >> >>> > >     return filtered
> > > >> >>> > >
> > > >> >>> > > We now have a composite that has privacy preserving properties
> > > (i.e.
> > > >> >>> > > the input may be quite sensitive, but the output is not,
> > > depending on
> > > >> >>> > > the value of threshold). What is interesting here is that it is
> > > only
> > > >> >>> > > the composite that has this property--no individual
> > > sub-transform is
> > > >> >>> > > itself privacy preserving. Furthermore, an optimizer may notice
> > > we're
> > > >> >>> > > doing aggregation on the same key twice and rewrite this using
> > > >> >>> > > (logically)
> > > >> >>> > >
> > > >> >>> > >     GroupBy('zip').aggregate_field('user', CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> > > >> >>> > >
> > > >> >>> > > and then applying the filter, which is semantically equivalent
> > > and
> > > >> >>> > > satisfies the privacy annotations (and notably that does not
> > > even
> > > >> >>> > > require the optimizer to interpret the annotations, just pass
> > > them
> > > >> >>> > > on). To me, this implies that these annotations belong on the
> > > >> >>> > > composites, and *not* on the leaf nodes (where they would be
> > > >> >>> > > incorrect).
> > > >> >>> > >
> > > >> >>> > > I'll leave aside most questions of API until we figure out the
> > > model
> > > >> >>> > > semantics, but wanted to throw one possible idea out (though I
> > > am
> > > >> >>> > > ambivalent about it). Instead of attaching things to
> > > transforms, we
> > > >> >>> > > can just wrap transforms in composites that have no role other
> > > than
> > > >> >>> > > declaring information about their contents. E.g. we could have 
> > > >> >>> > > a
> > > >> >>> > > composite transform whose payload is simply an assertion of the
> > > >> >>> > > privacy (or resource?) properties of its inner structure. This
> > > would
> > > >> >>> > > be just as expressive as adding new properties to transforms
> > > >> >>> > > themselves (but would add an extra level of nesting, and make
> > > >> >>> > > respecting the precise nesting more important).
> > > >> >>> > >
> > > >> >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <
> > > [email protected]> wrote:
> > > >> >>> > > >
> > > >> >>> > > > +1 to discussing PCollection annotations on a separate
> > > thread. It would be confusing to mix them up.
> > > >> >>> > > >
> > > >> >>> > > > -----------
> > > >> >>> > > >
> > > >> >>> > > > The question around conflicts is interesting, but confusing
> > > to me. I don't think they exist in general. I keep coming back around to
> > > that it depends on the annotation and the purpose of composites.
> > > Optionality saves us here too.
> > > >> >>> > > >
> > > >> >>> > > > Composites are nothing without their internal hypergraph
> > > structure. Eventually it comes down to executing the leaf nodes. The
> > > alternative to executing the leaf nodes is when the composite represents a
> > > known transform and is replaced by the runner at submission time. Let's
> > > look at each.
> > > >> >>> > > >
> > > >> >>> > > > If there's a property that only exists on the leaf nodes,
> > > then it's not possible to bubble up that property to the composite in all
> > > cases. After all, it's not necessarily the case that a privacy preserving
> > > transform maintains the property for all output edges as not all such 
> > > edges
> > > pass through the preserving transform.
> > > >> >>> > > >
> > > >> >>> > > > On the other hand, with memory or gpu recommendations, that
> > > might set a low bar on the composite level.
> > > >> >>> > > >
> > > >> >>> > > > But, composites (any transform really) can be runner
> > > replaced. I think it's fair to say that a runner replaced composite is not
> > > beholden to the annotations of the original leaf transforms, especially
> > > around physical requirements. The implementations are different. If a 
> > > known
> > > composite at the composite level requires GPUs and its known replacement
> > > doesn't, I'd posit that replacement was a choice the runner made since it
> > > can't provision machines with GPUs.
> > > >> >>> > > >
> > > >> >>> > > > But, crucially around privacy annotated transforms, a runner
> > > likely shouldn't replace a given subgraph that contains a privacy
> > > annotated transform unless the replacements provide the same level of
> > > privacy. However, such replacements only happen with well known 
> > > transforms
> > > with known properties anyway, so this can serve as an additional layer of
> > > validation for a runner aware of the properties.
> > > >> >>> > > >
> > > >> >>> > > > This brings me back to my position: that the notion of
> > > conflicts is very annotation dependent, and that defining them as optional
> > > is the most important feature to avoid issues. Conflicts don't exist as an
> > > inherent property of annotations on ptransform of the hypergraph 
> > > structure.
> > > Am I wrong? No one has come up with an actual example of a conflict as far
> > > as I understand the thread.
> > > >> >>> > > >
> > > >> >>> > > > Even Reuven's original question is more about whether the
> > > runner is forced to look at leaf nodes rather than only looking at the
> > > composite. Assuming the composite isn't replaced, the runner needs to look
> > > at the leaf nodes regardless. And as discussed above there's no 
> > > generalized
> > > semantics that fit for all kinds of annotations, once replacements are 
> > > also
> > > considered.
> > > >> >>> > > >
> > > >> >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía 
> > > >> >>> > > > <[email protected]>
> > > wrote:
> > > >> >>> > > >>
> > > >> >>> > > >> +1 Nice to see there is finally interest in this.
> > > Annotations for
> > > >> >>> > > >> PTransforms make total sense!
> > > >> >>> > > >>
> > > >> >>> > > >> The semantics should be strictly optional for runners and
> > > correct
> > > >> >>> > > >> execution should not be affected by lack of support of any
> > > annotation.
> > > >> >>> > > >> We should however keep the set of annotations small.
> > > >> >>> > > >>
> > > >> >>> > > >> > PTransforms are hierarchical - namely a PTransform
> > > contains other PTransforms, and so on. Is the runner expected to resolve
> > > all annotations down to leaf nodes? What happens if that results in
> > > conflicting annotations?
> > > >> >>> > > >>
> > > >> >>> > > >> +1 to this question, This needs to be detailed.
> > > >> >>> > > >>
> > > >> >>> > > >> I am curious about how the end user APIs of this will look
> > > maybe in
> > > >> >>> > > >> Java or Python, just an extra method to inject a Map or via
> > > Java
> > > >> >>> > > >> annotations/Python decorators?
> > > >> >>> > > >>
> > > >> >>> > > >> We might prefer not to mix the concepts of annotations and
> > > >> >>> > > >> environments because this will limit the scope of
> > > annotations.
> > > >> >>> > > >> Annotations are different from environments because they
> > > serve a more
> > > >> >>> > > >> general idea: to express an intention and it is up to the
> > > runner to
> > > >> >>> > > >> choose the strategy to accomplish this, for example in the
> > > GPU
> > > >> >>> > > >> assignation case it could be to rewrite resource allocation
> > > via
> > > >> >>> > > >> Environments but it could also just delegate this to a
> > > resource
> > > >> >>> > > >> manager which is what we could do for example for GPU (or
> > > data
> > > >> >>> > > >> locality) cases on the Spark/Flink classic runners. If we
> > > tie this to
> > > >> >>> > > >> environments we will leave classic runners out of the loop
> > > for no
> > > >> >>> > > >> major reason and also not cover use cases not related to
> > > resource
> > > >> >>> > > >> allocation.
> > > >> >>> > > >>
> > > >> >>> > > >> I do not understand the use case to justify PCollection
> > > annotations
> > > >> >>> > > >> but to not mix this thread with them, would you be
> > > interested to
> > > >> >>> > > >> elaborate more about them in a different thread Jan?
> > > >> >>> > > >>
> > > >> >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <
> > > [email protected]> wrote:
> > > >> >>> > > >> >
> > > >> >>> > > >> > I agree things like GPU, high-mem, etc. belong to the
> > > environment. If
> > > >> >>> > > >> > annotations are truly advisory, one can imagine merging
> > > environments
> > > >> >>> > > >> > by taking the union of annotations and still producing a
> > > correct
> > > >> >>> > > >> > pipeline. (This would mean that annotations would have to
> > > be a
> > > >> >>> > > >> > multi-map...)
> > > >> >>> > > >> >
> > > >> >>> > > >> > On the other hand, this doesn't seem to handle the case of
> > > privacy
> > > >> >>> > > >> > analysis, which could apply to composites without applying
> > > to any
> > > >> >>> > > >> > individual component, and don't really make sense as part
> > > of a
> > > >> >>> > > >> > fusion/execution story.
> > > >> >>> > > >> >
> > > >> >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <
> > > [email protected]> wrote:
> > > >> >>> > > >> > >
> > > >> >>> > > >> > > That's good historical context.
> > > >> >>> > > >> > >
> > > >> >>> > > >> > > But then we'd still need to codify that the annotation would
> > > need to be optional, and not affect correctness.
> > > >> >>> > > >> > >
> > > >> >>> > > >> > > Conflicts become easier to manage, (as environments with
> > > conflicting annotations simply don't get merged, and stay as distinct
> > > environments) but are still notionally annotation dependent. Do most
> > > runners handle environments so individually though?
> > > >> >>> > > >> > >
> > > >> >>> > > >> > > Reuven's question is a good one though. For the Go SDK,
> > > and the proposed implementation I saw, they only applied to leaf nodes.
> > > This is an artifact of how the Go SDK handles composites. Nothing stops it
> > > from being implemented on the composites Go has, but it didn't make sense
> > > otherwise. AFAICT Composites are generally for organizational convenience
> > > and not for functional aspects. Is this wrong? After all, does it make 
> > > sense
> > > for environments to be on leaves and composites either? It's the same 
> > > issue
> > > there.
> > > >> >>> > > >> > >
> > > >> >>> > > >> > >
> > > >> >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <
> > > [email protected]> wrote:
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> I am +1 to the proposal but believe it should be moved
> > > to the Environment. I could be convinced otherwise, but would want to
> > > really understand the details.
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> I think we haven't done a great job communicating the
> > > purpose of the Environment proto. It was explicitly created for this
> > > purpose.
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> 1. It tells the runner things it needs to know to
> > > interpret the DoFn (or other UDF). So these are the existing proto fields
> > > like docker image (in the payload) and required artifacts that were 
> > > staged.
> > > >> >>> > > >> > >> 2. It is also the place for additional requirements or
> > > hints like "high mem" or "GPU" etc.
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> Every user function has an associated environment, and
> > > environments exist only for the purpose of executing user functions. In
> > > fact, Environment originated as inline requirements/attributes for a user
> > > function proto message and was separated just to make the proto smaller.
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> A PTransform is an abstract concept for organizing the
> > > graph, not an executable thing. So a hint/capability/requirement on a
> > > PTransform only really makes sense as a scoping mechanism for applying a
> > > hint to a bunch of functions within a subgraph. This seems like a user
> > > interface concern and the SDK should own propagating the hints. If the 
> > > hint
> > > truly applies to the whole PTransform and *not* the parts, then I am
> > > interested in learning about that.
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> Kenn
> > > >> >>> > > >> > >>
> > > >> >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <
> > > [email protected]> wrote:
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>> That's a good question.
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>> I think the main difference is a matter of scope.
> > > Annotations would apply to a PTransform while an environment applies to
> > > sets of transforms. A difference is the optional nature of the annotations:
> > > they don't affect correctness. Runners don't need to do anything with them
> > > and still execute the pipeline correctly.
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>> Consider a privacy analysis on a pipeline graph. An
> > > annotation indicating that a transform provides a certain level of
> > > anonymization can be used in an analysis to determine if the downstream
> > > transforms are encountering raw data or not.
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>> From my understanding (which can be wrong)
> > > environments are rigid. Transforms in different environments can't be
> > > fused. "This is the python env", "this is the java env" can't be merged
> > > together. It's not clear to me that we have defined when environments are
> > > safely fuseable outside of equality. There's value in that simplicity.
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>> AFAICT environment has less to do with the machines a
> > > pipeline is executing on than it does about the kinds of SDK pipelines it
> > > understands and can execute.
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>>
> > > >> >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <
> > > [email protected]> wrote:
> > > >> >>> > > >> > >>>>>
> > > >> >>> > > >> > >>>>>
> > > >> >>> > > >> > >>>>> Another example of an optional annotation is marking
> > > a transform to run on secure hardware, or to give hints to
> > > profiling/dynamic analysis tools.
> > > >> >>> > > >> > >>>>
> > > >> >>> > > >> > >>>>
> > > >> >>> > > >> > >>>> There seems to be a lot of overlap between this idea
> > > and Environments.  Can you talk about how you feel they may be different 
> > > or
> > > related?  For example, I could see annotations as a way of tagging
> > > transforms with an Environment, or I could see Environments becoming a
> > > specialized form of annotation.
> > > >> >>> > > >> > >>>>
> > > >> >>> > > >> > >>>> -chad
> > > >> >>> > > >> > >>>>
> > > >> >>> > >
> > >
> > 
>
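
The environment-merging idea raised in the thread (union the advisory hints a runner understands, discard the ones it doesn't, and keep incompatible environments distinct) could look roughly like the sketch below. This is only an illustration in plain Python: `Environment`, `merge_environments`, the `understood` set, and the hint names are all hypothetical and are not part of Beam's API or protos.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set


@dataclass
class Environment:
    """Hypothetical stand-in for an environment: an SDK image plus advisory hints."""
    image: str
    # Multi-map of hint name -> set of values, as suggested in the thread.
    hints: Dict[str, FrozenSet[str]] = field(default_factory=dict)


def merge_environments(
    a: Environment, b: Environment, understood: Set[str]
) -> Optional[Environment]:
    """Return a merged environment, or None if the two must stay distinct.

    Assumption for this sketch: environments are compatible only when they
    use the same image. Hints the runner does not understand are dropped;
    understood hints are merged by taking the union of their values.
    """
    if a.image != b.image:
        return None
    merged: Dict[str, FrozenSet[str]] = {}
    for env in (a, b):
        for name, values in env.hints.items():
            if name in understood:
                merged[name] = merged.get(name, frozenset()) | values
    return Environment(a.image, merged)
```

Because the hints are advisory, dropping an unknown hint changes only how efficiently the pipeline runs, not whether it runs correctly, which is the property the thread relies on.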

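The point made in the thread that a privacy property can hold for a composite without holding for any of its sub-transforms can be illustrated with a toy transform tree. This is a hedged sketch, not Beam's data model: the `Transform` class, the `annotated_with` helper, and the `privacy_preserving` key are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transform:
    """Hypothetical, simplified transform node (leaf or composite)."""
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)
    subtransforms: List["Transform"] = field(default_factory=list)


def annotated_with(root: Transform, key: str) -> List[str]:
    """Collect the names of all transforms in the tree that carry `key`."""
    found = [root.name] if key in root.annotations else []
    for child in root.subtransforms:
        found.extend(annotated_with(child, key))
    return found


# The annotation sits on the composite only; none of the leaves carries it.
composite = Transform(
    "PrivacyPreservingAverageSpendPerZipCode",
    annotations={"privacy_preserving": "threshold"},
    subtransforms=[
        Transform("CountUsersPerZip"),
        Transform("MeanSpendPerZip"),
        Transform("FilterSmallZips"),
    ],
)
pipeline = Transform("pipeline", subtransforms=[Transform("Read"), composite])
print(annotated_with(pipeline, "privacy_preserving"))
```

An analysis tool that understands the key can treat the composite's output as safe without inspecting the leaves, while a runner that ignores annotations simply executes the leaves as usual.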