Re: PTransform Annotations Proposal

Sunil Pedapudi Tue, 02 Feb 2021 13:01:34 -0800

Thanks very much, Mirac. I added a few comments as well, and the proposal looks 
good to me. Thanks for all the discussion.


On 2021/01/14 20:24:19, Robert Bradshaw <rober...@google.com> wrote: 
> Thanks. I added some comments on the proposal.
> 
> On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran <mir...@google.com>
> wrote:
> 
> > Wrote a design draft for resource-related annotations. Please have a look:
> > https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing
> >
> > Cheers,
> > Mirac
> > On 2020/11/26 20:20:09, Mirac Vuslat Basaran <mir...@google.com> wrote:
> > > Created a PR without unit tests at
> > https://github.com/apache/beam/pull/13434. Please have a look.
> > >
> > > Thanks,
> > > Mirac
> > >
> > >
> > > On 2020/11/25 18:50:19, Robert Burke <rob...@frantil.com> wrote:
> > > > Hmmm. Fair. I'm mostly concerned about the pathological case where we
> > end
> > > > up with a distinct Environment per transform, but there are likely
> > > > practical cases where that's reasonable (High mem to GPU to TPU, to
> > ARM....)
> > > >
> > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw <rober...@google.com>
> > wrote:
> > > >
> > > > > I'd like to continue the discussion *and* see an implementation for
> > > > > the part we've settled on. I was asking why not have "every distinct
> > > > > physical concern means a distinct environment?"
> > > > >
> > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke <rob...@frantil.com>
> > wrote:
> > > > > >
> > > > > > Mostly because perfect is the enemy of good enough. We have a
> > proposal,
> > > > > we have clear boundaries for it. It's fine if the discussion
> > continues, but
> > > > > I see no evidence of concerns that should prevent starting an
> > > > > implementation, because it seems we'll need both anyway.
> > > > > >
> > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <
> > rober...@google.com>
> > > > > wrote:
> > > > > >>
> > > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <rob...@frantil.com
> > >
> > > > > wrote:
> > > > > >> >
> > > > > >> > It sounds like we've come to the position that non-correctness
> > > > > affecting Ptransform Annotations are valuable at both leaf and
> > composite
> > > > > levels, and don't remove the potential need for a similar feature on
> > > > > Environments, to handle physical concerns equirements for worker
> > processes
> > > > > to have (such as Ram, CPU, or GPU requirements.)
> > > > > >> >
> > > > > >> > Kenn, it's not clear what part of the solution (an annotation
> > field
> > > > > on the Ptransform proto message) would need to change to satisfy
> > your scope
> > > > > concern, beyond documenting unambiguously that these may not be used
> > for
> > > > > physical concerns or things that affect correctness.
> > > > > >>
> > > > > >> I'll let Kenn answer as well, but from my point of view,
> > explicitly
> > > > > >> having somewhere better to put these things would help.
> > > > > >>
> > > > > >> > I'm also unclear your scope concern not matching, given the
> > above.
> > > > > Your first paragraph reads very supportive of logical annotations on
> > > > > Ptransforms, and that matches 1-1 with the current proposed
> > solution. Can
> > > > > you clarify your concern?
> > > > > >> >
> > > > > >> > I don't wish to scope creep on the physical requirements issue
> > at
> > > > > this time. It seems we are agreed they should end up on
> > environments, but
> > > > > I'm not seeing proposals on the right way to execute them at this
> > time.They
> > > > > seem to be a fruitful topic of discussion, in particular
> > > > > unifying/consolidating them for efficient use of resources. I don't
> > think
> > > > > we want to end up in a state where every distinct physical concern
> > means a
> > > > > distinct environment.
> > > > > >>
> > > > > >> Why not? Assuming, of course, that runners are free to merge
> > > > > >> environments (merging those resource hints they understand and are
> > > > > >> otherwise compatible, and discarding those they don't) for
> > efficient
> > > > > >> execution.
> > > > > >>
> > > > > >> > I for one am ready to see a PR.
> > > > > >>
> > > > > >> +1
> > > > > >>
> > > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <k...@apache.org>
> > > > > wrote:
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <
> > rober...@google.com>
> > > > > wrote:
> > > > > >> >>>
> > > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > > > mir...@google.com> wrote:
> > > > > >> >>> >
> > > > > >> >>> > Thanks everyone so much for their input and for the
> > insightful
> > > > > discussion.
> > > > > >> >>> >
> > > > > >> >>> > Not being knowledgeable about Beam's internals, I have to
> > say I
> > > > > am a bit lost on the PTransform vs. environment discussion.
> > > > > >> >>> >
> > > > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > > > annotation dependent, I don't think we can find a one-size-fits-all
> > > > > solution for that. So this might be actually be an argument in
> > favour of
> > > > > having annotations on PTransforms, since it avoids the conflation
> > with
> > > > > environments.
> > > > > >> >>> >
> > > > > >> >>> > Also in general, I feel that having annotations per single
> > > > > transform (rather than composite) and on PTransforms could lead to a
> > > > > simpler design.
> > > > > >> >>>
> > > > > >> >>> If we want to use these for privacy, I don't see how
> > attaching them
> > > > > to
> > > > > >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> > > > > composite.)
> > > > > >> >>>
> > > > > >> >>> > Seeing as there are valuable arguments in favour of both
> > > > > (PTransform and environments) with no clear(?) "best solution", I
> > would
> > > > > propose moving forward with the initial (PTransform) design to ship
> > the
> > > > > feature and unblock teams asking for it. If it turns out that there
> > was
> > > > > indeed a need to have annotations in environments, we could always
> > refactor
> > > > > it.
> > > > > >> >>>
> > > > > >> >>> I have yet to see any arguments that resource-level hints,
> > such as
> > > > > >> >>> memory or GPU, don't better belong on the environment. But
> > moving
> > > > > >> >>> forward on PTransform-level ones for logical statements like
> > privacy
> > > > > >> >>> declarations makes sense.
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> Exactly this. Properties of transforms make sense. The
> > properties
> > > > > may hold only of the whole subgraph. Even something as simple as
> > "preserves
> > > > > keys". This is analogous (but converse) to requirements like
> > "requires
> > > > > sorted input" which were explicitly excluded from the scope, which
> > was
> > > > > about hardware environment for execution.
> > > > > >> >>
> > > > > >> >> The proposed scope and the proposed solution do not match and
> > need
> > > > > to be reconciled.
> > > > > >> >>
> > > > > >> >> Kenn
> > > > > >> >>
> > > > > >> >>> > On 2020/11/17 19:07:22, Robert Bradshaw <
> > rober...@google.com>
> > > > > wrote:
> > > > > >> >>> > > So far we have two distinct usecases for annotations:
> > resource
> > > > > hints
> > > > > >> >>> > > and privacy directives, and I've been trying to figure
> > out how
> > > > > to
> > > > > >> >>> > > reconcile them, but they seem to have very different
> > > > > characteristics.
> > > > > >> >>> > > (It would be nice to come up with other uses as well to
> > see if
> > > > > we're
> > > > > >> >>> > > really coming up with a generally useful mode--I think
> > display
> > > > > data
> > > > > >> >>> > > could fit into this as a new kind of annotation rather
> > than
> > > > > being a
> > > > > >> >>> > > top-level property, and it could make sense on both leaf
> > and
> > > > > composite
> > > > > >> >>> > > transforms.)
> > > > > >> >>> > >
> > > > > >> >>> > > To me, resource hints like GPU are inextricably tied to
> > the
> > > > > >> >>> > > environment. A transform tagged with GPU should reference
> > a Fn
> > > > > that
> > > > > >> >>> > > invokes GPU-accelerated code that lives in a particular
> > > > > environment.
> > > > > >> >>> > > Something like high-mem is a bit squishier. Some DoFns
> > take a
> > > > > lot of
> > > > > >> >>> > > memory, but on the other hand one could imagine labeling a
> > > > > CoGBK as
> > > > > >> >>> > > high-mem due to knowing that, in this particular usage,
> > there
> > > > > will be
> > > > > >> >>> > > lots of values with the same key. Ideally runners would be
> > > > > intelligent
> > > > > >> >>> > > enough to automatically learn memory usage, but even in
> > this
> > > > > case it
> > > > > >> >>> > > may be a good hint to try and learn the requirements for
> > DoFn A
> > > > > and
> > > > > >> >>> > > DoFn B separately (which is difficult if they are always
> > > > > colocated,
> > > > > >> >>> > > but valuable if, e.g. A takes a huge amount of memory and
> > B
> > > > > takes a
> > > > > >> >>> > > huge amount of wall time).
> > > > > >> >>> > >
> > > > > >> >>> > > Note that tying things to the environment does not
> > preclude
> > > > > using them
> > > > > >> >>> > > in non-portable runners as they'll still have an SDK-level
> > > > > >> >>> > > representation (though I don't think we should have an
> > explicit
> > > > > goal
> > > > > >> >>> > > of feature parity for non-portable runners, e.g.
> > multi-language
> > > > > isn't
> > > > > >> >>> > > happening, and hope that non-portable runners go away soon
> > > > > anyway).
> > > > > >> >>> > >
> > > > > >> >>> > > Now let's consider privacy annotations. To make things
> > very
> > > > > concrete,
> > > > > >> >>> > > imagine a transform AverageSpendPerZipCode which takes as
> > input
> > > > > (user,
> > > > > >> >>> > > zip, spend), all users unique, and returns (zip,
> > avg(spend)). In
> > > > > >> >>> > > Python, this is GroupBy('zip').aggregate_field('spend',
> > > > > >> >>> > > MeanCombineFn()). This is not very privacy preserving to
> > those
> > > > > users
> > > > > >> >>> > > who are the only (or one of a few) in a zip code. So we
> > could
> > > > > define a
> > > > > >> >>> > > transform PrivacyPreservingAverageSpendPerZipCode as
> > > > > >> >>> > >
> > > > > >> >>> > > @ptransform_fn
> > > > > >> >>> > > def
> > PrivacyPreservingAverageSpendPerZipCode(spend_per_user,
> > > > > threshold)
> > > > > >> >>> > >     counts_per_zip = spend_per_user |
> > > > > >> >>> > > GroupBy('zip').aggregate_field('user', CountCombineFn())
> > > > > >> >>> > >     spend_per_zip = spend_per_user |
> > > > > >> >>> > > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> > > > > >> >>> > >     filtered = spend_per_zip | beam.Filter(
> > > > > >> >>> > >         lambda x, counts: counts[x.zip] > threshold,
> > > > > >> >>> > > counts=AsMap(counts_per_zip))
> > > > > >> >>> > >     return filtered
> > > > > >> >>> > >
> > > > > >> >>> > > We now have a composite that has privacy preserving
> > properties
> > > > > (i.e.
> > > > > >> >>> > > the input may be quite sensitive, but the output is not,
> > > > > depending on
> > > > > >> >>> > > the value of threshold). What is interesting here is that
> > it is
> > > > > only
> > > > > >> >>> > > the composite that has this property--no individual
> > > > > sub-transform is
> > > > > >> >>> > > itself privacy preserving. Furthermore, an optimizer may
> > notice
> > > > > we're
> > > > > >> >>> > > doing aggregation on the same key twice and rewrite this
> > using
> > > > > >> >>> > > (logically)
> > > > > >> >>> > >
> > > > > >> >>> > >     GroupBy('zip').aggregate_field('user',
> > > > > >> >>> > > CountCombineFn()).aggregate_field('spend',
> > MeanCombineFn())
> > > > > >> >>> > >
> > > > > >> >>> > > and then applying the filter, which is semantically
> > equivalent
> > > > > and
> > > > > >> >>> > > satisfies the privacy annotations (and notably that does
> > not
> > > > > even
> > > > > >> >>> > > require the optimizer to interpret the annotations, just
> > pass
> > > > > them
> > > > > >> >>> > > on). To me, this implies that these annotations belong on
> > the
> > > > > >> >>> > > composites, and *not* on the leaf nodes (where they would
> > be
> > > > > >> >>> > > incorrect).
> > > > > >> >>> > >
> > > > > >> >>> > > I'll leave aside most questions of API until we figure
> > out the
> > > > > model
> > > > > >> >>> > > semantics, but wanted to throw one possible idea out
> > (though I
> > > > > am
> > > > > >> >>> > > ambivalent about it). Instead of attaching things to
> > > > > transforms, we
> > > > > >> >>> > > can just wrap transforms in composites that have no role
> > other
> > > > > than
> > > > > >> >>> > > declaring information about their contents. E.g. we could
> > have a
> > > > > >> >>> > > composite transform whose payload is simply an assertion
> > of the
> > > > > >> >>> > > privacy (or resource?) properties of its inner structure.
> > This
> > > > > would
> > > > > >> >>> > > be just as expressive as adding new properties to
> > transforms
> > > > > >> >>> > > themselves (but would add an extra level of nesting, and
> > make
> > > > > >> >>> > > respecting the precice nesting more important).
> > > > > >> >>> > >
> > > > > >> >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <
> > > > > rob...@frantil.com> wrote:
> > > > > >> >>> > > >
> > > > > >> >>> > > > +1 to discussing PCollection annotations on a separate
> > > > > thread. It would be confusing to mix them up.
> > > > > >> >>> > > >
> > > > > >> >>> > > > -----------
> > > > > >> >>> > > >
> > > > > >> >>> > > > The question around conflicts is interesting, but
> > confusing
> > > > > to me. I don't think they exist in general. I keep coming back
> > around to
> > > > > that it depends on the annotation and the purpose of composites.
> > > > > Optionality saves us here too.
> > > > > >> >>> > > >
> > > > > >> >>> > > > Composites are nothing without their internal hypergraph
> > > > > structure. Eventually it comes down to executing the leaf nodes. The
> > > > > alternative to executing the leaf nodes is when the composite
> > represents a
> > > > > known transform and is replaced by the runner on submission time.
> > Lets
> > > > > look at each.
> > > > > >> >>> > > >
> > > > > >> >>> > > > If there's a property that only exists on the leaf
> > nodes,
> > > > > then it's not possible to bubble up that property to the composite
> > in all
> > > > > cases. Afterall, it's not necessarily the case that a privacy
> > preserving
> > > > > transform maintains the property for all output edges as not all
> > such edges
> > > > > pass through the preserving transform.
> > > > > >> >>> > > >
> > > > > >> >>> > > > On the other hand, with memory or gpu recommendations,
> > that
> > > > > might set a low bar on the composite level.
> > > > > >> >>> > > >
> > > > > >> >>> > > > But, composites (any transform really) can be runner
> > > > > replaced. I think it's fair to say that a runner replaced composite
> > is not
> > > > > beholden to the annotations of the original leaf transforms,
> > especially
> > > > > around physical requirements. The implementations are different. If
> > a known
> > > > > composite at the composite level requires GPUs and it's known
> > replacement
> > > > > doesn't, I'd posit that replacement was a choice the runner made
> > since it
> > > > > can't provision machines with GPUs.
> > > > > >> >>> > > >
> > > > > >> >>> > > > But, crucially around privacy annotated transforms, a
> > runner
> > > > > likely shouldn't replace a given subgraph that contains a privacy
> > > > > annotationed transform unless the replacements provide the same
> > level of
> > > > > privacy. However, such replacements only happens with well known
> > transforms
> > > > > with known properties anyway, so this can serve as an additional
> > layer of
> > > > > validation for a runner aware of the properties.
> > > > > >> >>> > > >
> > > > > >> >>> > > > This brings me back to my position: that the notion of
> > > > > conflicts is very annotation dependant, and that defining them as
> > optional
> > > > > is the most important feature to avoid issues. Conflicts don't exist
> > as an
> > > > > inherent property of annotations on ptransform of the hypergraph
> > structure.
> > > > > Am i wrong? No one has come up with an actual example of a conflict
> > as far
> > > > > as i understand the thread.
> > > > > >> >>> > > >
> > > > > >> >>> > > > Even Reuven's original question is more about whether
> > the
> > > > > runner is forced to look at leaf bodes rather than only looking at
> > the
> > > > > composite. Assuming the composite isn't replaced, the runner needs
> > to look
> > > > > at the leaf nodes regardless. And as discussed above there's no
> > generalized
> > > > > semantics that fit for all kinds of annotations, once replacements
> > are also
> > > > > considered.
> > > > > >> >>> > > >
> > > > > >> >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <
> > ieme...@gmail.com>
> > > > > wrote:
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> +1 Nice to see there is finally interest on this.
> > > > > Annotations for
> > > > > >> >>> > > >> PTransforms make total sense!
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> The semantics should be strictly optional for runners
> > and
> > > > > correct
> > > > > >> >>> > > >> execution should not be affected by lack of support of
> > any
> > > > > annotation.
> > > > > >> >>> > > >> We should however keep the set of annotations small.
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> > PTransforms are hierarchical - namely a PTransform
> > > > > contains other PTransforms, and so on. Is the runner expected to
> > resolve
> > > > > all annotations down to leaf nodes? What happens if that results in
> > > > > conflicting annotations?
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> +1 to this question, This needs to be detailed.
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> I am curious about how the end user APIs of this will
> > look
> > > > > maybe in
> > > > > >> >>> > > >> Java or Python, just an extra method to inject a Map
> > or via
> > > > > Java
> > > > > >> >>> > > >> annotations/Python decorators?
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> We might prefer not to mix the concepts of annotations
> > and
> > > > > >> >>> > > >> environments because this will limit the scope of
> > > > > annotations.
> > > > > >> >>> > > >> Annotations are different from environments because
> > they
> > > > > serve a more
> > > > > >> >>> > > >> general idea: to express an intention and it is up to
> > the
> > > > > runner to
> > > > > >> >>> > > >> choose the strategy to accomplish this, for example in
> > the
> > > > > GPU
> > > > > >> >>> > > >> assignation case it could be to rewrite resource
> > allocation
> > > > > via
> > > > > >> >>> > > >> Environments but it could also just delegate this to a
> > > > > resource
> > > > > >> >>> > > >> manager which is what we could do for example for GPU
> > (or
> > > > > data
> > > > > >> >>> > > >> locality) cases on the Spark/Flink classic runners. If
> > we
> > > > > tie this to
> > > > > >> >>> > > >> environments we will leave classic runners out of the
> > loop
> > > > > for no
> > > > > >> >>> > > >> major reason and also not cover use cases not related
> > to
> > > > > resource
> > > > > >> >>> > > >> allocation.
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> I do not understand the use case to justify PCollection
> > > > > annotations
> > > > > >> >>> > > >> but to not mix this thread with them, would you be
> > > > > interested to
> > > > > >> >>> > > >> elaborate more about them in a different thread Jan?
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <
> > > > > rober...@google.com> wrote:
> > > > > >> >>> > > >> >
> > > > > >> >>> > > >> > I agree things like GPU, high-mem, etc. belong to the
> > > > > environment. If
> > > > > >> >>> > > >> > annotations are truly advisory, one can imagine
> > merging
> > > > > environments
> > > > > >> >>> > > >> > by taking the union of annotations and still
> > producing a
> > > > > correct
> > > > > >> >>> > > >> > pipeline. (This would mean that annotations would
> > have to
> > > > > be a
> > > > > >> >>> > > >> > multi-map...)
> > > > > >> >>> > > >> >
> > > > > >> >>> > > >> > On the other hand, this doesn't seem to handle the
> > case of
> > > > > privacy
> > > > > >> >>> > > >> > analysis, which could apply to composites without
> > applying
> > > > > to any
> > > > > >> >>> > > >> > individual component, and don't really make sense as
> > part
> > > > > of a
> > > > > >> >>> > > >> > fusion/execution story.
> > > > > >> >>> > > >> >
> > > > > >> >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <
> > > > > rob...@frantil.com> wrote:
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > That's good historical context.
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > But then we'd still need to codify the annotation
> > would
> > > > > need to be optional, and not affect correctness.
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > Conflicts become easier to manage, (as
> > environments with
> > > > > conflicting annotations simply don't get merged, and stay as distinct
> > > > > environments) but are still notionally annotation dependant. Do most
> > > > > runners handle environments so individually though?
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > Reuven's question is a good one though. For the Go
> > SDK,
> > > > > and the proposed implementation i saw, they only applied to leaf
> > nodes.
> > > > > This is an artifact of how the Go SDK handles composites. Nothing
> > stops it
> > > > > from being implemented on the composites Go has, but it didn't make
> > sense
> > > > > otherwise. AFAICT Composites are generally for organizational
> > convenience
> > > > > and not for functional aspects. Is this wrong? Afterall, does it
> > make sense
> > > > > for environments to be on leaves and composites either? It's the
> > same issue
> > > > > there.
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <
> > > > > k...@apache.org> wrote:
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> I am +1 to the proposal but believe it should be
> > moved
> > > > > to the Environment. I could be convinced otherwise, but would want to
> > > > > really understand the details.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> I think we haven't done a great job communicating
> > the
> > > > > purpose of the Environment proto. It was explicitly created for this
> > > > > purpose.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> 1. It tells the runner things it needs to know to
> > > > > interpret the DoFn (or other UDF). So these are the existing proto
> > fields
> > > > > like docker image (in the payload) and required artifacts that were
> > staged.
> > > > > >> >>> > > >> > >> 2. It is also the place for additional
> > requirements or
> > > > > hints like "high mem" or "GPU" etc.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> Every user function has an associated
> > environment, and
> > > > > environments exist only for the purpose of executing user functions.
> > In
> > > > > fact, Environment originated as inline requirements/attributes for a
> > user
> > > > > function proto message and was separated just to make the proto
> > smaller.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> A PTransform is an abstract concept for
> > organizing the
> > > > > graph, not an executable thing. So a hint/capability/requirement on a
> > > > > PTransform only really makes sense as a scoping mechanism for
> > applying a
> > > > > hint to a bunch of functions within a subgraph. This seems like a
> > user
> > > > > interface concern and the SDK should own propagating the hints. If
> > the hint
> > > > > truly applies to the whole PTransform and *not* the parts, then I am
> > > > > interested in learning about that.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> Kenn
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <
> > > > > rob...@frantil.com> wrote:
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> That's a good question.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> I think the main difference is a matter of scope.
> > > > > Annotations would apply to a PTransform while an environment applies
> > to
> > > > > sets of transforms. A difference is the optional nature of the
> > annotations
> > > > > they don't affect correctness. Runners don't need to do anything
> > with them
> > > > > and still execute the pipeline correctly.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> Consider a privacy analysis on a pipeline graph.
> > An
> > > > > annotation indicating that a transform provides a certain level of
> > > > > anonymization can be used in an analysis to determine if the
> > downstream
> > > > > transforms are encountering raw data or not.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> From my understanding (which can be wrong)
> > > > > environments are rigid. Transforms in different environments can't be
> > > > > fused. "This is the python env", "this is the java env" can't be
> > merged
> > > > > together. It's not clear to me that we have defined when
> > environments are
> > > > > safely fuseable outside of equality. There's value in that
> > simplicity.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> AFIACT environment has less to do with the
> > machines a
> > > > > pipeline is executing on than it does about the kinds of SDK
> > pipelines it
> > > > > understands and can execute.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <
> > > > > chad...@gmail.com> wrote:
> > > > > >> >>> > > >> > >>>>>
> > > > > >> >>> > > >> > >>>>>
> > > > > >> >>> > > >> > >>>>> Another example of an optional annotation is
> > marking
> > > > > a transform to run on secure hardware, or to give hints to
> > > > > profiling/dynamic analysis tools.
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > > >> > >>>> There seems to be a lot of overlap between this
> > idea
> > > > > and Environments.  Can you talk about how you feel they may be
> > different or
> > > > > related?  For example, I could see annotations as a way of tagging
> > > > > transforms with an Environment, or I could see Environments becoming
> > a
> > > > > specialized form of annotation.
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > > >> > >>>> -chad
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > >
> > > > >
> > > >
> > >
> >
>

Re: PTransform Annotations Proposal

Reply via email to