Re: PTransform Annotations Proposal

Ismaël Mejía Tue, 17 Nov 2020 06:36:14 -0800

+1 Nice to see there is finally interest in this. Annotations for
PTransforms make total sense!


The semantics should be strictly optional for runners and correct
execution should not be affected by lack of support of any annotation.
We should however keep the set of annotations small.

> PTransforms are hierarchical - namely a PTransform contains other 
> PTransforms, and so on. Is the runner expected to resolve all annotations 
> down to leaf nodes? What happens if that results in conflicting annotations?

+1 to this question. This needs to be spelled out in detail.
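
To make the question concrete, here is a minimal sketch of one possible
resolution policy. It is not Beam code: `Node`, `resolve`, and the
"innermost value wins, record the override" rule are all assumptions made
up for illustration, and the proposal itself does not specify any of them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a PTransform in the pipeline graph."""
    name: str
    annotations: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def resolve(node, inherited=None):
    """Return ({leaf name: annotations}, [conflicts]).

    Assumed policy: the innermost value wins, and every override is
    recorded as a conflict so an SDK or runner could surface it.
    """
    inherited = inherited or {}
    conflicts = [
        (node.name, key, inherited[key], value)
        for key, value in node.annotations.items()
        if key in inherited and inherited[key] != value
    ]
    merged = {**inherited, **node.annotations}
    if not node.children:
        return {node.name: merged}, conflicts
    leaves = {}
    for child in node.children:
        child_leaves, child_conflicts = resolve(child, merged)
        leaves.update(child_leaves)
        conflicts.extend(child_conflicts)
    return leaves, conflicts


# A composite annotated one way, with one leaf annotated the other way.
pipeline = Node("Composite", {"gpu": "required"}, [
    Node("Parse"),
    Node("Score", {"gpu": "none"}),
])
leaves, conflicts = resolve(pipeline)
```

Other policies (outermost wins, reject the pipeline, keep both values in a
multi-map) are equally plausible, which is why this needs to be decided
explicitly.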

I am curious how the end-user APIs for this would look, say in Java or
Python: just an extra method to inject a Map, or Java
annotations/Python decorators?
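
For illustration only, the two shapes could look roughly like this in
Python. Everything here is hypothetical: `Transform` is a plain stand-in
class rather than Beam's PTransform, and `with_annotations` and `annotate`
are invented names, not existing or proposed Beam APIs.

```python
class Transform:
    """Hypothetical stand-in for a PTransform; not the Beam class."""
    annotations = {}

    def with_annotations(self, extra):
        # Shape 1: an extra method that injects a map on one instance.
        self.annotations = {**self.annotations, **extra}
        return self


def annotate(**extra):
    # Shape 2: a class decorator that attaches the map declaratively.
    def wrap(cls):
        cls.annotations = {**cls.annotations, **extra}
        return cls
    return wrap


@annotate(privacy="anonymized")
class Anonymize(Transform):
    pass


per_instance = Transform().with_annotations({"gpu": "preferred"})
per_class = Anonymize()
```

The method form annotates a single application of a transform, while the
decorator form annotates every instance of the class, so the two are not
interchangeable from the user's point of view.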

We might prefer not to mix the concepts of annotations and
environments, because doing so would limit the scope of annotations.
Annotations differ from environments in that they serve a more
general purpose: they express an intention, and it is up to the runner
to choose a strategy to fulfil it. In the GPU assignment case, for
example, a runner could rewrite resource allocation via Environments,
but it could also simply delegate to a resource manager, which is
what we could do for GPU (or data locality) cases on the Spark/Flink
classic runners. If we tie annotations to environments, we leave
classic runners out of the loop for no major reason, and we also fail
to cover use cases unrelated to resource allocation.
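
A minimal sketch of that idea, under the assumption that each runner keeps
its own table of annotation handlers and silently skips anything it does
not recognise. The function `plan`, the handler tables, and the annotation
keys are hypothetical names, not part of any runner.

```python
def plan(annotations, handlers):
    """Return actions for the annotations this runner understands.

    Unknown annotations are skipped, so the pipeline stays correct
    on runners that support none of them.
    """
    actions = []
    for key, value in sorted(annotations.items()):
        handler = handlers.get(key)
        if handler is not None:
            actions.append(handler(value))
    return actions


# Same intention, different strategy per (hypothetical) runner.
portable = {"gpu": lambda v: f"rewrite environment: gpu={v}"}
classic = {"gpu": lambda v: f"ask resource manager: gpu={v}"}
minimal = {}

hints = {"gpu": "1", "privacy": "anonymized"}
portable_plan = plan(hints, portable)
classic_plan = plan(hints, classic)
minimal_plan = plan(hints, minimal)
```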

I do not understand the use case that justifies PCollection
annotations, but so as not to mix them into this thread: Jan, would
you be willing to elaborate on them in a separate thread?

On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <[email protected]> wrote:
>
> I agree things like GPU, high-mem, etc. belong to the environment. If
> annotations are truly advisory, one can imagine merging environments
> by taking the union of annotations and still producing a correct
> pipeline. (This would mean that annotations would have to be a
> multi-map...)
>
> On the other hand, this doesn't seem to handle the case of privacy
> analysis, which could apply to composites without applying to any
> individual component, and doesn't really make sense as part of a
> fusion/execution story.
>
> On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <[email protected]> wrote:
> >
> > That's good historical context.
> >
> > But then we'd still need to codify that annotations must be 
> > optional and must not affect correctness.
> >
> > Conflicts become easier to manage (as environments with conflicting 
> > annotations simply don't get merged, and stay as distinct environments) but 
> > are still notionally annotation dependent. Do most runners handle 
> > environments so individually though?
> >
> > Reuven's question is a good one though. For the Go SDK, and the proposed 
> > implementation I saw, annotations only applied to leaf nodes. This is an 
> > artifact of how the Go SDK handles composites. Nothing stops it from being 
> > implemented on the composites Go has, but it didn't make sense otherwise. 
> > AFAICT composites are generally for organizational convenience and not for 
> > functional aspects. Is this wrong? After all, does it make sense for 
> > environments to be on both leaves and composites either? It's the same 
> > issue there.
> >
> >
> > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <[email protected]> wrote:
> >>
> >> I am +1 to the proposal but believe it should be moved to the Environment. 
> >> I could be convinced otherwise, but would want to really understand the 
> >> details.
> >>
> >> I think we haven't done a great job communicating the purpose of the 
> >> Environment proto. It was explicitly created for this purpose.
> >>
> >> 1. It tells the runner things it needs to know to interpret the DoFn (or 
> >> other UDF). So these are the existing proto fields like docker image (in 
> >> the payload) and required artifacts that were staged.
> >> 2. It is also the place for additional requirements or hints like "high 
> >> mem" or "GPU" etc.
> >>
> >> Every user function has an associated environment, and environments exist 
> >> only for the purpose of executing user functions. In fact, Environment 
> >> originated as inline requirements/attributes for a user function proto 
> >> message and was separated just to make the proto smaller.
> >>
> >> A PTransform is an abstract concept for organizing the graph, not an 
> >> executable thing. So a hint/capability/requirement on a PTransform only 
> >> really makes sense as a scoping mechanism for applying a hint to a bunch 
> >> of functions within a subgraph. This seems like a user interface concern 
> >> and the SDK should own propagating the hints. If the hint truly applies to 
> >> the whole PTransform and *not* the parts, then I am interested in learning 
> >> about that.
> >>
> >> Kenn
> >>
> >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <[email protected]> wrote:
> >>>
> >>> That's a good question.
> >>>
> >>> I think the main difference is a matter of scope. Annotations would apply 
> >>> to a PTransform while an environment applies to sets of transforms. 
> >>> Another difference is the optional nature of annotations: they don't 
> >>> affect correctness. Runners don't need to do anything with them and can 
> >>> still execute the pipeline correctly.
> >>>
> >>> Consider a privacy analysis on a pipeline graph. An annotation indicating 
> >>> that a transform provides a certain level of anonymization can be used in 
> >>> an analysis to determine if the downstream transforms are encountering 
> >>> raw data or not.
> >>>
> >>> From my understanding (which can be wrong) environments are rigid. 
> >>> Transforms in different environments can't be fused. "This is the python 
> >>> env", "this is the java env" can't be merged together. It's not clear to 
> >>> me that we have defined when environments are safely fuseable outside of 
> >>> equality. There's value in that simplicity.
> >>>
> >>> AFAICT an environment has less to do with the machines a pipeline is 
> >>> executing on than with the kinds of SDK pipelines it understands 
> >>> and can execute.
> >>>
> >>>
> >>>
> >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> Another example of an optional annotation is marking a transform to run 
> >>>>> on secure hardware, or to give hints to profiling/dynamic analysis 
> >>>>> tools.
> >>>>
> >>>>
> >>>> There seems to be a lot of overlap between this idea and Environments.  
> >>>> Can you talk about how you feel they may be different or related?  For 
> >>>> example, I could see annotations as a way of tagging transforms with an 
> >>>> Environment, or I could see Environments becoming a specialized form of 
> >>>> annotation.
> >>>>
> >>>> -chad
> >>>>
