Re: PTransform Annotations Proposal

Kenneth Knowles Mon, 16 Nov 2020 15:45:52 -0800

I am +1 to the proposal but believe it should be moved to the Environment.
I could be convinced otherwise, but would want to really understand the
details.

I think we haven't done a great job communicating the purpose of the
Environment proto. It was explicitly created for this purpose.

1. It tells the runner things it needs to know to interpret the DoFn (or
other UDF). So these are the existing proto fields like docker image (in
the payload) and required artifacts that were staged.
2. It is also the place for additional requirements or hints like "high
mem" or "GPU" etc.

Every user function has an associated environment, and environments exist
only for the purpose of executing user functions. In fact, Environment
originated as inline requirements/attributes for a user function proto
message and was separated just to make the proto smaller.

A PTransform is an abstract concept for organizing the graph, not an
executable thing. So a hint/capability/requirement on a PTransform only
really makes sense as a scoping mechanism for applying a hint to a bunch of
functions within a subgraph. This seems like a user interface concern and
the SDK should own propagating the hints. If the hint truly applies to the
whole PTransform and *not* the parts, then I am interested in learning
about that.
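
To make the scoping idea concrete, here is a minimal sketch in Python of an SDK pushing hints from a composite down to the environments of the user functions inside it. Everything in it is an assumption for illustration: the `Environment` and `Node` classes are hypothetical stand-ins rather than the actual Beam protos or SDK types, and the hint names (`accelerator`, `min_ram`) are made up.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Environment:
    """Hypothetical stand-in for the Environment proto."""
    container_image: str
    # Hints stored as a sorted tuple of (key, value) pairs so that
    # environments can be compared for equality.
    resource_hints: Tuple[Tuple[str, str], ...] = ()


@dataclass
class Node:
    """Hypothetical transform node: a composite, or a leaf that runs a user function."""
    name: str
    hints: Dict[str, str] = field(default_factory=dict)
    environment: Optional[Environment] = None  # set only on executable leaves
    children: List["Node"] = field(default_factory=list)


def propagate_hints(node: Node, inherited: Optional[Dict[str, str]] = None) -> None:
    """Push hints from enclosing scopes down onto leaf environments."""
    # Hints on an inner scope override hints from an outer scope.
    scoped = {**(inherited or {}), **node.hints}
    if node.environment is not None:
        # Hints already on the environment are the most specific, so they win.
        merged = {**scoped, **dict(node.environment.resource_hints)}
        node.environment = replace(
            node.environment, resource_hints=tuple(sorted(merged.items()))
        )
    for child in node.children:
        propagate_hints(child, scoped)


# Usage: a composite carries a hint; only its executable leaves receive it.
py_env = Environment(container_image="example/python-sdk:latest")
parse = Node("Parse", environment=py_env)
score = Node("Score", hints={"min_ram": "16GB"}, environment=py_env)
composite = Node("Model", hints={"accelerator": "gpu"}, children=[parse, score])
propagate_hints(composite)
```

In this sketch the composite itself never gets an environment; it only scopes the hint. The two leaves also end up with unequal environments, which would matter if fusion is decided by environment equality.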

Kenn

On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com> wrote:

> That's a good question.
>
> I think the main difference is a matter of scope. Annotations would apply
> to a PTransform while an environment applies to sets of transforms. Another
> difference is that annotations are optional: they don't affect
> correctness. A runner can ignore them entirely and still execute
> the pipeline correctly.
>
> Consider a privacy analysis on a pipeline graph. An annotation indicating
> that a transform provides a certain level of anonymization can be used in
> an analysis to determine if the downstream transforms are encountering raw
> data or not.
>
> From my understanding (which can be wrong) environments are rigid.
> Transforms in different environments can't be fused. "This is the python
> env", "this is the java env" can't be merged together. It's not clear to me
> that we have defined when environments are safely fuseable outside of
> equality. There's value in that simplicity.
>
> AFAICT environment has less to do with the machines a pipeline is
> executing on than it does with the kinds of SDK pipelines it understands
> and can execute.
>
>
>
> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <chad...@gmail.com> wrote:
>
>>
>>> Another example of an optional annotation is marking a transform to run
>>> on secure hardware, or to give hints to profiling/dynamic analysis tools.
>>>
>>
>> There seems to be a lot of overlap between this idea and Environments.
>> Can you talk about how you feel they may be different or related?  For
>> example, I could see annotations as a way of tagging transforms with an
>> Environment, or I could see Environments becoming a specialized form of
>> annotation.
>>
>> -chad
>>
>>
