On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik <lc...@google.com> wrote:
>
> We can always detect on the runner/SDK side whether there is an unknown 
> field[1] within a payload and fail to process it but this is painful in two 
> situations:
> 1) It doesn't provide for a good error message since you can't say what the 
> purpose of the field is. With a capability URN, the runner/SDK could say 
> which URN it doesn't understand.
> 2) It doesn't allow for the addition of fields which don't impact semantics 
> of execution. For example, if the display data feature was being developed, a 
> runner could ignore it and still execute the pipeline correctly.

Yeah, I don't think proto reflection is a flexible enough tool to do
this well either.
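To make the error-message point concrete: with raw unknown-field detection, all a runner can report is a field number, whereas a requirement URN names the feature it lacks. A minimal sketch of the contrast — the URNs and payload shapes below are made up for illustration, not real Beam URNs:

```python
# Hypothetical sketch: unknown proto fields vs. requirement URNs.
# All names and URNs here are illustrative.

def reject_via_unknown_fields(unknown_field_numbers):
    """All the runner can say is which field number it didn't expect."""
    if unknown_field_numbers:
        raise ValueError(
            f"Unknown field(s) {sorted(unknown_field_numbers)} in payload")

# The set of requirement URNs this (imaginary) runner knows how to honor.
SUPPORTED_REQUIREMENTS = {"beam:requirement:pardo:stable_input:v1"}

def reject_via_requirement_urns(requirement_urns):
    """With URNs, the error names the exact feature the runner lacks."""
    unsupported = set(requirement_urns) - SUPPORTED_REQUIREMENTS
    if unsupported:
        raise ValueError(f"Runner does not support: {sorted(unsupported)}")
```

The second form also permits semantically-irrelevant additions (like display data) to be ignored safely, since they simply would not appear in the requirements list.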

> If we think this to be common enough, we can add a capabilities list to the 
> PTransform so each PTransform can do this and has a natural way of being 
> extended for additions which are forwards compatible. The alternative to 
> having capabilities on PTransform (and other constructs) is that we would 
> have a new URN when the specification of the transform changes. For forwards 
> compatible changes, each SDK/runner would map older versions of the URN onto 
> the latest and internally treat it as the latest version but always downgrade 
> it to the version the other party expects when communicating with it. 
> Backwards incompatible changes would always require a new URN which 
> capabilities at the PTransform level would not help with.

As you point out, stateful+splittable may not be a particularly useful
combination, but as another example, we have
(backwards-incompatible-when-introduced) markers on DoFn as to whether
it requires finalization, stable inputs, and now time sorting. I don't
think we should have a new URN for each combination.

>> > I do think that splittable ParDo and stateful ParDo should have separate 
>> > PTransform URNs since they are different paradigms than "vanilla" ParDo.
>>
>> Here I disagree. What about one that is both splittable and stateful? Would 
>> one have a fourth URN for that? If/when another flavor of DoFn comes out, 
>> would we then want 8 distinct URNs? (SplittableParDo in particular can be 
>> executed as a normal ParDo as long as the output is bounded.)
>
> I agree that you could have stateful and splittable DoFns where the element 
> is the key and you share state and timers across restrictions. No runner is 
> capable of executing this efficiently.
>
>> >> > On the SDK requirements side: the constructing SDK owns the Environment 
>> >> > proto completely, so it is in a position to ensure the involved docker 
>> >> > images support the necessary features.
>> >>
>> >> Yes.
>
>
> I believe capabilities do exist on a Pipeline: they inform runners about 
> new types of fields to be aware of, either within Components or on the 
> Pipeline object itself. But for this discussion it makes sense that an 
> environment would store most "capabilities" related to execution.
>
>> [snip]
>
> As for the proto clean-ups, the scope is to cover almost all things needed 
> for execution now and to follow-up with optional transforms, payloads, and 
> coders later which would exclude job management APIs and artifact staging. A 
> formal enumeration would be useful here. Also, we should provide formal 
> guidance about adding new fields, adding new types of transforms, new types 
> of proto messages, ... (best to describe this on a case by case basis as to 
> how people are trying to modify the protos and evolve this guidance over 
> time).

What we need is the ability for (1) runners to reject future pipelines
they cannot faithfully execute and (2) runners to be able to take
advantage of advanced features/protocols when interacting with those
SDKs that understand them while avoiding them for older (or newer)
SDKs that don't. Let's call (1) (hard) requirements and (2) (optional)
capabilities.
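The two mechanisms behave differently on the runner side: requirements are fail-closed (reject the pipeline on anything unknown), while capabilities are opt-in (use a protocol only if the SDK environment advertises it). A small sketch under those definitions — the URNs are illustrative:

```python
# Illustrative sketch of (1) hard requirements vs. (2) optional
# capabilities, with made-up URNs.

# Requirements this (imaginary) runner can faithfully honor.
RUNNER_SUPPORTED_REQUIREMENTS = {"beam:requirement:pardo:stable_input:v1"}

def validate_requirements(pipeline_requirements):
    """(1) Fail-closed: reject any pipeline with an unknown requirement."""
    unknown = set(pipeline_requirements) - RUNNER_SUPPORTED_REQUIREMENTS
    if unknown:
        raise ValueError(
            f"Cannot faithfully execute; unknown requirements: {sorted(unknown)}")

def use_progress_reporting(environment_capabilities):
    """(2) Opt-in: use the protocol only if the environment advertises it."""
    return "beam:protocol:progress_reporting:v1" in environment_capabilities
```

The asymmetry is the key property: an older runner that never heard of a capability simply doesn't use it, but an older runner that never heard of a requirement must refuse to run the pipeline.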

Where possible, I think this is best expressed inherently in the set
of transform (and possibly other component) URNs. For example, when an
SDK uses a combine_per_key composite, that's a signal that it
understands the various related combine_* transforms. Similarly, a
pipeline with a test_stream URN would be rejected by runners not
recognizing/supporting this primitive. However, this is not always
possible, e.g. for (1) we have the aforementioned boolean flags on
ParDo and for (2) we have features like large iterable and progress
support.

For (1) we have to enumerate now everywhere a runner must look, as far
into the future as we want to remain backwards compatible. This is why
I suggested putting something on the pipeline itself, but we could
(likely in addition) add it to Transform and/or ParDoPayload if we
think that'd be useful now. (Note that a future pipeline-level
requirement could be "inspect (previously non-existent) requirements
field attached to objects of type X.")

For (2) I think adding a capabilities field to the environment for now
makes the most sense, and since inspecting capabilities is optional,
adding the field elsewhere later if needed is backwards compatible.
(The motivation to do it
now is that there are some capabilities that we'd like to enumerate
now rather than make part of the minimal set of things an SDK must
support.)

> All in all, I think "capabilities" is about informing a runner about what 
> they should know about and what they are allowed to do. If we go with a list 
> of "capabilities", we could always add a "parameterized capabilities" urn 
> which would tell runners they need to also look at some other field.

Good point. That lets us keep it as a list for now. (The risk is that
it makes possible the bug of populating parameters without adding the
required notification to the list.)
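The "parameterized capabilities" idea keeps the field a flat list of URNs: one special URN acts as a signal that the runner must also inspect a separate parameters field. A sketch, with hypothetical URNs, that also shows the bug risk mentioned above:

```python
# Hypothetical sketch of "parameterized capabilities": the capabilities
# field stays a flat list; a special URN tells the runner to also look
# at a separate parameters field. URNs are illustrative.
PARAMETERIZED_URN = "beam:capability:parameterized:v1"

def effective_capabilities(capability_urns, parameters=None):
    caps = set(capability_urns)
    if PARAMETERIZED_URN in caps and parameters:
        # Fold in whatever the parameters field contributes
        # (modeled here as extra capability URNs).
        caps.update(parameters)
    # Bug risk noted above: if parameters is populated but the special
    # URN is missing from the list, the parameters are silently ignored.
    return caps
```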

> I also believe capabilities should NOT be "inherited". For example if we 
> define capabilities on a ParDoPayload, and on a PTransform and on 
> Environment, then ParDoPayload capabilities shouldn't be copied to PTransform 
> and PTransform specific capabilities shouldn't be copied to the Environment. 
> My reasoning about this is that some "capabilities" can only be scoped to a 
> single ParDoPayload or a single PTransform and wouldn't apply generally 
> everywhere. The best example I could think of is that Environment A supports 
> progress reporting while Environment B doesn't so it wouldn't have made sense 
> to say the "Pipeline" supports progress reporting.
>
> Are capabilities strictly different from "resources" (transform needs python 
> package X) or "execution hints" (e.g. deploy on machines that have GPUs, some 
> generic but mostly runner specific hints)? At first glance I would say yes.

Agreed.
