On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles <k...@apache.org> wrote:

>
>
> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik <lc...@google.com> wrote:
>> >
>> > We can always detect on the runner/SDK side whether there is an unknown
>> field[1] within a payload and fail to process it, but this is painful in two
>> situations:
>> > 1) It doesn't provide a good error message, since you can't say what
>> the purpose of the field is. With a capability URN, the runner/SDK could
>> say which URN it doesn't understand.
>> > 2) It doesn't allow for the addition of fields which don't impact the
>> semantics of execution. For example, if the display data feature was being
>> developed, a runner could ignore it and still execute the pipeline
>> correctly.
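
(To make 1 concrete, here is a rough sketch of the unknown-field check,
assuming the Python protobuf runtime's UnknownFields() reflection API; the
rejection logic itself is illustrative only.)

from apache_beam.portability.api import beam_runner_api_pb2

def reject_if_unknown_fields(payload_bytes):
    # Sketch: UnknownFields() (Python protobuf runtime 3.8+) exposes only
    # field numbers, not their purpose, which is exactly why the resulting
    # error message is so vague.
    payload = beam_runner_api_pb2.ParDoPayload()
    payload.ParseFromString(payload_bytes)
    if payload.UnknownFields():
        raise ValueError(
            'ParDoPayload contains unrecognized fields; cannot tell '
            'whether they affect execution semantics.')
    return payload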
>>
>> Yeah, I don't think proto reflection is a flexible enough tool to do
>> this well either.
>>
>> > If we think this to be common enough, we can add a capabilities list to
>> the PTransform so each PTransform can do this and have a natural way of
>> being extended for additions which are forwards compatible. The alternative
>> to having capabilities on PTransform (and other constructs) is that we
>> would have a new URN when the specification of the transform changes. For
>> forwards-compatible changes, each SDK/runner would map older versions of
>> the URN onto the latest and internally treat it as the latest version, but
>> always downgrade it to the version the other party expects when
>> communicating with it. Backwards-incompatible changes would always require
>> a new URN, which capabilities at the PTransform level would not help with.
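
(As a rough sketch of that alternative, with entirely hypothetical
versioned URNs:)

# All URNs below are made up to illustrate the versioning scheme.
LATEST_PARDO_URN = 'beam:transform:pardo:v3'
OLDER_PARDO_URNS = {'beam:transform:pardo:v1', 'beam:transform:pardo:v2'}

def normalize(urn):
    # Map every known older version onto the latest and treat it as the
    # latest internally.
    return LATEST_PARDO_URN if urn in OLDER_PARDO_URNS else urn

def downgrade(urn, urn_expected_by_peer):
    # When communicating with an older SDK/runner, emit the version that
    # party expects instead of the latest.
    if urn == LATEST_PARDO_URN and urn_expected_by_peer in OLDER_PARDO_URNS:
        return urn_expected_by_peer
    return urn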
>>
>> As you point out, stateful+splittable may not be a particularly useful
>> combination, but as another example, we have
>> (backwards-incompatible-when-introduced) markers on DoFn as to whether
>> it requires finalization, stable inputs, and now time sorting. I don't
>> think we should have a new URN for each combination.
>>
>
> Agree with this. I don't think stateful, splittable, and "plain" ParDo are
> comparable to these. Each is an entirely different computational paradigm:
> per-element independent processing, per-key-and-window linear processing,
> and per-element-and-restriction splittable processing. Most relevant IMO is
> the nature of the parallelism. If you added state to splittable processing,
> it would still be splittable processing. Just as Combine and ParDo can
> share the SideInput specification, it is easy to share relevant
> sub-structures like state declarations. But it is a fair point that the
> ability to split can be ignored and run as a plain-old ParDo. It brings up
> the question of whether a runner that doesn't know SDF should have to
> reject it or should be allowed to run it poorly.
>

Being splittable means that the SDK could choose to return a continuation
saying "please process the rest of my element in X amount of time", which
would require the runner to inspect certain fields on responses. One
example would be a source that doesn't have many more messages to read
from a message stream at the moment; another would be detecting that a
filesystem is throttling us or is down, in which case we would like to
resume processing later.
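
For example (a sketch only; MyStreamRestrictionProvider and poll_stream
are hypothetical stand-ins, while RestrictionParam, try_claim, and
defer_remainder follow the Beam Python SDF API as I understand it):

import apache_beam as beam
from apache_beam.utils import timestamp

class ReadFromStreamFn(beam.DoFn):
  def process(
      self,
      element,
      tracker=beam.DoFn.RestrictionParam(MyStreamRestrictionProvider())):
    for message in poll_stream(element):  # hypothetical client call
      if not tracker.try_claim(message.offset):
        return
      yield message.payload
    # Nothing left to read right now: hand the residual restriction back to
    # the runner and ask to be resumed in 60 seconds. This only works if the
    # runner inspects the delayed-application fields on bundle responses.
    tracker.defer_remainder(timestamp.Duration(seconds=60))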


> It isn't a huge deal. Three different top-level URNs versus three
> different sub-URNs will achieve the same result in the end if we get this
> "capability" thing in place.
>
> Kenn
>
>
>>
>> >> > I do think that splittable ParDo and stateful ParDo should have
>> separate PTransform URNs since they are different paradigms than "vanilla"
>> ParDo.
>> >>
>> >> Here I disagree. What about one that is both splittable and stateful?
>> Would one have a fourth URN for that? If/when another flavor of DoFn comes
>> out, would we then want 8 distinct URNs? (SplittableParDo in particular can
>> be executed as a normal ParDo as long as the output is bounded.)
>> >
>> > I agree that you could have stateful and splittable DoFns where the
>> element is the key and you share state and timers across restrictions. No
>> runner is capable of executing this efficiently.
>> >
>> >> >> > On the SDK requirements side: the constructing SDK owns the
>> Environment proto completely, so it is in a position to ensure the involved
>> docker images support the necessary features.
>> >> >>
>> >> >> Yes.
>> >
>> >
>> > I believe capabilities do exist on a Pipeline, informing runners
>> about new types of fields to be aware of either within Components or on the
>> Pipeline object itself, but for this discussion it makes sense that an
>> environment would store most "capabilities" related to execution.
>> >
>> >> [snip]
>> >
>> > As for the proto clean-ups, the scope is to cover almost all things
>> needed for execution now and to follow up with optional transforms,
>> payloads, and coders later, which would exclude job management APIs and
>> artifact staging. A formal enumeration would be useful here. Also, we
>> should provide formal guidance about adding new fields, adding new types of
>> transforms, new types of proto messages, ... (best to describe this on a
>> case-by-case basis as to how people are trying to modify the protos and
>> evolve this guidance over time).
>>
>> What we need is the ability for (1) runners to reject future pipelines
>> they cannot faithfully execute and (2) runners to be able to take
>> advantage of advanced features/protocols when interacting with those
>> SDKs that understand them while avoiding them for older (or newer)
>> SDKs that don't. Let's call (1) (hard) requirements and (2) (optional)
>> capabilities.
>>
>> Where possible, I think this is best expressed inherently in the set
>> of transform (and possibly other component) URNs. For example, when an
>> SDK uses a combine_per_key composite, that's a signal that it
>> understands the various related combine_* transforms. Similarly, a
>> pipeline with a test_stream URN would be rejected by runners not
>> recognizing/supporting this primitive. However, this is not always
>> possible, e.g. for (1) we have the aforementioned boolean flags on
>> ParDo and for (2) we have features like large iterable and progress
>> support.
>>
>> For (1) we have to enumerate now everywhere a runner must look, as far
>> into the future as we want to remain backwards compatible. This is why
>> I suggested putting something on the pipeline itself, but we could
>> (likely in addition) add it to Transform and/or ParDoPayload if we
>> think that'd be useful now. (Note that a future pipeline-level
>> requirement could be "inspect (previously non-existent) requirements
>> field attached to objects of type X.")
>>
>> For (2) I think adding a capabilities field to the environment for now
>> makes the most sense, and as it's optional to inspect them, adding it
>> elsewhere if needed is backwards compatible. (The motivation to do it
>> now is that there are some capabilities that we'd like to enumerate
>> now rather than make part of the minimal set of things an SDK must
>> support.)
>>
>>
Agree on the separation of requirements from capabilities, where
requirements are a set of MUST-understand URNs while capabilities are a
set of MAY-understand URNs.
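
In other words (sketch only; the URN strings here are invented, not
finalized ones):

# Requirements are MUST understand: reject anything unknown outright.
SUPPORTED_REQUIREMENTS = {'beam:requirement:pardo:stable_input:v1'}

def validate(pipeline_requirements):
    unknown = set(pipeline_requirements) - SUPPORTED_REQUIREMENTS
    if unknown:
        raise ValueError(
            'Cannot faithfully execute pipeline; unknown requirements: %s'
            % sorted(unknown))

# Capabilities are MAY understand: use the protocol if advertised,
# otherwise quietly fall back.
def supports_progress(environment_capabilities):
    return 'beam:protocol:progress_reporting:v1' in environment_capabilities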


>> > All in all, I think "capabilities" is about informing runners about
>> what they should know and what they are allowed to do. If we go with
>> a list of "capabilities", we could always add a "parameterized
>> capabilities" URN which would tell runners they need to also look at some
>> other field.
>>
>> Good point. That lets us keep it as a list for now. (The risk is that
>> it makes possible the bug of populating parameters without adding the
>> required notification to the list.)
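
(That bug could be caught with a construction-time consistency check along
these lines; the field and URN names are invented:)

PARAMETERIZED_CAPABILITIES_URN = 'beam:capability:parameterized:v1'

def check_consistency(capabilities, capability_parameters):
    # Guard against populating the parameters field without advertising the
    # URN that tells consumers to look at it.
    if capability_parameters and (
        PARAMETERIZED_CAPABILITIES_URN not in capabilities):
        raise ValueError(
            'capability_parameters is set but the parameterized-capabilities '
            'URN is not in the capabilities list, so consumers would '
            'silently ignore it.')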
>>
>> > I also believe capabilities should NOT be "inherited". For example if
>> we define capabilities on a ParDoPayload, and on a PTransform and on
>> Environment, then ParDoPayload capabilities shouldn't be copied to
>> PTransform and PTransform specific capabilities shouldn't be copied to the
>> Environment. My reasoning about this is that some "capabilities" can only
>> be scoped to a single ParDoPayload or a single PTransform and wouldn't
>> apply generally everywhere. The best example I can think of is that
>> Environment A supports progress reporting while Environment B doesn't, so
>> it wouldn't make sense to say the "Pipeline" supports progress reporting.
>> >
>> > Are capabilities strictly different from "resources" (transform needs
>> Python package X) or "execution hints" (e.g. deploy on machines that have
>> GPUs, some generic but mostly runner specific hints)? At first glance I
>> would say yes.
>>
>> Agreed.
>>
>
