+1 to deferring for now. Since they should not be modified after adoption,
it makes sense not to get ahead of ourselves.

On Thu, Feb 13, 2020, 10:59 AM Robert Bradshaw <[email protected]> wrote:

> On Thu, Feb 13, 2020 at 10:12 AM Robert Burke <[email protected]> wrote:
> >
> > One thing that doesn't appear to have been suggested yet is we could
> "batch" urns together under a "super urn" so that adding one super urn is
> like adding each of the represented batch of features. This prevents
> needing to send dozens of urns to be individually sent over.
> >
> >
> > The super urns would need to be static after definition to avoid
> mismatched definitions down the road.
> >
> > We collect together urns what is reasonably consider "vX" support, and
> can then increment that later.
> >
> > This would simplify new SDKs, as they can have a goal of initial v1
> support as we define what level of feature support it has, and doesn't
> prevent new capabilities from being added incrementally.
>
> Yes, this is a very good idea. I've also been thinking of certain sets
> of common operations/well known DoFns that often occur on opposite
> sides of GBKs (e.g. the pair-with-one, sum-ints, drop-keys, ...) that
> are commonly supported that could be grouped under these meta-urns.
>
> Note that these need not be monotonic, for example a current v1 might
> be requiring LengthPrefixCoderV1, but if a more efficient
> LengthPrefixCoderV2 comes along eventually v2 could require that and
> *not* require the old, now rarely used LengthPrefixCoderV1.
>
> Probably makes sense to defer adding such super-urns until we notice a
> set that is commonly used together in practice.
>
> Of course there's still value in SDKs being able to support features
> piecemeal as well, which is the big reason we're avoiding a simple
> monotonically-increasing version number.
>
> > Similarly, certain features sets could stand alone, eg around SQL. It's
> benefitial for optimization reasons if an SDK has native projection and UDF
> support for example, which a runner could take advantage of by avoiding
> extra cross language hops. These could then also be grouped under a SQL
> super urn.
> >
> > This is from the SDK capability side of course, rather than the SDK
> pipeline requirements side.
> >
> > -------
> > Related to that last point, it might be good to nail down early the
> perspective used when discussing these things, as there's a dual between
> "what and SDK can do", and "what the runner will do to a pipeline that the
> SDK can understand" (eg. Combiner lifting, and state backed iterables), as
> well as "what the pipeline requires from the runner" and "what the runner
> is able to do" (eg. Requires sorted input)
> >
> >
> > On Thu, Feb 13, 2020, 9:06 AM Luke Cwik <[email protected]> wrote:
> >>
> >>
> >>
> >> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles <[email protected]>
> wrote:
> >>>
> >>>
> >>>
> >>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw <[email protected]>
> wrote:
> >>>>
> >>>> On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik <[email protected]> wrote:
> >>>> >
> >>>> > We can always detect on the runner/SDK side whether there is an
> unknown field[1] within a payload and fail to process it but this is
> painful in two situations:
> >>>> > 1) It doesn't provide for a good error message since you can't say
> what the purpose of the field is. With a capability URN, the runner/SDK
> could say which URN it doesn't understand.
> >>>> > 2) It doesn't allow for the addition of fields which don't impact
> semantics of execution. For example, if the display data feature was being
> developed, a runner could ignore it and still execute the pipeline
> correctly.
> >>>>
> >>>> Yeah, I don't think proto reflection is a flexible enough tool to do
> >>>> this well either.
> >>>>
> >>>> > If we think this to be common enough, we can add capabilities list
> to the PTransform so each PTransform can do this and has a natural way of
> being extended for additions which are forwards compatible. The alternative
> to having capabilities on PTransform (and other constructs) is that we
> would have a new URN when the specification of the transform changes. For
> forwards compatible changes, each SDK/runner would map older versions of
> the URN onto the latest and internally treat it as the latest version but
> always downgrade it to the version the other party expects when
> communicating with it. Backwards incompatible changes would always require
> a new URN which capabilities at the PTransform level would not help with.
> >>>>
> >>>> As you point out, stateful+splittable may not be a particularly useful
> >>>> combination, but as another example, we have
> >>>> (backwards-incompatible-when-introduced) markers on DoFn as to whether
> >>>> it requires finalization, stable inputs, and now time sorting. I don't
> >>>> think we should have a new URN for each combination.
> >>>
> >>>
> >>> Agree with this. I don't think stateful, splittable, and "plain" ParDo
> are comparable to these. Each is an entirely different computational
> paradigm: per-element independent processing, per-key-and-window linear
> processing, and per-element-and-restriction splittable processing. Most
> relevant IMO is the nature of the parallelism. If you added state to
> splittable processing, it would still be splittable processing. Just as
> Combine and ParDo can share the SideInput specification, it is easy to
> share relevant sub-structures like state declarations. But it is a fair
> point that the ability to split can be ignored and run as a plain-old
> ParDo. It brings up the question of whether a runner that doesn't know SDF
> is should have to reject it or should be allowed to run poorly.
> >>
> >>
> >> Being splittable means that the SDK could choose to return a
> continuation saying please process the rest of my element in X amount of
> time which would require the runner to inspect certain fields on responses.
> One example would be I don't have many more messages to read from this
> message stream at the moment and another example could be that I detected
> that this filesystem is throttling me or is down and I would like to resume
> processing later.
> >>
> >>>
> >>> It isn't a huge deal. Three different top-level URNS versus three
> different sub-URNs will achieve the same result in the end if we get this
> "capability" thing in place.
> >>>
> >>> Kenn
> >>>
> >>>>
> >>>>
> >>>> >> > I do think that splittable ParDo and stateful ParDo should have
> separate PTransform URNs since they are different paradigms than "vanilla"
> ParDo.
> >>>> >>
> >>>> >> Here I disagree. What about one that is both splittable and
> stateful? Would one have a fourth URN for that? If/when another flavor of
> DoFn comes out, would we then want 8 distinct URNs? (SplitableParDo in
> particular can be executed as a normal ParDo as long as the output is
> bounded.)
> >>>> >
> >>>> > I agree that you could have stateful and splittable dofns where the
> element is the key and you share state and timers across restrictions. No
> runner is capable of executing this efficiently.
> >>>> >
> >>>> >> >> > On the SDK requirements side: the constructing SDK owns the
> Environment proto completely, so it is in a position to ensure the involved
> docker images support the necessary features.
> >>>> >> >>
> >>>> >> >> Yes.
> >>>> >
> >>>> >
> >>>> > I believe capabilities do exist on a Pipeline and it informs
> runners about new types of fields to be aware of either within Components
> or on the Pipeline object itself but for this discussion it makes sense
> that an environment would store most "capabilities" related to execution.
> >>>> >
> >>>> >> [snip]
> >>>> >
> >>>> > As for the proto clean-ups, the scope is to cover almost all things
> needed for execution now and to follow-up with optional transforms,
> payloads, and coders later which would exclude job managment APIs and
> artifact staging. A formal enumeration would be useful here. Also, we
> should provide formal guidance about adding new fields, adding new types of
> transforms, new types of proto messages, ... (best to describe this on a
> case by case basis as to how people are trying to modify the protos and
> evolve this guidance over time).
> >>>>
> >>>> What we need is the ability for (1) runners to reject future pipelines
> >>>> they cannot faithfully execute and (2) runners to be able to take
> >>>> advantage of advanced features/protocols when interacting with those
> >>>> SDKs that understand them while avoiding them for older (or newer)
> >>>> SDKs that don't. Let's call (1) (hard) requirements and (2) (optional)
> >>>> capabilities.
> >>>>
> >>>> Where possible, I think this is best expressed inherently in the set
> >>>> of transform (and possibly other component) URNs. For example, when an
> >>>> SDK uses a combine_per_key composite, that's a signal that it
> >>>> understands the various related combine_* transforms. Similarly, a
> >>>> pipeline with a test_stream URN would be rejected by pipelines not
> >>>> recognizing/supporting this primitive. However, this is not always
> >>>> possible, e.g. for (1) we have the aforementioned boolean flags on
> >>>> ParDo and for (2) we have features like large iterable and progress
> >>>> support.
> >>>>
> >>>> For (1) we have to enumerate now everywhere a runner must look a far
> >>>> into the future as we want to remain backwards compatible. This is why
> >>>> I suggested putting something on the pipeline itself, but we could
> >>>> (likely in addition) add it to Transform and/or ParDoPayload if we
> >>>> think that'd be useful now. (Note that a future pipeline-level
> >>>> requirement could be "inspect (previously non-existent) requirements
> >>>> field attached to objects of type X.")
> >>>>
> >>>> For (2) I think adding a capabilities field to the environment for now
> >>>> makes the most sense, and as it's optional to inspect them adding it
> >>>> elsewhere if needed is backwards compatible. (The motivation to do it
> >>>> now is that there are some capabilities that we'd like to enumerate
> >>>> now rather than make part of the minimal set of things an SDK must
> >>>> support.)
> >>>>
> >>
> >> Agree on the separation of requirements from capabilities where
> requirements is a set of MUST understand while capabilities are a set of
> MAY understand.
> >>
> >>>>
> >>>> > All in all, I think "capabilities" is about informing a runner
> about what they should know about and what they are allowed to do. If we go
> with a list of "capabilities", we could always add a "parameterized
> capabilities" urn which would tell runners they need to also look at some
> other field.
> >>>>
> >>>> Good point. That lets us keep it as a list for now. (The risk is that
> >>>> it makes possible the bug of populating parameters without adding the
> >>>> required notification to the list.)
> >>>>
> >>>> > I also believe capabilities should NOT be "inherited". For example
> if we define capabilities on a ParDoPayload, and on a PTransform and on
> Environment, then ParDoPayload capabilities shouldn't be copied to
> PTransform and PTransform specific capabilities shouldn't be copied to the
> Environment. My reasoning about this is that some "capabilities" can only
> be scoped to a single ParDoPayload or a single PTransform and wouldn't
> apply generally everywhere. The best example I could think of is that
> Environment A supports progress reporting while Environment B doesn't so it
> wouldn't have made sense to say the "Pipeline" supports progress reporting.
> >>>> >
> >>>> > Are capabilities strictly different from "resources" (transform
> needs python package X) or "execution hints" (e.g. deploy on machines that
> have GPUs, some generic but mostly runner specific hints)? At first glance
> I would say yes.
> >>>>
> >>>> Agreed.
>

Reply via email to