On Mon, Feb 10, 2020 at 7:35 PM Kenneth Knowles <k...@apache.org> wrote:
>
> On the runner requirements side: if you have such a list at the pipeline 
> level, it is an opportunity for the list to be inconsistent with the contents 
> of the pipeline. For example, if a DoFn is marked "requires stable input" but 
> not listed at the pipeline level, then the runner may run it without ensuring 
> its input is stable.

Yes. Listing this feature at the top level, if used, would be part of
the contract. The problem we're trying to solve here is that the
runner wouldn't know about the field used to mark a DoFn as "requires
stable input." An alternative would be to give this kind of ParDo a
different URN, but that would result in a cross product of URNs, one
per combination of supported features.
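
To make that concrete, here's a rough sketch in Python, with dicts
standing in for the protos (the URN strings are hypothetical, just
for illustration):

    # Option A: a distinct URN per feature combination -- a cross
    # product that grows with every new feature.
    PARDO_URNS = [
        "beam:transform:pardo:v1",
        "beam:transform:pardo_stable_input:v1",
        "beam:transform:pardo_stable_input_stateful:v1",
        # ... one per combination of features
    ]

    # Option B: a single ParDo URN, plus a top-level list of
    # requirement URNs the runner must understand to run the pipeline.
    pipeline_proto = {
        "components": {
            "transforms": {"t1": {"urn": "beam:transform:pardo:v1"}},
        },
        "requirements": ["beam:requirement:pardo:stable_input:v1"],
    }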

Rather than attaching it to the pipeline object, we could attach it to
the transform. (But if there are ever extensions that don't belong to
transforms, we'd be out of luck. It'd be even worse to attach it to
the ParDoPayload, as then we'd need one on CombinePayload, etc., just
in case.) This is why I was leaning towards just putting it at the
top.

I agree about the potential for inconsistency. As much as possible
I'd rather extend things in a way that a non-comprehending runner
would intrinsically reject, but I'm not sure how to do that when
introducing new constraints on existing components like this. I'm
open to other suggestions.
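
One way to get that fail-fast behavior with a requirements list: the
runner rejects any pipeline declaring a requirement URN it doesn't
recognize. A minimal sketch, assuming the hypothetical top-level list
from above:

    SUPPORTED_REQUIREMENTS = {
        "beam:requirement:pardo:stable_input:v1",
    }

    def validate_requirements(pipeline_proto):
        # Fail fast if the pipeline declares any requirement this
        # runner does not understand.
        unknown = (set(pipeline_proto.get("requirements", []))
                   - SUPPORTED_REQUIREMENTS)
        if unknown:
            raise ValueError(
                "Pipeline declares requirements this runner does "
                "not support: %s" % sorted(unknown))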

> On the SDK requirements side: the constructing SDK owns the Environment proto 
> completely, so it is in a position to ensure the involved docker images 
> support the necessary features.

Yes.

> Is it sufficient for each SDK involved in a cross-language expansion to 
> validate that it understands the inputs? For example if Python sends a 
> PCollection with a pickle coder to Java as input to an expansion then it will 
> fail. And conversely if the returned subgraph outputs a PCollection with a 
> Java custom coder.

Yes. It's possible to imagine some negotiation about inserting
length-prefix coders (e.g. a Count transform could act on any opaque
data as long as it can delimit it), but that's still TBD.
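
To sketch what such a negotiation might look like: the expansion
could wrap a coder the receiving SDK doesn't know in a length-prefix
coder, so a transform like Count can still treat each element as
opaque, delimited bytes. The helper below is hypothetical;
beam:coder:length_prefix:v1 is the standard length-prefix coder URN.

    LENGTH_PREFIX_URN = "beam:coder:length_prefix:v1"

    def ensure_delimitable(coder_spec, known_coder_urns):
        # Hypothetical negotiation step: if the receiving SDK doesn't
        # understand the coder, wrap it in a length-prefix coder so
        # elements can still be delimited (though not decoded).
        if coder_spec["urn"] in known_coder_urns:
            return coder_spec
        return {"urn": LENGTH_PREFIX_URN, "components": [coder_spec]}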

> More complex use cases that I can imagine all seem futuristic and unlikely to 
> come to pass (Python passes a pickled DoFn to the Java expansion service 
> which inserts it into the graph in a way where a Java-based transform would 
> have to invoke it on every element, etc.)

Some transforms are configured with UDFs of this form...but we'll
cross that bridge when we get to it.

>
> Kenn
>
> On Mon, Feb 10, 2020 at 5:03 PM Brian Hulette <bhule...@google.com> wrote:
>>
>> I like the capabilities/requirements idea. Would these capabilities be at a
>> level where it would make sense to document them in the capabilities matrix?
>> I.e., could the URNs be the values of "X" that Pablo described here [1]?
>>
>> Brian
>>
>> [1] 
>> https://lists.apache.org/thread.html/e93ac64d484551d61e559e1ba0cf4a15b760e69d74c5b1d0549ff74f%40%3Cdev.beam.apache.org%3E
>>
>> On Mon, Feb 10, 2020 at 3:55 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>> With an eye towards cross-language (which includes cross-version)
>>> pipelines and services (specifically looking at Dataflow) supporting
>>> portable pipelines, there's been a desire to stabilize the portability
>>> protos. There are currently many cleanups we'd like to do [1] (some
>>> essential, others nice to have); are there others that people would
>>> like to see?
>>>
>>> Of course we would like it to be possible for the FnAPI and Beam
>>> itself to continue to evolve. Most of this can be handled by runners
>>> understanding various transform URNs, but not all. (Examples that
>>> come to mind are support for large iterables [2] and the requirement
>>> to observe and respect new fields on a PTransform or its payloads
>>> [3].) One proposal for this is to add capabilities and/or
>>> requirements. An environment (corresponding generally to an SDK) could
>>> advertise various capabilities (as a list or map of URNs) which a
>>> runner can take advantage of without requiring all SDKs to support all
>>> features at the same time. For the other way around, we need a way of
>>> marking something that a runner must reject if it does not understand
>>> it. This could be a set of requirements (again, a list or map of URNs)
>>> that designate capabilities the runner must at least understand in
>>> order to faithfully execute this pipeline. (These could be attached
>>> to a transform or the pipeline itself.) Do these sound like reasonable
>>> additions? Also, would they ever need to be parameterized (map), or
>>> would a list suffice?
>>>
>>> [1] BEAM-2645, BEAM-2822, BEAM-3203, BEAM-3221, BEAM-3223, BEAM-3227,
>>> BEAM-3576, BEAM-3577, BEAM-3595, BEAM-4150, BEAM-4180, BEAM-4374,
>>> BEAM-5391, BEAM-5649, BEAM-8172, BEAM-8201, BEAM-8271, BEAM-8373,
>>> BEAM-8539, BEAM-8804, BEAM-9229, BEAM-9262, BEAM-9266, and BEAM-9272
>>> [2] 
>>> https://lists.apache.org/thread.html/70cac361b659516933c505b513d43986c25c13da59eabfd28457f1f2@%3Cdev.beam.apache.org%3E
>>> [3] 
>>> https://lists.apache.org/thread.html/rdc57f240069c0807eae87ed2ff13d3ee503bc18e5f906d05624e6433%40%3Cdev.beam.apache.org%3E
