Expecting runners to populate, or override, SDK-level pipeline options
isn't a great idea, particularly in a scenario that affects
correctness.

The main concern is discoverability of a subtle API like this -- there's
little chance that somebody writing a new runner would stumble across it
and do the right thing. It would be much better to make the expectations
on a runner explicit, say, via a runner-provided "context" API. I'd stay
away from a pipeline option with a default value.
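
To make that concrete, here is a rough sketch of what I mean by a
"context" API -- every name below is made up for illustration; nothing
like this exists in the SDK today:

  /**
   * Hypothetical, for discussion only. A runner would be required to supply
   * an implementation before user code runs, so the contract is visible in
   * the type system rather than buried in a pipeline option default.
   */
  public interface RunnerExecutionContext {
    /** A value the runner guarantees to have populated at execution time. */
    String getExecutionToken();
  }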

The other contentious topic here is the use of a job-level or
execution-level identifier. This easily becomes ambiguous in the presence
of Flink's savepoints, Dataflow's update, fast re-execution, canary vs.
production pipelines, cross-job optimizations, etc. I think we'd be better
off with a transform-level nonce than a job-level one.
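
For illustration only, one way a transform-level nonce could look (the
names and the placement are just a sketch, not a proposal for the SDK):

  import java.util.UUID;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PDone;

  // The nonce is minted per applied transform at construction time, so two
  // applications of the same transform (e.g. in different parts of a graph)
  // never collide on a shared job-level identifier.
  class WriteWithNonce extends PTransform<PCollection<String>, PDone> {
    private final String nonce = UUID.randomUUID().toString();

    @Override
    public PDone expand(PCollection<String> input) {
      // The nonce would key idempotent writes / deduplication in the sink;
      // the actual write is omitted here.
      return PDone.in(input.getPipeline());
    }
  }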

Finally, the real solution is to enhance the model and make such
functionality available to everyone, e.g., roughly "init" + "checkpoint" +
"side input to Source / SplittableDoFn / composable IO".

--

Practically, to solve the problem at hand quickly, I'd be in favor of a
context-based approach.
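
Roughly, the runner side of that could look like the following -- again,
every name here is made up; the only point is that a new runner author is
forced by the signature to decide what to provide:

  import java.util.UUID;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.PipelineRunner;

  // Hypothetical wiring, building on the RunnerExecutionContext sketch above.
  public abstract class ExampleRunner extends PipelineRunner<PipelineResult> {

    @Override
    public PipelineResult run(Pipeline pipeline) {
      // The runner chooses the value explicitly; there is no silent default.
      RunnerExecutionContext context = () -> UUID.randomUUID().toString();
      return runWith(pipeline, context);
    }

    /** Runner-specific execution, given an explicitly supplied context. */
    protected abstract PipelineResult runWith(
        Pipeline pipeline, RunnerExecutionContext context);
  }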

On Thu, Jan 19, 2017 at 10:22 AM, Sam McVeety <[email protected]>
wrote:

> Hi folks, I'm looking for feedback on whether the following is a reasonable
> approach to handling ValueProviders that are intended to be populated at
> runtime by a given Runner (e.g. a Dataflow job ID, which is a GUID from the
> service).  Two potential pieces of a solution:
>
> 1. Annotate such parameters with @RunnerProvided, which results in an
> Exception if the user manually tries to set the parameter.
>
> 2. Allow for a DefaultValueFactory to be present for the set of Runners
> that do not override the parameter.
>
> Best,
> Sam
>
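
(For reference, my reading of the two proposed pieces in code form.
@RunnerProvided would be a new annotation; Default.InstanceFactory and
DefaultValueFactory already exist, though I'm assuming here that a factory
default is wrapped for a ValueProvider-typed getter the same way literal
defaults are:

  import java.util.UUID;
  import org.apache.beam.sdk.options.Default;
  import org.apache.beam.sdk.options.DefaultValueFactory;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.ValueProvider;

  public interface JobIdOptions extends PipelineOptions {

    // (1) Hypothetical: rejects user-supplied values; populated by the runner.
    // @RunnerProvided
    // (2) Fallback for runners that do not override the value.
    @Default.InstanceFactory(RandomJobIdFactory.class)
    ValueProvider<String> getJobId();

    void setJobId(ValueProvider<String> value);

    /** Used only when no runner overrides the job id. */
    class RandomJobIdFactory implements DefaultValueFactory<String> {
      @Override
      public String create(PipelineOptions options) {
        return UUID.randomUUID().toString();
      }
    }
  })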
