Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Robert Burke Tue, 13 Nov 2018 10:05:31 -0800

+1 to Option 3

I'd rather have each SDK have a single point of well defined complexity to
do something general, than have to make tiny but simple changes. Less toil
and maintenance in the long run per SDK.


Similarly I don't have time to make it happen right now.

On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <[email protected]> wrote:

> Discovering options from the job server would be the only way to perform
> full validation (and provide upfront help to the user).
>
> The runner cannot perform full validation, since it is not aware of the
> user and SDK options (that it has to forward to the SDK worker).
>
> Special runner options flag to forward unknown options also wouldn't fully
> solve the problem (besides being subject to change in the future). Let's
> say runner understands --fancy-int-option and the user repeats that option
> multiple times. Not knowing the type of option, the SDK will pass it as a
> list and the runner will fail.
>
> Replicating SDK options is a workaround for known runners but it really
> goes against the idea of portability (making assumptions outside the API
> contract). We already have runners implemented outside of Beam and hope for
> the ecosystem to grow. What we do for options should also work for those
> runners.
>
> I'm with Luke here that options discovery provides the best user
> experience and can address the other issues. Even the scenario of multiple
> intermediate runners could be addressed by forwarding the unparsed options
> with the discovery call. I don't see SDK implementation complexity as a
> significant drawback so far.
>
> Thomas
>
>
> On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <[email protected]> wrote:
>
>>
>> On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <[email protected]>
>> wrote:
>>
>>> Thank you Robert and Lukasz for your points.
>>>
>>> > Note that I believe that we will want to have multiple URLs to support
>>> cross language pipelines since we will want to be able to ask other SDK
>>> languages/versions for their "list" of supported PipelineOptions.
>>>
>>> Why is that? The Runner itself is the source of truth for its options.
>>>
>>
>> Because other languages (or even different versions of the same language)
>> may have their own options. For example, the Go SDK talks to a Java service
>> which applies a SQL transform and returns the expanded form (this may
>> require knowledge of some options like credentials for the filesystem, ...)
>> and then talks to a Python service that performs another transform
>> expansion. Finally the pipeline containing Go, Java and Python transforms
>> is submitted to a runner and it performs its own internal
>> replacements/expansions related to executing the pipeline.
>>
>>
>>> Everything else is SDK-related and should be validated there.
>>>
>>> I imagined the process to go like this:
>>>
>>>    a) Parse options to find JobServer URL
>>>    a) Retrieve options from JobServer
>>>    c) Parse all options
>>>    ...continue as always...
>>>
>>> An option is just represented by a name and a type. There is nothing
>>> more to it, at least as of now. So it should be possible to parse them
>>> in the SDK without much further work.
>>>
>>> Nevertheless, I agree with your sentiment, Robert. The "runner_option"
>>> flag would prevent additional complexity. I still don't prefer it
>>> because it's not nice from an end user perspective. If we were to
>>> implement it, I would definitely go for the "option promotion" which you
>>> mentioned.
>>>
>>> I hadn't thought about delegating runners, although the PortableRunner
>>> is basically a delegating Runner. If that was an important feature, I
>>> suppose the "runner_option" would be the preferred way.
>>>
>>> All in all, since there doesn't seem to be an excitement to implement
>>> JobServer option retrieval and we will need the help of all SDK
>>> developers, "runner_option" seems to be the more likely path.
>>>
>>
>> I would say its a lack of time for people to improve this story over
>> others but it can be revisited at some point in the future and I agree that
>> using --runner_option as an interim provides value.
>>
>>
>>>
>>> -Max
>>>
>>> On 08.11.18 21:50, Lukasz Cwik wrote:
>>> > The purpose of the spec would be to provide the names, type and
>>> > descriptions of the options. We don't need anything beyond the JSON
>>> > types (string, number, bool, object, list) because the only ambiguity
>>> we
>>> > get is how do we parse command line string into the JSON type (and
>>> that
>>> > ambiguity is actually only between string and non-string since all the
>>> > other JSON types are unambiguous).
>>> >
>>> > Also, I believe the flow would be
>>> > 1) Parse options
>>> >    a) Find the URL from args specified and/or additional methods on
>>> > PipelineOptions that exposes a programmatic way to set the URL during
>>> > parsing.
>>> >    b) Query URL for option specs
>>> >    c) Parse the remainder of the options
>>> > 2) Construct pipeline
>>> > 3) Choose runner
>>> > 4) Submit job to runner
>>> >
>>> > Note that I believe that we will want to have multiple URLs to support
>>> > cross language pipelines since we will want to be able to ask other
>>> SDK
>>> > languages/versions for their "list" of supported PipelineOptions.
>>> >
>>> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <[email protected]
>>> > <mailto:[email protected]>> wrote:
>>> >
>>> >     There's two questions here:
>>> >
>>> >     (A) What do we do in the short term?
>>> >
>>> >     I think adding every runner option to every SDK is not sustainable
>>> >     (n*m work, assuming every SDK knows about every runner), and
>>> having a
>>> >     patchwork of options that were added as one-offs to SDKs is not
>>> >     desirable either. Furthermore, it seems difficult to parse unknown
>>> >     options as if they were valid options, so my preference here would
>>> be
>>> >     to just use a special runner_option flag. (One could also pass a
>>> set
>>> >     of unparsed/unvalidated runner options to the runner, even if
>>> they're
>>> >     not distinguished for the user, and runners (or any intermediates)
>>> >     could run a "promote" operation that promotes any of these unknowns
>>> >     that they recognize to real options before further processing. The
>>> >     parsing would be done as repeated-string, and not be intermingled
>>> with
>>> >     the actually validated options. This is essential a variant of
>>> >     option 1.)
>>> >
>>> >     (B) What do do in the long term? While the JobServer approach
>>> sounds
>>> >     nice, I think it introduces a lot of complexity (we have too much
>>> of
>>> >     that already) and still doesn't completely solve the problem. In
>>> >     particular, it changes the flow from
>>> >
>>> >     1. Parse options
>>> >     2. Construct pipeline
>>> >     3. Choose runner
>>> >     4. Submit job to runner
>>> >
>>> >     to
>>> >
>>> >     1. Parse options
>>> >     2. Construct pipeline
>>> >     3. Choose runner
>>> >     4a. Query runner for option specs
>>> >     4b. Re-parse options
>>> >     4c. Submit job to runner
>>> >
>>> >     In particular, doing 4b in the SDK rather than just let the runner
>>> >     itself do the validation as part of (4) doesn't save much and
>>> forces
>>> >     us to come up with a (probably incomplete) spec as to how to define
>>> >     options, their types, and their validations. It also means that a
>>> >     delegating runner must choose and interact with its downstream
>>> >     runner(s) synchronously, else we haven't actually solved the issue.
>>> >
>>> >     For these reasons, I don't think we even want to go with the
>>> JobServer
>>> >     approach in the long term, which has bearing on (A).
>>> >
>>> >     - Robert
>>> >
>>> >
>>> >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <[email protected]
>>> >     <mailto:[email protected]>> wrote:
>>> >      >
>>> >      > +1
>>> >      >
>>> >      > If the preferred approach is to eventually have the JobServer
>>> >     serve the
>>> >      > options, then the best intermediate solution is to replicate
>>> common
>>> >      > options in the SDKs.
>>> >      >
>>> >      > If we went down the "--runner_option" path, we would end up with
>>> >      > multiple ways of specifying the same options. We would
>>> eventually
>>> >     have
>>> >      > to deprecate "runner options" once we have the JobServer
>>> >     approach. I'd
>>> >      > like to avoid that.
>>> >      >
>>> >      > For the upcoming release we can revert the changes again and
>>> add the
>>> >      > most common missing options to the SDKs. Then hopefully we
>>> should
>>> >     have
>>> >      > fetching implemented for the release after.
>>> >      >
>>> >      > Do you think that is feasible?
>>> >      >
>>> >      > Thanks,
>>> >      > Max
>>> >      >
>>> >      > On 30.10.18 23:00, Lukasz Cwik wrote:
>>> >      > > I still like #3 the most, just can't devote the time to get it
>>> >     done.
>>> >      > >
>>> >      > > Instead of going with a fully implemented #3, we could
>>> hardcode
>>> >     the a
>>> >      > > subset of options and types within each SDK until the job
>>> server is
>>> >      > > ready to provide this information and then migrate to the
>>> >     "full" list.
>>> >      > > This would be an easy path for SDKs to take on. They could
>>> >     "know" of a
>>> >      > > few well known options, and if they want to support all
>>> >     options, they
>>> >      > > implement the integration with the job server.
>>> >      > >
>>> >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
>>> >     <[email protected] <mailto:[email protected]>
>>> >      > > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>> >      > >
>>> >      > >      > I would prefer we don't introduce a (quirky) way of
>>> passing
>>> >      > >     unknown options that forces users to type JSON into the
>>> >     command line
>>> >      > >     (or similar acrobatics)
>>> >      > >     Same here, the JSON approach seems technically nice but
>>> too
>>> >     bulky
>>> >      > >     for users.
>>> >      > >
>>> >      > >      > To someone wanting to run a pipeline, all options are
>>> >     equally
>>> >      > >     important, whether they are application specific, SDK
>>> >     specific or
>>> >      > >     runner specific.
>>> >      > >
>>> >      > >     I'm also reluctant to force users to use
>>> `--runner_option=`
>>> >     because the
>>> >      > >     division into "Runner" options and other options seems
>>> >     rather arbitrary
>>> >      > >     to users. Most built-in options are also Runner-related.
>>> >      > >
>>> >      > >      > It should be possible to *optionally* qualify/scope (to
>>> >     cover
>>> >      > >     cases where there is ambiguity), but otherwise I prefer
>>> the
>>> >     format
>>> >      > >     we currently have.
>>> >      > >
>>> >      > >     Yes, namespacing is a problem. What happens if the user
>>> >     defines a
>>> >      > >     custom
>>> >      > >     PipelineOption which clashes with one of the builtin ones?
>>> >     If both are
>>> >      > >
>>> >      > >     set, which one is actually applied?
>>> >      > >
>>> >      > >
>>> >      > > Note that PipelineOptions so far has been treating name
>>> >     equality to mean
>>> >      > > option equality and the Java implementation has a bunch of
>>> >     strict checks
>>> >      > > to make sure that default values aren't used for duplicate
>>> >     definitions,
>>> >      > > they have the same type, etc...
>>> >      > > With 1), you fail the job if the runner can't understand your
>>> >     option
>>> >      > > because its not represented the same way. User then needs to
>>> fix-up
>>> >      > > their declaration of the option name.
>>> >      > > With 2), there are no name conflicts, the SDK will need to
>>> >     validate that
>>> >      > > the option isn't set in both formats and error out if it is
>>> before
>>> >      > > pipeline submission time.
>>> >      > > With 3), you can prefetch all the options and error out to
>>> the user
>>> >      > > during argument parsing time.
>>> >      > >
>>> >      > >
>>> >      > >
>>> >      > >     Here is a summary of the possible paths going forward:
>>> >      > >
>>> >      > >
>>> >      > >     1) Validate PipelineOptions at Runner side
>>> >      > >     ==========================================
>>> >      > >
>>> >      > >     The main issue raised here was that we want to move away
>>> >     from parsing
>>> >      > >     arguments which look like options without validating them.
>>> >     An easy fix
>>> >      > >     would be to actually validate them on the Runner side.
>>> This
>>> >     could be
>>> >      > >     done by changing the deserialization code of
>>> >     PipelineOptions which so
>>> >      > >     far ignores unknown JSON options.
>>> >      > >
>>> >      > >     See: PipelineOptionsTranslation.fromProto(Struct
>>> protoOptions)
>>> >      > >
>>> >      > >     Actually, this wouldn't work for user-defined
>>> >     PipelineOptions because
>>> >      > >     they might not be known to the Runner (if they are defined
>>> >     in Python).
>>> >      > >
>>> >      > >
>>> >      > >     2) Introduce a Runner-Option Flag
>>> >      > >     =================================
>>> >      > >
>>> >      > >     In this approach we would try to add as many pipeline
>>> >     options for a
>>> >      > >     Runner to the SDK, but allow additional Runner options to
>>> >     be passed
>>> >      > >     using the `--runner-option=key=val` flag. The Runner, like
>>> >     in 1), would
>>> >      > >     have to ensure validation. I think this has been the most
>>> >     favored
>>> >      > >     way so
>>> >      > >     far. Going forward, that means that `--parallelism=4` and
>>> >      > >     `--runner-option=parallelism=4` will have the same effect
>>> >     for the Flink
>>> >      > >     Runner.
>>> >      > >
>>> >      > >
>>> >      > >     3) Implement Fetching of Options from JobServer
>>> >      > >     ===============================================
>>> >      > >
>>> >      > >     The options are retrieved from the JobServer before
>>> >     submitting the
>>> >      > >     pipeline. I think this would be ideal but, as mentioned
>>> >     before, it
>>> >      > >     increases the complexity for implementing new SDKs and
>>> >     might overall
>>> >      > >     just not be worth the effort.
>>> >      > >
>>> >      > >
>>> >      > >     What do you think? I'd implement 2) for the next release,
>>> >     unless there
>>> >      > >     are advocates for a different approach.
>>> >      > >
>>> >      > >     Cheers,
>>> >      > >     Max
>>> >
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to