Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Maximilian Michels Wed, 30 Jan 2019 03:27:13 -0800

Thomas was so kind to implement Option 3) inhttps://github.com/apache/beam/pull/7597

Heads-up to the Go SDK people to eventually implement the newDescribePipelineOptionsRequest. Tracking issue:https://issues.apache.org/jira/browse/BEAM-6549

Also related, we will have to follow-up with proper scoping of pipeline options:https://issues.apache.org/jira/browse/BEAM-6537


Thanks,
Max

On 13.11.18 19:05, Robert Burke wrote:

+1 to Option 3

I'd rather have each SDK have a single point of well defined complexity to dosomething general, than have to make tiny but simple changes. Less toil andmaintenance in the long run per SDK.


Similarly I don't have time to make it happen right now.

On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <[email protected]<mailto:[email protected]>> wrote:


    Discovering options from the job server would be the only way to perform
    full validation (and provide upfront help to the user).

    The runner cannot perform full validation, since it is not aware of the user
    and SDK options (that it has to forward to the SDK worker).

    Special runner options flag to forward unknown options also wouldn't fully
    solve the problem (besides being subject to change in the future). Let's say
    runner understands --fancy-int-option and the user repeats that option
    multiple times. Not knowing the type of option, the SDK will pass it as a
    list and the runner will fail.

    Replicating SDK options is a workaround for known runners but it really goes
    against the idea of portability (making assumptions outside the API
    contract). We already have runners implemented outside of Beam and hope for
    the ecosystem to grow. What we do for options should also work for those
    runners.

    I'm with Luke here that options discovery provides the best user experience
    and can address the other issues. Even the scenario of multiple intermediate
    runners could be addressed by forwarding the unparsed options with the
    discovery call. I don't see SDK implementation complexity as a significant
    drawback so far.

    Thomas


    On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <[email protected]
    <mailto:[email protected]>> wrote:


        On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <[email protected]
        <mailto:[email protected]>> wrote:

            Thank you Robert and Lukasz for your points.

             > Note that I believe that we will want to have multiple URLs to
            support cross language pipelines since we will want to be able to
            ask other SDK languages/versions for their "list" of supported
            PipelineOptions.

            Why is that? The Runner itself is the source of truth for its 
options.


        Because other languages (or even different versions of the same
        language) may have their own options. For example, the Go SDK talks to a
        Java service which applies a SQL transform and returns the expanded form
        (this may require knowledge of some options like credentials for the
        filesystem, ...) and then talks to a Python service that performs
        another transform expansion. Finally the pipeline containing Go, Java
        and Python transforms is submitted to a runner and it performs its own
        internal replacements/expansions related to executing the pipeline.

            Everything else is SDK-related and should be validated there.

            I imagined the process to go like this:

                a) Parse options to find JobServer URL
                a) Retrieve options from JobServer
                c) Parse all options
                ...continue as always...

            An option is just represented by a name and a type. There is nothing
            more to it, at least as of now. So it should be possible to parse 
them
            in the SDK without much further work.

            Nevertheless, I agree with your sentiment, Robert. The 
"runner_option"
            flag would prevent additional complexity. I still don't prefer it
            because it's not nice from an end user perspective. If we were to
            implement it, I would definitely go for the "option promotion" which
            you
            mentioned.

            I hadn't thought about delegating runners, although the 
PortableRunner
            is basically a delegating Runner. If that was an important feature, 
I
            suppose the "runner_option" would be the preferred way.

            All in all, since there doesn't seem to be an excitement to 
implement
            JobServer option retrieval and we will need the help of all SDK
            developers, "runner_option" seems to be the more likely path.


        I would say its a lack of time for people to improve this story over
        others but it can be revisited at some point in the future and I agree
        that using --runner_option as an interim provides value.


            -Max

            On 08.11.18 21:50, Lukasz Cwik wrote:
             > The purpose of the spec would be to provide the names, type and
             > descriptions of the options. We don't need anything beyond the 
JSON
             > types (string, number, bool, object, list) because the only
            ambiguity we
             > get is how do we parse command line string into the JSON type
            (and that
             > ambiguity is actually only between string and non-string since
            all the
             > other JSON types are unambiguous).
             >
             > Also, I believe the flow would be
             > 1) Parse options
             >    a) Find the URL from args specified and/or additional methods 
on
             > PipelineOptions that exposes a programmatic way to set the URL
            during
             > parsing.
             >    b) Query URL for option specs
             >    c) Parse the remainder of the options
             > 2) Construct pipeline
             > 3) Choose runner
             > 4) Submit job to runner
             >
             > Note that I believe that we will want to have multiple URLs to
            support
             > cross language pipelines since we will want to be able to ask
            other SDK
             > languages/versions for their "list" of supported PipelineOptions.
             >
             > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw
            <[email protected] <mailto:[email protected]>
             > <mailto:[email protected] <mailto:[email protected]>>> wrote:
             >
             >     There's two questions here:
             >
             >     (A) What do we do in the short term?
             >
             >     I think adding every runner option to every SDK is not
            sustainable
             >     (n*m work, assuming every SDK knows about every runner), and
            having a
             >     patchwork of options that were added as one-offs to SDKs is 
not
             >     desirable either. Furthermore, it seems difficult to parse
            unknown
             >     options as if they were valid options, so my preference here
            would be
             >     to just use a special runner_option flag. (One could also
            pass a set
             >     of unparsed/unvalidated runner options to the runner, even if
            they're
             >     not distinguished for the user, and runners (or any
            intermediates)
             >     could run a "promote" operation that promotes any of these
            unknowns
             >     that they recognize to real options before further
            processing. The
             >     parsing would be done as repeated-string, and not be
            intermingled with
             >     the actually validated options. This is essential a variant 
of
             >     option 1.)
             >
             >     (B) What do do in the long term? While the JobServer approach
            sounds
             >     nice, I think it introduces a lot of complexity (we have too
            much of
             >     that already) and still doesn't completely solve the 
problem. In
             >     particular, it changes the flow from
             >
             >     1. Parse options
             >     2. Construct pipeline
             >     3. Choose runner
             >     4. Submit job to runner
             >
             >     to
             >
             >     1. Parse options
             >     2. Construct pipeline
             >     3. Choose runner
             >     4a. Query runner for option specs
             >     4b. Re-parse options
             >     4c. Submit job to runner
             >
             >     In particular, doing 4b in the SDK rather than just let the
            runner
             >     itself do the validation as part of (4) doesn't save much and
            forces
             >     us to come up with a (probably incomplete) spec as to how to
            define
             >     options, their types, and their validations. It also means 
that a
             >     delegating runner must choose and interact with its 
downstream
             >     runner(s) synchronously, else we haven't actually solved the
            issue.
             >
             >     For these reasons, I don't think we even want to go with the
            JobServer
             >     approach in the long term, which has bearing on (A).
             >
             >     - Robert
             >
             >
             >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels
            <[email protected] <mailto:[email protected]>
             >     <mailto:[email protected] <mailto:[email protected]>>> wrote:
             >      >
             >      > +1
             >      >
             >      > If the preferred approach is to eventually have the 
JobServer
             >     serve the
             >      > options, then the best intermediate solution is to
            replicate common
             >      > options in the SDKs.
             >      >
             >      > If we went down the "--runner_option" path, we would end
            up with
             >      > multiple ways of specifying the same options. We would
            eventually
             >     have
             >      > to deprecate "runner options" once we have the JobServer
             >     approach. I'd
             >      > like to avoid that.
             >      >
             >      > For the upcoming release we can revert the changes again
            and add the
             >      > most common missing options to the SDKs. Then hopefully we
            should
             >     have
             >      > fetching implemented for the release after.
             >      >
             >      > Do you think that is feasible?
             >      >
             >      > Thanks,
             >      > Max
             >      >
             >      > On 30.10.18 23:00, Lukasz Cwik wrote:
             >      > > I still like #3 the most, just can't devote the time to
            get it
             >     done.
             >      > >
             >      > > Instead of going with a fully implemented #3, we could
            hardcode
             >     the a
             >      > > subset of options and types within each SDK until the
            job server is
             >      > > ready to provide this information and then migrate to 
the
             >     "full" list.
             >      > > This would be an easy path for SDKs to take on. They 
could
             >     "know" of a
             >      > > few well known options, and if they want to support all
             >     options, they
             >      > > implement the integration with the job server.
             >      > >
             >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
             >     <[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>
             >      > > <mailto:[email protected] <mailto:[email protected]>
            <mailto:[email protected] <mailto:[email protected]>>>> wrote:
             >      > >
             >      > >      > I would prefer we don't introduce a (quirky) way
            of passing
             >      > >     unknown options that forces users to type JSON into 
the
             >     command line
             >      > >     (or similar acrobatics)
             >      > >     Same here, the JSON approach seems technically nice
            but too
             >     bulky
             >      > >     for users.
             >      > >
             >      > >      > To someone wanting to run a pipeline, all 
options are
             >     equally
             >      > >     important, whether they are application specific, 
SDK
             >     specific or
             >      > >     runner specific.
             >      > >
             >      > >     I'm also reluctant to force users to use
            `--runner_option=`
             >     because the
             >      > >     division into "Runner" options and other options 
seems
             >     rather arbitrary
             >      > >     to users. Most built-in options are also 
Runner-related.
             >      > >
             >      > >      > It should be possible to *optionally*
            qualify/scope (to
             >     cover
             >      > >     cases where there is ambiguity), but otherwise I
            prefer the
             >     format
             >      > >     we currently have.
             >      > >
             >      > >     Yes, namespacing is a problem. What happens if the 
user
             >     defines a
             >      > >     custom
             >      > >     PipelineOption which clashes with one of the builtin
            ones?
             >     If both are
             >      > >
             >      > >     set, which one is actually applied?
             >      > >
             >      > >
             >      > > Note that PipelineOptions so far has been treating name
             >     equality to mean
             >      > > option equality and the Java implementation has a bunch 
of
             >     strict checks
             >      > > to make sure that default values aren't used for 
duplicate
             >     definitions,
             >      > > they have the same type, etc...
             >      > > With 1), you fail the job if the runner can't understand
            your
             >     option
             >      > > because its not represented the same way. User then
            needs to fix-up
             >      > > their declaration of the option name.
             >      > > With 2), there are no name conflicts, the SDK will need 
to
             >     validate that
             >      > > the option isn't set in both formats and error out if it
            is before
             >      > > pipeline submission time.
             >      > > With 3), you can prefetch all the options and error out
            to the user
             >      > > during argument parsing time.
             >      > >
             >      > >
             >      > >
             >      > >     Here is a summary of the possible paths going 
forward:
             >      > >
             >      > >
             >      > >     1) Validate PipelineOptions at Runner side
             >      > >     ==========================================
             >      > >
             >      > >     The main issue raised here was that we want to move 
away
             >     from parsing
             >      > >     arguments which look like options without validating
            them.
             >     An easy fix
             >      > >     would be to actually validate them on the Runner
            side. This
             >     could be
             >      > >     done by changing the deserialization code of
             >     PipelineOptions which so
             >      > >     far ignores unknown JSON options.
             >      > >
             >      > >     See: PipelineOptionsTranslation.fromProto(Struct
            protoOptions)
             >      > >
             >      > >     Actually, this wouldn't work for user-defined
             >     PipelineOptions because
             >      > >     they might not be known to the Runner (if they are
            defined
             >     in Python).
             >      > >
             >      > >
             >      > >     2) Introduce a Runner-Option Flag
             >      > >     =================================
             >      > >
             >      > >     In this approach we would try to add as many 
pipeline
             >     options for a
             >      > >     Runner to the SDK, but allow additional Runner
            options to
             >     be passed
             >      > >     using the `--runner-option=key=val` flag. The
            Runner, like
             >     in 1), would
             >      > >     have to ensure validation. I think this has been the
            most
             >     favored
             >      > >     way so
             >      > >     far. Going forward, that means that
            `--parallelism=4` and
             >      > >     `--runner-option=parallelism=4` will have the same
            effect
             >     for the Flink
             >      > >     Runner.
             >      > >
             >      > >
             >      > >     3) Implement Fetching of Options from JobServer
             >      > >     ===============================================
             >      > >
             >      > >     The options are retrieved from the JobServer before
             >     submitting the
             >      > >     pipeline. I think this would be ideal but, as 
mentioned
             >     before, it
             >      > >     increases the complexity for implementing new SDKs 
and
             >     might overall
             >      > >     just not be worth the effort.
             >      > >
             >      > >
             >      > >     What do you think? I'd implement 2) for the next
            release,
             >     unless there
             >      > >     are advocates for a different approach.
             >      > >
             >      > >     Cheers,
             >      > >     Max
             >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to