Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Thomas Weise Thu, 25 Oct 2018 12:20:02 -0700

Reminder that this is something we ideally address before the next
release...


Considering the discussion so far, my preference is that we get away from
unknown options and discover valid options from the runner (by expanding
the job service).

Once the SDK is aware of all valid options, it is possible to provide
meaningful feedback to the user (validate or help), and correctly handle
scopes and types.

I would prefer we don't introduce a (quirky) way of passing unknown options
that forces users to type JSON into the command line (or similar
acrobatics). To someone wanting to run a pipeline, all options are equally
important, whether they are application specific, SDK specific or runner
specific. It should be possible to *optionally* qualify/scope (to cover
cases where there is ambiguity), but otherwise I prefer the format we
currently have.

Regarding type inference: Correct handling of numeric types matters, see
following issue with protobuf (not JSON):
https://issues.apache.org/jira/browse/BEAM-5509

Thomas


On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw <[email protected]> wrote:

> On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik <[email protected]> wrote:
>
>>
>> On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <[email protected]> wrote:
>>> >
>>> > For all unknown options, the SDK can require that all flag values be
>>> specified explicitly as a valid JSON type.
>>> > starts with { -> object
>>> > starts with [ -> list
>>> > starts with " -> string
>>> > is null / true / false -> null / true / false
>>> > otherwise is number.
>>> >
>>> > This isn't great for strings but works well for all the other types.
>>> >
>>> > Thus for known options, the additional typing information would
>>> disambiguate whether something should be a
>>> string/boolean/number/object/list but for unknown options we would expect
>>> the user to use valid JSON explicitly and write:
>>> > --foo={"object": "value"}
>>> > --foo=["value", "value2"]
>>> > --foo="string value"
>>>
>>> Due to shell escaping, one would have to write
>>>
>>> --foo=\"string value\"
>>>
>>> or actually, due to the space
>>>
>>> --foo='"string value"'
>>>
>>> or some other variation on that, which is really unfortunate. (The JSON
>>> list/objects would need similar quoting, but that's less surprising.) Also,
>>> does this mean we'd only have one kind of number (not integer vs. float,
>>> i.e. --parallelism=5.0 works)? I suppose that is JSON.
>>>
>>
>> Yes, I was suspecting that users would need to type the second variant as
>> \"...\" I found more burdensome then '"..."'
>>
>>
>>>
>>> > --foo=3.5 --foo=-4
>>> > --foo=true --foo=false
>>> > --foo=null
>>> > This also works if the flag is repeated, so --foo=3.5 --foo=-4 is
>>> [3.5, -4]
>>>
>>> The thing that sparked this discussion was what to do when unknown foo
>>> is repeated, but only one value given.
>>>
>>
>> If the person only specifies one value, then they have to disambiguate
>> and put it in a list, only if they specify more then one value will they
>> have to turn it into a list.
>>
>> I believe we could come up with other schemes on how to convert unknown
>> options to JSON where we prefer strings over non-string types like
>> null/true/false/numbers/list/object and require the user to escape out of
>> the string default but anything that is too different from strict JSON
>> would cause headaches when attempting to explain the format to users. I
>> think a happy middle ground would be that we will only require escaping for
>> strings which are ambiguous, so things like true, null, false, ... to be
>> treated as strings would require the user to escape them.
>>
>
> I'd prefer to avoid inferring the type of an unknown argument based on its
> contents, which can lead to surprises. We could declare every unknown type
> to be repeated string, and let any parsing/validation occur on the runner.
> If desired, we could pass these around as a single "runner options" dict
> that runners could inspect and use to populate the actual dict rather than
> mixing parsed and unparsed options.
>
>
>>
>>
>>> > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <[email protected]> wrote:
>>> >>
>>> >> Discovering options from the job server seems preferable over
>>> replicating runner options in SDKs.
>>> >>
>>> >> Runners evolve on their own, and with portability the SDK does not
>>> need to know anything about the runner.
>>> >>
>>> >> Regarding --runner-option. It is true that this looks less user
>>> friendly. On the other hand it eliminates the possibility of name
>>> collisions.
>>> >>
>>> >> But if options are discovered, the SDK can perform full validation.
>>> It would only be necessary to use explicit scoping when there is ambiguity.
>>> >>
>>> >> Thomas
>>> >>
>>> >>
>>> >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <[email protected]>
>>> wrote:
>>> >>>
>>> >>> Fetching options directly from the Runner's JobServer seems like the
>>> >>> ideal solution. I agree with Robert that it creates additional
>>> >>> complexity for SDK authors, so the `--runner-option` flag would be an
>>> >>> easy and explicit way to specify additional Runner options.
>>> >>>
>>> >>> The format I prefer would be: --runner_option=key1=val1
>>> >>> --runner_option=key2=val2
>>> >>>
>>> >>> Now, from the perspective of end users, I think it is neither
>>> convenient
>>> >>> nor reasonable to require the use of the `--runner-option` flag. To
>>> the
>>> >>> user it seems nebulous why some pipeline options live in the
>>> top-level
>>> >>> option namespace while others need to be nested within an option.
>>> This
>>> >>> is amplified by there being two Runners the user needs to be aware
>>> of,
>>> >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>>> >>>
>>> >>> I feel like we would eventually replicate all options in the SDK
>>> because
>>> >>> otherwise users have to use the `--runner-option`, but at least we
>>> can
>>> >>> specify options which have not been replicated yet.
>>> >>>
>>> >>> -Max
>>> >>>
>>> >>> On 16.10.18 10:27, Robert Bradshaw wrote:
>>> >>> > Yes, we don't know how to parse and/or validate it.
>>> >>> >
>>> >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <[email protected]
>>> >>> > <mailto:[email protected]>> wrote:
>>> >>> >
>>> >>> >     I see, is the issue that we currently are using a JSON
>>> >>> >     representation for options when being serialized and when we
>>> get
>>> >>> >     some unknown option, we don't know how to convert it into its
>>> JSON form?
>>> >>> >
>>> >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <
>>> [email protected]
>>> >>> >     <mailto:[email protected]>> wrote:
>>> >>> >
>>> >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <
>>> [email protected]
>>> >>> >         <mailto:[email protected]>> wrote:
>>> >>> >          >
>>> >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
>>> >>> >         <[email protected] <mailto:[email protected]>> wrote:
>>> >>> >          >>
>>> >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
>>> >>> >         <[email protected] <mailto:[email protected]>> wrote:
>>> >>> >          >> >
>>> >>> >          >> > I agree with the sentiment for better error checking.
>>> >>> >          >> >
>>> >>> >          >> > We can try to make it such that the SDK can "fetch"
>>> the
>>> >>> >         set of options that the runner supports by making a call
>>> to the
>>> >>> >         Job API. The API could return a list of option names
>>> >>> >         (descriptions for --help purposes and also potentially the
>>> >>> >         expected format) which would remove the worry around
>>> "unknown"
>>> >>> >         options. Yes I understand to be able to make the Job API
>>> call,
>>> >>> >         we may need to parse some options from the args parameters
>>> first
>>> >>> >         and then parse the unknown options after they are fetched.
>>> >>> >          >>
>>> >>> >          >> This is an interesting idea, but seems it could get
>>> quite
>>> >>> >         complicated.
>>> >>> >          >> E.g. for delegating runners, one would first read the
>>> options to
>>> >>> >          >> determine which runner to fetch the options from, which
>>> >>> >         would then
>>> >>> >          >> return a set of options that possibly depends on the
>>> values
>>> >>> >         of some of
>>> >>> >          >> its options...
>>> >>> >          >>
>>> >>> >          >> > Alternatively, we can choose an explicit format
>>> upfront.
>>> >>> >          >> > To expand on the exact format for
>>> --runner_option=...,
>>> >>> >         here are some different ideas:
>>> >>> >          >> > 1) Specified multiple times, each one is an explicit
>>> flag
>>> >>> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>>> >>> >         --runner_option=--foo=baz2
>>> >>> >          >>
>>> >>> >          >> I'm -1 on this format. We should move away from the
>>> idea
>>> >>> >         that options
>>> >>> >          >> == flags (as that doesn't compose well with other
>>> libraries
>>> >>> >         that do
>>> >>> >          >> their own flags parsing). The ability to parse a set of
>>> >>> >         flags into
>>> >>> >          >> options is just a convenience that an author may (or
>>> may
>>> >>> >         not) choose
>>> >>> >          >> to use (e.g. when running pipelines a long-lived
>>> process like a
>>> >>> >          >> service or a notebook, the command line flags are
>>> almost
>>> >>> >         certainly not
>>> >>> >          >> the right interface).
>>> >>> >          >>
>>> >>> >          >> > 2) specified multiple times, we drop the explicit
>>> flag
>>> >>> >          >> > --runner_option=blah=bar --runner_option=foo=baz1
>>> >>> >         --runner_option=foo=baz2
>>> >>> >          >>
>>> >>> >          >> This or (4) is my preference.
>>> >>> >          >>
>>> >>> >          >> > 3) we use a string which the runner can choose to
>>> >>> >         interpret however they want (JSON/XML shown below)
>>> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
>>> "baz2"]}'
>>> >>> >          >> >
>>> >>> >
>>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>> >>> >          >>
>>> >>> >          >> This would make validation hard. Also, I think it makes
>>> >>> >         sense for some
>>> >>> >          >> runner options to be "shared" (parallelism") by
>>> convention,
>>> >>> >         so letting
>>> >>> >          >> it be a free-form string wouldn't allow different
>>> runners to
>>> >>> >         inspect
>>> >>> >          >> different bits.
>>> >>> >          >>
>>> >>> >          >> We should consider if we should use urns for
>>> namespacing, and
>>> >>> >          >> assigning semantic meaning to strings, here.
>>> >>> >          >>
>>> >>> >          >> > 4) we use a string which must be a specific format
>>> such as
>>> >>> >         JSON (allows the SDK to do simple validation):
>>> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
>>> "baz2"]}'
>>> >>> >          >>
>>> >>> >          >> I like this in that at least some validation can be
>>> >>> >         performed, and
>>> >>> >          >> expectations of how to format richer types. On the
>>> other
>>> >>> >         hand it gets
>>> >>> >          >> a bit verbose, given that most (I'd imagine) options
>>> will be
>>> >>> >         simple.
>>> >>> >          >> As with normal options,
>>> >>> >          >>
>>> >>> >          >>     --option1=value1 --option2=value2
>>> >>> >          >>
>>> >>> >          >> is shorthand for {"option1": value1, "option2":
>>> value2}.
>>> >>> >          >>
>>> >>> >          > I lean to 4 the most. With 2, you run into issues of
>>> what
>>> >>> >         does --runner_option=foo=["a", "b"]
>>> --runner_option=foo=["c",
>>> >>> >         "d"] mean?
>>> >>> >          > Is it an error or list of lists or concatenated. Similar
>>> >>> >         issues for map types represented via JSON object {...}
>>> >>> >
>>> >>> >         We can err to be on the safe side unless/until an argument
>>> can
>>> >>> >         be made
>>> >>> >         that merging is more natural. I just think this will be
>>> excessively
>>> >>> >         verbose to use.
>>> >>> >
>>> >>> >          >> > I would strongly suggest that we go with the "fetch"
>>> >>> >         approach, since this makes the set of options discoverable
>>> and
>>> >>> >         helps users find errors much earlier in their pipeline.
>>> >>> >          >>
>>> >>> >          >> This seems like an advanced feature that SDKs may want
>>> to
>>> >>> >         support, but
>>> >>> >          >> I wouldn't want to require this complexity for
>>> bootstrapping
>>> >>> >         an SDK.
>>> >>> >          >>
>>> >>> >          > SDKs that are starting off wouldn't need to "fetch"
>>> options,
>>> >>> >         they could choose to not support runner options or they
>>> could
>>> >>> >         choose to pass all options through to the runner blindly.
>>> >>> >         Fetching the options only provides the SDK the ability to
>>> >>> >         provide error checking upfront and useful error/help
>>> messages.
>>> >>> >
>>> >>> >         But how to even pass all options through blindly is
>>> exactly the
>>> >>> >         difficulty we're running into here.
>>> >>> >
>>> >>> >          >> Regarding always keeping runner options separate, +1,
>>> though
>>> >>> >         I'm not
>>> >>> >          >> sure the line is always clear.
>>> >>> >          >>
>>> >>> >          >>
>>> >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
>>> >>> >         <[email protected] <mailto:[email protected]>> wrote:
>>> >>> >          >> >>
>>> >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
>>> >>> >         <[email protected] <mailto:[email protected]>> wrote:
>>> >>> >          >> >> >
>>> >>> >          >> >> > I agree that the current approach breaks the
>>> pipeline
>>> >>> >         options contract
>>> >>> >          >> >> > because "unknown" options get parsed in the same
>>> way as
>>> >>> >         options which
>>> >>> >          >> >> > have been defined by the user.
>>> >>> >          >> >>
>>> >>> >          >> >> FWIW, I think we're already breaking this
>>> "contract."
>>> >>> >         Unknown options
>>> >>> >          >> >> are silently ignored; with this change we just
>>> change how
>>> >>> >         we record
>>> >>> >          >> >> them. It still feels a bit hacky though.
>>> >>> >          >> >>
>>> >>> >          >> >> > I'm not sure the `experiments` flag works for us.
>>> AFAIK
>>> >>> >         it only allows
>>> >>> >          >> >> > true/false flags. We want to pass all types of
>>> pipeline
>>> >>> >         options to the
>>> >>> >          >> >> > Runner.
>>> >>> >          >> >>
>>> >>> >          >> >> Experiments is an arbitrary set of strings, which
>>> can be
>>> >>> >         of the form
>>> >>> >          >> >> "param=value" if that's useful. (Dataflow does
>>> this.)
>>> >>> >         There is, again,
>>> >>> >          >> >> no namespacing on the param names, but we could
>>> user urns
>>> >>> >         or impose
>>> >>> >          >> >> some other structure here.
>>> >>> >          >> >>
>>> >>> >          >> >> > How to solve this?
>>> >>> >          >> >> >
>>> >>> >          >> >> > 1) Add all options of all Runners to each SDK
>>> >>> >          >> >> > We added some of the FlinkRunner options to the
>>> Python
>>> >>> >         SDK but realized
>>> >>> >          >> >> > syncing is rather cumbersome in the long term.
>>> However,
>>> >>> >         we want the most
>>> >>> >          >> >> > important options to be validated on the client
>>> side.
>>> >>> >          >> >>
>>> >>> >          >> >> I don't think this is sustainable in the long run.
>>> >>> >         However, thinking
>>> >>> >          >> >> about this, in the worse case validation happens
>>> after
>>> >>> >         construction
>>> >>> >          >> >> but before execution (as with much of our other
>>> >>> >         validation) so it
>>> >>> >          >> >> isn't that bad.
>>> >>> >          >> >>
>>> >>> >          >> >> > 2) Pass "unknown" options via a separate list in
>>> the
>>> >>> >         Proto which can
>>> >>> >          >> >> > only be accessed internally by the Runners. This
>>> still
>>> >>> >         allows passing
>>> >>> >          >> >> > arbitrary options but we wouldn't leak unknown
>>> options
>>> >>> >         and display them
>>> >>> >          >> >> > as top-level options.
>>> >>> >          >> >>
>>> >>> >          >> >> I think there needs to be a way for the user to
>>> >>> >         communicate values
>>> >>> >          >> >> directly to the runner regardless of the SDK. My
>>> >>> >         preference would be
>>> >>> >          >> >> to make this explicit, e.g. (repeated)
>>> >>> >         --runner_option=..., rather
>>> >>> >          >> >> than scooping up all unknown flags at command line
>>> >>> >         parsing time.
>>> >>> >          >> >> Perhaps an SDK that is aware of some runners could
>>> choose
>>> >>> >         to lift
>>> >>> >          >> >> these as top-level options, but still pass them as
>>> runner
>>> >>> >         options.
>>> >>> >          >> >>
>>> >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>>> >>> >          >> >> > > The current release branch
>>> >>> >          >> >> > >
>>> >>> >         (https://github.com/apache/beam/commits/release-2.8.0)
>>> was cut
>>> >>> >         after the
>>> >>> >          >> >> > > revert went in.  Sent out
>>> >>> >         https://github.com/apache/beam/pull/6683 as a
>>> >>> >          >> >> > > revert of the revert.  Regarding your comment
>>> above,
>>> >>> >         I can help out with
>>> >>> >          >> >> > > the design / PR reviews for common Python code
>>> as you
>>> >>> >         suggest.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
>>> >>> >         <[email protected] <mailto:[email protected]>
>>> >>> >          >> >> > > <mailto:[email protected] <mailto:[email protected]>>>
>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     Thanks, will tag you and looking forward to
>>> >>> >         feedback so we can
>>> >>> >          >> >> > >     ensure that changes work for everyone.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     Looking at the PR, I see agreement from Max
>>> to
>>> >>> >         revert the change on
>>> >>> >          >> >> > >     the release branch, but not in master.
>>> Would you
>>> >>> >         mind to restore it
>>> >>> >          >> >> > >     in master?
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     Thanks
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
>>> >>> >         <[email protected] <mailto:[email protected]>
>>> >>> >          >> >> > >     <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM,
>>> Charles
>>> >>> >         Chen <[email protected] <mailto:[email protected]>
>>> >>> >          >> >> > >         <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >             What I mean is that a user may find
>>> that
>>> >>> >         it works for them
>>> >>> >          >> >> > >             to pass "--myarg blah" and access
>>> it as
>>> >>> >         "options.myarg"
>>> >>> >          >> >> > >             without explicitly defining a
>>> "my_arg"
>>> >>> >         flag due to the added
>>> >>> >          >> >> > >             logic.  This is not the intended
>>> behavior
>>> >>> >         and we may want to
>>> >>> >          >> >> > >             change this implementation detail
>>> in the
>>> >>> >         future.  However,
>>> >>> >          >> >> > >             having this logic in a released
>>> version
>>> >>> >         makes it hard to
>>> >>> >          >> >> > >             change this behavior since users may
>>> >>> >         erroneously depend on
>>> >>> >          >> >> > >             this undocumented behavior.
>>> Instead, we
>>> >>> >         should namespace /
>>> >>> >          >> >> > >             scope this so that it is obvious
>>> that
>>> >>> >         this is meant for
>>> >>> >          >> >> > >             runner (and not Beam user)
>>> consumption.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM
>>> Thomas Weise
>>> >>> >          >> >> > >             <[email protected] <mailto:
>>> [email protected]>
>>> >>> >         <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 Can you please elaborate more
>>> what
>>> >>> >         practical problems
>>> >>> >          >> >> > >                 this introduces for users?
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 I can see that this change
>>> allows a
>>> >>> >         user to specify a
>>> >>> >          >> >> > >                 runner specific option, which
>>> in the
>>> >>> >         future may change
>>> >>> >          >> >> > >                 because we decide to scope
>>> >>> >         differently. If this only
>>> >>> >          >> >> > >                 affects users of the portable
>>> Flink
>>> >>> >         runner (like us),
>>> >>> >          >> >> > >                 then no need to revert, because
>>> at
>>> >>> >         this early stage we
>>> >>> >          >> >> > >                 prefer something that works over
>>> >>> >         being blocked.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 It would also be really great
>>> if some
>>> >>> >         of the core Python
>>> >>> >          >> >> > >                 SDK developers could help out
>>> with
>>> >>> >         the design aspects
>>> >>> >          >> >> > >                 and PR reviews of changes that
>>> affect
>>> >>> >         common Python
>>> >>> >          >> >> > >                 code. Anyone who specifically
>>> wants
>>> >>> >         to be tagged on
>>> >>> >          >> >> > >                 relevant JIRAs and PRs?
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >         I would be happy to be tagged, and I
>>> can also
>>> >>> >         help with
>>> >>> >          >> >> > >         including other relevant folks whenever
>>> >>> >         possible. In general I
>>> >>> >          >> >> > >         think Robert, Charles, myself are good
>>> >>> >         candidates.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 Thanks
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
>>> >>> >         Ahmet Altay
>>> >>> >          >> >> > >                 <[email protected]
>>> >>> >         <mailto:[email protected]> <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                     On Fri, Oct 12, 2018 at
>>> 10:11 AM,
>>> >>> >         Charles Chen
>>> >>> >          >> >> > >                     <[email protected]
>>> >>> >         <mailto:[email protected]> <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                         For context, I made
>>> comments on
>>> >>> >          >> >> > > https://github.com/apache/beam/pull/6600 noting
>>> >>> >          >> >> > >                         that the changes being
>>> made
>>> >>> >         were not good for
>>> >>> >          >> >> > >                         Beam
>>> >>> >         backwards-compatibility.  The change as is
>>> >>> >          >> >> > >                         allows users to use
>>> pipeline
>>> >>> >         options without
>>> >>> >          >> >> > >                         explicitly defining
>>> them,
>>> >>> >         which is not the type
>>> >>> >          >> >> > >                         of usage we would like
>>> to
>>> >>> >         encourage since we
>>> >>> >          >> >> > >                         prefer to be explicit
>>> >>> >         whenever possible.  If
>>> >>> >          >> >> > >                         users write pipelines
>>> with
>>> >>> >         this sort of pattern,
>>> >>> >          >> >> > >                         they will potentially
>>> >>> >         encounter pain when
>>> >>> >          >> >> > >                         upgrading to a later
>>> version
>>> >>> >         since this is an
>>> >>> >          >> >> > >                         implementation detail
>>> and not
>>> >>> >         an officially
>>> >>> >          >> >> > >                         supported pattern.  I
>>> agree
>>> >>> >         with the comments
>>> >>> >          >> >> > >                         above that this is
>>> ultimately
>>> >>> >         a scoping issue.
>>> >>> >          >> >> > >                         I would not have a
>>> problem
>>> >>> >         with these changes if
>>> >>> >          >> >> > >                         they were explicitly
>>> scoped
>>> >>> >         under either a
>>> >>> >          >> >> > >                         runner or unparsed
>>> options
>>> >>> >         namespace.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                         As a second note, since
>>> the
>>> >>> >         2.8.0 release is
>>> >>> >          >> >> > >                         being cut right now,
>>> because
>>> >>> >         of these
>>> >>> >          >> >> > >                         backwards-compatibility
>>> >>> >         concerns, I would
>>> >>> >          >> >> > >                         suggest reverting these
>>> >>> >         changes, at least until
>>> >>> >          >> >> > >                         2.8.0 is cut, so we can
>>> have
>>> >>> >         a discussion here
>>> >>> >          >> >> > >                         before committing to and
>>> >>> >         releasing any API-level
>>> >>> >          >> >> > >                         changes.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                     +1 I would like to revert
>>> the
>>> >>> >         changes in order not
>>> >>> >          >> >> > >                     rush this into the release.
>>> Once
>>> >>> >         this discussion
>>> >>> >          >> >> > >                     results in an agreement
>>> changes
>>> >>> >         can be brought back.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                         On Fri, Oct 12, 2018 at
>>> 9:26
>>> >>> >         AM Henning Rohde
>>> >>> >          >> >> > >                         <[email protected]
>>> >>> >         <mailto:[email protected]> <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>>
>>> >>> >          >> >> > >                         wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             Agree that pipeline
>>> >>> >         options lack some
>>> >>> >          >> >> > >                             mechanism for
>>> scoping. It
>>> >>> >         is also not always
>>> >>> >          >> >> > >                             possible distinguish
>>> >>> >         options meant to be
>>> >>> >          >> >> > >                             consumed at pipeline
>>> >>> >         construction time, by
>>> >>> >          >> >> > >                             the runner, by the
>>> SDK
>>> >>> >         harness, by the user
>>> >>> >          >> >> > >                             code or any
>>> combination
>>> >>> >         -- and this causes
>>> >>> >          >> >> > >                             confusion every now
>>> and then.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             For Dataflow, we
>>> have
>>> >>> >         been using
>>> >>> >          >> >> > >                             "experiments" for
>>> >>> >         arbitrary runner-specific
>>> >>> >          >> >> > >                             options. It's
>>> simply a
>>> >>> >         string list pipeline
>>> >>> >          >> >> > >                             option that all SDKs
>>> >>> >         support and, for Go at
>>> >>> >          >> >> > >                             least, is sent to
>>> >>> >         portable runners. Flink
>>> >>> >          >> >> > >                             can do the same in
>>> the
>>> >>> >         short term to move
>>> >>> >          >> >> > >                             forward.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             Henning
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             On Fri, Oct 12,
>>> 2018 at
>>> >>> >         8:50 AM Thomas Weise
>>> >>> >          >> >> > >                             <[email protected]
>>> >>> >         <mailto:[email protected]> <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 [moving to the
>>> list]
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 The requirement
>>> >>> >         driving this part of the
>>> >>> >          >> >> > >                                 change was to
>>> allow a
>>> >>> >         user to specify
>>> >>> >          >> >> > >                                 pipeline
>>> options that
>>> >>> >         a runner supports
>>> >>> >          >> >> > >                                 without having
>>> to
>>> >>> >         declare those in each
>>> >>> >          >> >> > >                                 language SDK.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 In the specific
>>> >>> >         scenario, we have
>>> >>> >          >> >> > >                                 options that the
>>> >>> >         Flink runner supports
>>> >>> >          >> >> > >                                 (and can
>>> validate),
>>> >>> >         that are not
>>> >>> >          >> >> > >                                 enumerated in
>>> the
>>> >>> >         Python SDK.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 I think we have
>>> a
>>> >>> >         bigger problem scoping
>>> >>> >          >> >> > >                                 pipeline
>>> options. For
>>> >>> >         example, the
>>> >>> >          >> >> > >                                 runner options
>>> are
>>> >>> >         dumped into the SDK
>>> >>> >          >> >> > >                                 worker. There
>>> is also
>>> >>> >         a possibility of
>>> >>> >          >> >> > >                                 name
>>> collisions. So I
>>> >>> >         think this would
>>> >>> >          >> >> > >                                 benefit from
>>> broader
>>> >>> >         feedback.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 Thanks,
>>> >>> >          >> >> > >                                 Thomas
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 ----------
>>> Forwarded
>>> >>> >         message ---------
>>> >>> >          >> >> > >                                 From: *Charles
>>> Chen*
>>> >>> >          >> >> > >
>>> >>> >           <[email protected] <mailto:
>>> [email protected]>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>>
>>> >>> >          >> >> > >                                 Date: Fri, Oct
>>> 12,
>>> >>> >         2018 at 8:36 AM
>>> >>> >          >> >> > >                                 Subject: Re:
>>> >>> >         [apache/beam] [BEAM-5442]
>>> >>> >          >> >> > >                                 Store duplicate
>>> >>> >         unknown options in a
>>> >>> >          >> >> > >                                 list argument
>>> (#6600)
>>> >>> >          >> >> > >                                 To: apache/beam
>>> >>> >         <[email protected] <mailto:[email protected]>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:[email protected] <mailto:
>>> [email protected]>>>
>>> >>> >          >> >> > >                                 Cc: Thomas Weise
>>> >>> >         <[email protected] <mailto:[email protected]>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:[email protected] <mailto:
>>> [email protected]>>>,
>>> >>> >          >> >> > >                                 Mention
>>> >>> >         <[email protected] <mailto:
>>> [email protected]>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:[email protected]
>>> >>> >         <mailto:[email protected]>>>
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 CC: @tweise
>>> >>> >         <https://github.com/tweise>
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 —
>>> >>> >          >> >> > >                                 You are
>>> receiving
>>> >>> >         this because you were
>>> >>> >          >> >> > >                                 mentioned.
>>> >>> >          >> >> > >                                 Reply to this
>>> email
>>> >>> >         directly, view it on
>>> >>> >          >> >> > >                                 GitHub
>>> >>> >          >> >> > >
>>> >>> >           <
>>> https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
>>> >>> >          >> >> > >                                 or mute the
>>> thread
>>> >>> >          >> >> > >
>>> >>> >           <
>>> https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T
>>> >.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to