Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Thomas Weise Mon, 15 Oct 2018 11:47:55 -0700

Would it be better to generally separate the runner options (whether they
are unknown or not) from other pipeline options?



On Mon, Oct 15, 2018 at 10:55 AM Lukasz Cwik <lc...@google.com> wrote:

> Note, that thinking ahead to cross language pipelines, we'll want
> something like "options" discovery as well. So reusing this concept for
> runners makes sense to me.
>
> On Mon, Oct 15, 2018 at 10:50 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> I agree with the sentiment for better error checking.
>>
>> We can try to make it such that the SDK can "fetch" the set of options
>> that the runner supports by making a call to the Job API. The API could
>> return a list of option names (descriptions for --help purposes and also
>> potentially the expected format) which would remove the worry around
>> "unknown" options. Yes I understand to be able to make the Job API call, we
>> may need to parse some options from the args parameters first and then
>> parse the unknown options after they are fetched.
>>
>> Alternatively, we can choose an explicit format upfront.
>> To expand on the exact format for --runner_option=..., here are some
>> different ideas:
>> 1) Specified multiple times, each one is an explicit flag
>> --runner_option=--blah=bar --runner_option=--foo=baz1
>> --runner_option=--foo=baz2
>>
>> 2) specified multiple times, we drop the explicit flag
>> --runner_option=blah=bar --runner_option=foo=baz1
>> --runner_option=foo=baz2
>>
>> 3) we use a string which the runner can choose to interpret however they
>> want (JSON/XML shown below)
>> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>
>> 4) we use a string which must be a specific format such as JSON (allows
>> the SDK to do simple validation):
>> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> I would strongly suggest that we go with the "fetch" approach, since this
>> makes the set of options discoverable and helps users find errors much
>> earlier in their pipeline.
>>
>>
>>
>> On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <m...@apache.org>
>>> wrote:
>>> >
>>> > I agree that the current approach breaks the pipeline options contract
>>> > because "unknown" options get parsed in the same way as options which
>>> > have been defined by the user.
>>>
>>> FWIW, I think we're already breaking this "contract." Unknown options
>>> are silently ignored; with this change we just change how we record
>>> them. It still feels a bit hacky though.
>>>
>>> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>>> > true/false flags. We want to pass all types of pipeline options to the
>>> > Runner.
>>>
>>> Experiments is an arbitrary set of strings, which can be of the form
>>> "param=value" if that's useful. (Dataflow does this.) There is, again,
>>> no namespacing on the param names, but we could user urns or impose
>>> some other structure here.
>>>
>>> > How to solve this?
>>> >
>>> > 1) Add all options of all Runners to each SDK
>>> > We added some of the FlinkRunner options to the Python SDK but realized
>>> > syncing is rather cumbersome in the long term. However, we want the
>>> most
>>> > important options to be validated on the client side.
>>>
>>> I don't think this is sustainable in the long run. However, thinking
>>> about this, in the worse case validation happens after construction
>>> but before execution (as with much of our other validation) so it
>>> isn't that bad.
>>>
>>> > 2) Pass "unknown" options via a separate list in the Proto which can
>>> > only be accessed internally by the Runners. This still allows passing
>>> > arbitrary options but we wouldn't leak unknown options and display them
>>> > as top-level options.
>>>
>>> I think there needs to be a way for the user to communicate values
>>> directly to the runner regardless of the SDK. My preference would be
>>> to make this explicit, e.g. (repeated) --runner_option=..., rather
>>> than scooping up all unknown flags at command line parsing time.
>>> Perhaps an SDK that is aware of some runners could choose to lift
>>> these as top-level options, but still pass them as runner options.
>>>
>>> > On 13.10.18 02:34, Charles Chen wrote:
>>> > > The current release branch
>>> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
>>> after the
>>> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683
>>> as a
>>> > > revert of the revert.  Regarding your comment above, I can help out
>>> with
>>> > > the design / PR reviews for common Python code as you suggest.
>>> > >
>>> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <t...@apache.org
>>> > > <mailto:t...@apache.org>> wrote:
>>> > >
>>> > >     Thanks, will tag you and looking forward to feedback so we can
>>> > >     ensure that changes work for everyone.
>>> > >
>>> > >     Looking at the PR, I see agreement from Max to revert the change
>>> on
>>> > >     the release branch, but not in master. Would you mind to restore
>>> it
>>> > >     in master?
>>> > >
>>> > >     Thanks
>>> > >
>>> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <al...@google.com
>>> > >     <mailto:al...@google.com>> wrote:
>>> > >
>>> > >
>>> > >
>>> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <
>>> c...@google.com
>>> > >         <mailto:c...@google.com>> wrote:
>>> > >
>>> > >             What I mean is that a user may find that it works for
>>> them
>>> > >             to pass "--myarg blah" and access it as "options.myarg"
>>> > >             without explicitly defining a "my_arg" flag due to the
>>> added
>>> > >             logic.  This is not the intended behavior and we may
>>> want to
>>> > >             change this implementation detail in the future.
>>> However,
>>> > >             having this logic in a released version makes it hard to
>>> > >             change this behavior since users may erroneously depend
>>> on
>>> > >             this undocumented behavior.  Instead, we should
>>> namespace /
>>> > >             scope this so that it is obvious that this is meant for
>>> > >             runner (and not Beam user) consumption.
>>> > >
>>> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>>> > >             <t...@apache.org <mailto:t...@apache.org>> wrote:
>>> > >
>>> > >                 Can you please elaborate more what practical problems
>>> > >                 this introduces for users?
>>> > >
>>> > >                 I can see that this change allows a user to specify a
>>> > >                 runner specific option, which in the future may
>>> change
>>> > >                 because we decide to scope differently. If this only
>>> > >                 affects users of the portable Flink runner (like us),
>>> > >                 then no need to revert, because at this early stage
>>> we
>>> > >                 prefer something that works over being blocked.
>>> > >
>>> > >                 It would also be really great if some of the core
>>> Python
>>> > >                 SDK developers could help out with the design aspects
>>> > >                 and PR reviews of changes that affect common Python
>>> > >                 code. Anyone who specifically wants to be tagged on
>>> > >                 relevant JIRAs and PRs?
>>> > >
>>> > >
>>> > >         I would be happy to be tagged, and I can also help with
>>> > >         including other relevant folks whenever possible. In general
>>> I
>>> > >         think Robert, Charles, myself are good candidates.
>>> > >
>>> > >
>>> > >                 Thanks
>>> > >
>>> > >
>>> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>>> > >                 <al...@google.com <mailto:al...@google.com>> wrote:
>>> > >
>>> > >
>>> > >
>>> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>>> > >                     <c...@google.com <mailto:c...@google.com>> wrote:
>>> > >
>>> > >                         For context, I made comments on
>>> > >                         https://github.com/apache/beam/pull/6600
>>> noting
>>> > >                         that the changes being made were not good for
>>> > >                         Beam backwards-compatibility.  The change as
>>> is
>>> > >                         allows users to use pipeline options without
>>> > >                         explicitly defining them, which is not the
>>> type
>>> > >                         of usage we would like to encourage since we
>>> > >                         prefer to be explicit whenever possible.  If
>>> > >                         users write pipelines with this sort of
>>> pattern,
>>> > >                         they will potentially encounter pain when
>>> > >                         upgrading to a later version since this is an
>>> > >                         implementation detail and not an officially
>>> > >                         supported pattern.  I agree with the comments
>>> > >                         above that this is ultimately a scoping
>>> issue.
>>> > >                         I would not have a problem with these
>>> changes if
>>> > >                         they were explicitly scoped under either a
>>> > >                         runner or unparsed options namespace.
>>> > >
>>> > >                         As a second note, since the 2.8.0 release is
>>> > >                         being cut right now, because of these
>>> > >                         backwards-compatibility concerns, I would
>>> > >                         suggest reverting these changes, at least
>>> until
>>> > >                         2.8.0 is cut, so we can have a discussion
>>> here
>>> > >                         before committing to and releasing any
>>> API-level
>>> > >                         changes.
>>> > >
>>> > >
>>> > >                     +1 I would like to revert the changes in order
>>> not
>>> > >                     rush this into the release. Once this discussion
>>> > >                     results in an agreement changes can be brought
>>> back.
>>> > >
>>> > >
>>> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>>> > >                         <hero...@google.com <mailto:
>>> hero...@google.com>>
>>> > >                         wrote:
>>> > >
>>> > >                             Agree that pipeline options lack some
>>> > >                             mechanism for scoping. It is also not
>>> always
>>> > >                             possible distinguish options meant to be
>>> > >                             consumed at pipeline construction time,
>>> by
>>> > >                             the runner, by the SDK harness, by the
>>> user
>>> > >                             code or any combination -- and this
>>> causes
>>> > >                             confusion every now and then.
>>> > >
>>> > >                             For Dataflow, we have been using
>>> > >                             "experiments" for arbitrary
>>> runner-specific
>>> > >                             options. It's simply a string list
>>> pipeline
>>> > >                             option that all SDKs support and, for Go
>>> at
>>> > >                             least, is sent to portable runners. Flink
>>> > >                             can do the same in the short term to move
>>> > >                             forward.
>>> > >
>>> > >                             Henning
>>> > >
>>> > >
>>> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas
>>> Weise
>>> > >                             <t...@apache.org <mailto:t...@apache.org>>
>>> wrote:
>>> > >
>>> > >                                 [moving to the list]
>>> > >
>>> > >                                 The requirement driving this part of
>>> the
>>> > >                                 change was to allow a user to specify
>>> > >                                 pipeline options that a runner
>>> supports
>>> > >                                 without having to declare those in
>>> each
>>> > >                                 language SDK.
>>> > >
>>> > >                                 In the specific scenario, we have
>>> > >                                 options that the Flink runner
>>> supports
>>> > >                                 (and can validate), that are not
>>> > >                                 enumerated in the Python SDK.
>>> > >
>>> > >                                 I think we have a bigger problem
>>> scoping
>>> > >                                 pipeline options. For example, the
>>> > >                                 runner options are dumped into the
>>> SDK
>>> > >                                 worker. There is also a possibility
>>> of
>>> > >                                 name collisions. So I think this
>>> would
>>> > >                                 benefit from broader feedback.
>>> > >
>>> > >                                 Thanks,
>>> > >                                 Thomas
>>> > >
>>> > >
>>> > >                                 ---------- Forwarded message
>>> ---------
>>> > >                                 From: *Charles Chen*
>>> > >                                 <notificati...@github.com
>>> > >                                 <mailto:notificati...@github.com>>
>>> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>>> > >                                 Subject: Re: [apache/beam]
>>> [BEAM-5442]
>>> > >                                 Store duplicate unknown options in a
>>> > >                                 list argument (#6600)
>>> > >                                 To: apache/beam <
>>> b...@noreply.github.com
>>> > >                                 <mailto:b...@noreply.github.com>>
>>> > >                                 Cc: Thomas Weise <
>>> thomas.we...@gmail.com
>>> > >                                 <mailto:thomas.we...@gmail.com>>,
>>> > >                                 Mention <ment...@noreply.github.com
>>> > >                                 <mailto:ment...@noreply.github.com>>
>>> > >
>>> > >
>>> > >                                 CC: @tweise <
>>> https://github.com/tweise>
>>> > >
>>> > >                                 —
>>> > >                                 You are receiving this because you
>>> were
>>> > >                                 mentioned.
>>> > >                                 Reply to this email directly, view
>>> it on
>>> > >                                 GitHub
>>> > >                                 <
>>> https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
>>> > >                                 or mute the thread
>>> > >                                 <
>>> https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T
>>> >.
>>> > >
>>> > >
>>> > >
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to