Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Robert Bradshaw Mon, 15 Oct 2018 14:41:45 -0700

On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <[email protected]> wrote:
>
> On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <[email protected]> wrote:
>>
>> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <[email protected]> wrote:
>> >
>> > I agree with the sentiment for better error checking.
>> >
>> > We can try to make it such that the SDK can "fetch" the set of options 
>> > that the runner supports by making a call to the Job API. The API could 
>> > return a list of option names (descriptions for --help purposes and also 
>> > potentially the expected format) which would remove the worry around 
>> > "unknown" options. Yes I understand to be able to make the Job API call, 
>> > we may need to parse some options from the args parameters first and then 
>> > parse the unknown options after they are fetched.
>>
>> This is an interesting idea, but seems it could get quite complicated.
>> E.g. for delegating runners, one would first read the options to
>> determine which runner to fetch the options from, which would then
>> return a set of options that possibly depends on the values of some of
>> its options...
>>
>> > Alternatively, we can choose an explicit format upfront.
>> > To expand on the exact format for --runner_option=..., here are some 
>> > different ideas:
>> > 1) Specified multiple times, each one is an explicit flag
>> > --runner_option=--blah=bar --runner_option=--foo=baz1 
>> > --runner_option=--foo=baz2
>>
>> I'm -1 on this format. We should move away from the idea that options
>> == flags (as that doesn't compose well with other libraries that do
>> their own flags parsing). The ability to parse a set of flags into
>> options is just a convenience that an author may (or may not) choose
>> to use (e.g. when running pipelines a long-lived process like a
>> service or a notebook, the command line flags are almost certainly not
>> the right interface).
>>
>> > 2) specified multiple times, we drop the explicit flag
>> > --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2
>>
>> This or (4) is my preference.
>>
>> > 3) we use a string which the runner can choose to interpret however they 
>> > want (JSON/XML shown below)
>> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>
>> This would make validation hard. Also, I think it makes sense for some
>> runner options to be "shared" (parallelism") by convention, so letting
>> it be a free-form string wouldn't allow different runners to inspect
>> different bits.
>>
>> We should consider if we should use urns for namespacing, and
>> assigning semantic meaning to strings, here.
>>
>> > 4) we use a string which must be a specific format such as JSON (allows 
>> > the SDK to do simple validation):
>> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> I like this in that at least some validation can be performed, and
>> expectations of how to format richer types. On the other hand it gets
>> a bit verbose, given that most (I'd imagine) options will be simple.
>> As with normal options,
>>
>>     --option1=value1 --option2=value2
>>
>> is shorthand for {"option1": value1, "option2": value2}.
>>
> I lean to 4 the most. With 2, you run into issues of what does 
> --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean?
> Is it an error or list of lists or concatenated. Similar issues for map types 
> represented via JSON object {...}


We can err to be on the safe side unless/until an argument can be made
that merging is more natural. I just think this will be excessively
verbose to use.

>> > I would strongly suggest that we go with the "fetch" approach, since this 
>> > makes the set of options discoverable and helps users find errors much 
>> > earlier in their pipeline.
>>
>> This seems like an advanced feature that SDKs may want to support, but
>> I wouldn't want to require this complexity for bootstrapping an SDK.
>>
> SDKs that are starting off wouldn't need to "fetch" options, they could 
> choose to not support runner options or they could choose to pass all options 
> through to the runner blindly. Fetching the options only provides the SDK the 
> ability to provide error checking upfront and useful error/help messages.

But how to even pass all options through blindly is exactly the
difficulty we're running into here.

>> Regarding always keeping runner options separate, +1, though I'm not
>> sure the line is always clear.
>>
>>
>> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <[email protected]> 
>> > wrote:
>> >>
>> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <[email protected]> 
>> >> wrote:
>> >> >
>> >> > I agree that the current approach breaks the pipeline options contract
>> >> > because "unknown" options get parsed in the same way as options which
>> >> > have been defined by the user.
>> >>
>> >> FWIW, I think we're already breaking this "contract." Unknown options
>> >> are silently ignored; with this change we just change how we record
>> >> them. It still feels a bit hacky though.
>> >>
>> >> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>> >> > true/false flags. We want to pass all types of pipeline options to the
>> >> > Runner.
>> >>
>> >> Experiments is an arbitrary set of strings, which can be of the form
>> >> "param=value" if that's useful. (Dataflow does this.) There is, again,
>> >> no namespacing on the param names, but we could user urns or impose
>> >> some other structure here.
>> >>
>> >> > How to solve this?
>> >> >
>> >> > 1) Add all options of all Runners to each SDK
>> >> > We added some of the FlinkRunner options to the Python SDK but realized
>> >> > syncing is rather cumbersome in the long term. However, we want the most
>> >> > important options to be validated on the client side.
>> >>
>> >> I don't think this is sustainable in the long run. However, thinking
>> >> about this, in the worse case validation happens after construction
>> >> but before execution (as with much of our other validation) so it
>> >> isn't that bad.
>> >>
>> >> > 2) Pass "unknown" options via a separate list in the Proto which can
>> >> > only be accessed internally by the Runners. This still allows passing
>> >> > arbitrary options but we wouldn't leak unknown options and display them
>> >> > as top-level options.
>> >>
>> >> I think there needs to be a way for the user to communicate values
>> >> directly to the runner regardless of the SDK. My preference would be
>> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> >> than scooping up all unknown flags at command line parsing time.
>> >> Perhaps an SDK that is aware of some runners could choose to lift
>> >> these as top-level options, but still pass them as runner options.
>> >>
>> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >> > > The current release branch
>> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after 
>> >> > > the
>> >> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as 
>> >> > > a
>> >> > > revert of the revert.  Regarding your comment above, I can help out 
>> >> > > with
>> >> > > the design / PR reviews for common Python code as you suggest.
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <[email protected]
>> >> > > <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >     Thanks, will tag you and looking forward to feedback so we can
>> >> > >     ensure that changes work for everyone.
>> >> > >
>> >> > >     Looking at the PR, I see agreement from Max to revert the change 
>> >> > > on
>> >> > >     the release branch, but not in master. Would you mind to restore 
>> >> > > it
>> >> > >     in master?
>> >> > >
>> >> > >     Thanks
>> >> > >
>> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <[email protected]
>> >> > >     <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen 
>> >> > > <[email protected]
>> >> > >         <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >             What I mean is that a user may find that it works for them
>> >> > >             to pass "--myarg blah" and access it as "options.myarg"
>> >> > >             without explicitly defining a "my_arg" flag due to the 
>> >> > > added
>> >> > >             logic.  This is not the intended behavior and we may want 
>> >> > > to
>> >> > >             change this implementation detail in the future.  However,
>> >> > >             having this logic in a released version makes it hard to
>> >> > >             change this behavior since users may erroneously depend on
>> >> > >             this undocumented behavior.  Instead, we should namespace 
>> >> > > /
>> >> > >             scope this so that it is obvious that this is meant for
>> >> > >             runner (and not Beam user) consumption.
>> >> > >
>> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>> >> > >             <[email protected] <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >                 Can you please elaborate more what practical problems
>> >> > >                 this introduces for users?
>> >> > >
>> >> > >                 I can see that this change allows a user to specify a
>> >> > >                 runner specific option, which in the future may change
>> >> > >                 because we decide to scope differently. If this only
>> >> > >                 affects users of the portable Flink runner (like us),
>> >> > >                 then no need to revert, because at this early stage we
>> >> > >                 prefer something that works over being blocked.
>> >> > >
>> >> > >                 It would also be really great if some of the core 
>> >> > > Python
>> >> > >                 SDK developers could help out with the design aspects
>> >> > >                 and PR reviews of changes that affect common Python
>> >> > >                 code. Anyone who specifically wants to be tagged on
>> >> > >                 relevant JIRAs and PRs?
>> >> > >
>> >> > >
>> >> > >         I would be happy to be tagged, and I can also help with
>> >> > >         including other relevant folks whenever possible. In general I
>> >> > >         think Robert, Charles, myself are good candidates.
>> >> > >
>> >> > >
>> >> > >                 Thanks
>> >> > >
>> >> > >
>> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>> >> > >                 <[email protected] <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>> >> > >                     <[email protected] <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >                         For context, I made comments on
>> >> > >                         https://github.com/apache/beam/pull/6600 
>> >> > > noting
>> >> > >                         that the changes being made were not good for
>> >> > >                         Beam backwards-compatibility.  The change as 
>> >> > > is
>> >> > >                         allows users to use pipeline options without
>> >> > >                         explicitly defining them, which is not the 
>> >> > > type
>> >> > >                         of usage we would like to encourage since we
>> >> > >                         prefer to be explicit whenever possible.  If
>> >> > >                         users write pipelines with this sort of 
>> >> > > pattern,
>> >> > >                         they will potentially encounter pain when
>> >> > >                         upgrading to a later version since this is an
>> >> > >                         implementation detail and not an officially
>> >> > >                         supported pattern.  I agree with the comments
>> >> > >                         above that this is ultimately a scoping issue.
>> >> > >                         I would not have a problem with these changes 
>> >> > > if
>> >> > >                         they were explicitly scoped under either a
>> >> > >                         runner or unparsed options namespace.
>> >> > >
>> >> > >                         As a second note, since the 2.8.0 release is
>> >> > >                         being cut right now, because of these
>> >> > >                         backwards-compatibility concerns, I would
>> >> > >                         suggest reverting these changes, at least 
>> >> > > until
>> >> > >                         2.8.0 is cut, so we can have a discussion here
>> >> > >                         before committing to and releasing any 
>> >> > > API-level
>> >> > >                         changes.
>> >> > >
>> >> > >
>> >> > >                     +1 I would like to revert the changes in order not
>> >> > >                     rush this into the release. Once this discussion
>> >> > >                     results in an agreement changes can be brought 
>> >> > > back.
>> >> > >
>> >> > >
>> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>> >> > >                         <[email protected] 
>> >> > > <mailto:[email protected]>>
>> >> > >                         wrote:
>> >> > >
>> >> > >                             Agree that pipeline options lack some
>> >> > >                             mechanism for scoping. It is also not 
>> >> > > always
>> >> > >                             possible distinguish options meant to be
>> >> > >                             consumed at pipeline construction time, by
>> >> > >                             the runner, by the SDK harness, by the 
>> >> > > user
>> >> > >                             code or any combination -- and this causes
>> >> > >                             confusion every now and then.
>> >> > >
>> >> > >                             For Dataflow, we have been using
>> >> > >                             "experiments" for arbitrary 
>> >> > > runner-specific
>> >> > >                             options. It's simply a string list 
>> >> > > pipeline
>> >> > >                             option that all SDKs support and, for Go 
>> >> > > at
>> >> > >                             least, is sent to portable runners. Flink
>> >> > >                             can do the same in the short term to move
>> >> > >                             forward.
>> >> > >
>> >> > >                             Henning
>> >> > >
>> >> > >
>> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas 
>> >> > > Weise
>> >> > >                             <[email protected] 
>> >> > > <mailto:[email protected]>> wrote:
>> >> > >
>> >> > >                                 [moving to the list]
>> >> > >
>> >> > >                                 The requirement driving this part of 
>> >> > > the
>> >> > >                                 change was to allow a user to specify
>> >> > >                                 pipeline options that a runner 
>> >> > > supports
>> >> > >                                 without having to declare those in 
>> >> > > each
>> >> > >                                 language SDK.
>> >> > >
>> >> > >                                 In the specific scenario, we have
>> >> > >                                 options that the Flink runner supports
>> >> > >                                 (and can validate), that are not
>> >> > >                                 enumerated in the Python SDK.
>> >> > >
>> >> > >                                 I think we have a bigger problem 
>> >> > > scoping
>> >> > >                                 pipeline options. For example, the
>> >> > >                                 runner options are dumped into the SDK
>> >> > >                                 worker. There is also a possibility of
>> >> > >                                 name collisions. So I think this would
>> >> > >                                 benefit from broader feedback.
>> >> > >
>> >> > >                                 Thanks,
>> >> > >                                 Thomas
>> >> > >
>> >> > >
>> >> > >                                 ---------- Forwarded message ---------
>> >> > >                                 From: *Charles Chen*
>> >> > >                                 <[email protected]
>> >> > >                                 <mailto:[email protected]>>
>> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>> >> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
>> >> > >                                 Store duplicate unknown options in a
>> >> > >                                 list argument (#6600)
>> >> > >                                 To: apache/beam 
>> >> > > <[email protected]
>> >> > >                                 <mailto:[email protected]>>
>> >> > >                                 Cc: Thomas Weise 
>> >> > > <[email protected]
>> >> > >                                 <mailto:[email protected]>>,
>> >> > >                                 Mention <[email protected]
>> >> > >                                 <mailto:[email protected]>>
>> >> > >
>> >> > >
>> >> > >                                 CC: @tweise 
>> >> > > <https://github.com/tweise>
>> >> > >
>> >> > >                                 —
>> >> > >                                 You are receiving this because you 
>> >> > > were
>> >> > >                                 mentioned.
>> >> > >                                 Reply to this email directly, view it 
>> >> > > on
>> >> > >                                 GitHub
>> >> > >                                 
>> >> > > <https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
>> >> > >                                 or mute the thread
>> >> > >                                 
>> >> > > <https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T>.
>> >> > >
>> >> > >
>> >> > >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Reply via email to