Would it be better to generally separate the runner options (whether they are unknown or not) from other pipeline options?
On Mon, Oct 15, 2018 at 10:55 AM Lukasz Cwik <lc...@google.com> wrote: > Note, that thinking ahead to cross language pipelines, we'll want > something like "options" discovery as well. So reusing this concept for > runners makes sense to me. > > On Mon, Oct 15, 2018 at 10:50 AM Lukasz Cwik <lc...@google.com> wrote: > >> I agree with the sentiment for better error checking. >> >> We can try to make it such that the SDK can "fetch" the set of options >> that the runner supports by making a call to the Job API. The API could >> return a list of option names (descriptions for --help purposes and also >> potentially the expected format) which would remove the worry around >> "unknown" options. Yes I understand to be able to make the Job API call, we >> may need to parse some options from the args parameters first and then >> parse the unknown options after they are fetched. >> >> Alternatively, we can choose an explicit format upfront. >> To expand on the exact format for --runner_option=..., here are some >> different ideas: >> 1) Specified multiple times, each one is an explicit flag >> --runner_option=--blah=bar --runner_option=--foo=baz1 >> --runner_option=--foo=baz2 >> >> 2) specified multiple times, we drop the explicit flag >> --runner_option=blah=bar --runner_option=foo=baz1 >> --runner_option=foo=baz2 >> >> 3) we use a string which the runner can choose to interpret however they >> want (JSON/XML shown below) >> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >> >> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>' >> >> 4) we use a string which must be a specific format such as JSON (allows >> the SDK to do simple validation): >> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >> >> I would strongly suggest that we go with the "fetch" approach, since this >> makes the set of options discoverable and helps users find errors much >> earlier in their pipeline. >> >> >> >> On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <m...@apache.org> >>> wrote: >>> > >>> > I agree that the current approach breaks the pipeline options contract >>> > because "unknown" options get parsed in the same way as options which >>> > have been defined by the user. >>> >>> FWIW, I think we're already breaking this "contract." Unknown options >>> are silently ignored; with this change we just change how we record >>> them. It still feels a bit hacky though. >>> >>> > I'm not sure the `experiments` flag works for us. AFAIK it only allows >>> > true/false flags. We want to pass all types of pipeline options to the >>> > Runner. >>> >>> Experiments is an arbitrary set of strings, which can be of the form >>> "param=value" if that's useful. (Dataflow does this.) There is, again, >>> no namespacing on the param names, but we could user urns or impose >>> some other structure here. >>> >>> > How to solve this? >>> > >>> > 1) Add all options of all Runners to each SDK >>> > We added some of the FlinkRunner options to the Python SDK but realized >>> > syncing is rather cumbersome in the long term. However, we want the >>> most >>> > important options to be validated on the client side. >>> >>> I don't think this is sustainable in the long run. However, thinking >>> about this, in the worse case validation happens after construction >>> but before execution (as with much of our other validation) so it >>> isn't that bad. >>> >>> > 2) Pass "unknown" options via a separate list in the Proto which can >>> > only be accessed internally by the Runners. This still allows passing >>> > arbitrary options but we wouldn't leak unknown options and display them >>> > as top-level options. >>> >>> I think there needs to be a way for the user to communicate values >>> directly to the runner regardless of the SDK. My preference would be >>> to make this explicit, e.g. (repeated) --runner_option=..., rather >>> than scooping up all unknown flags at command line parsing time. >>> Perhaps an SDK that is aware of some runners could choose to lift >>> these as top-level options, but still pass them as runner options. >>> >>> > On 13.10.18 02:34, Charles Chen wrote: >>> > > The current release branch >>> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut >>> after the >>> > > revert went in. Sent out https://github.com/apache/beam/pull/6683 >>> as a >>> > > revert of the revert. Regarding your comment above, I can help out >>> with >>> > > the design / PR reviews for common Python code as you suggest. >>> > > >>> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <t...@apache.org >>> > > <mailto:t...@apache.org>> wrote: >>> > > >>> > > Thanks, will tag you and looking forward to feedback so we can >>> > > ensure that changes work for everyone. >>> > > >>> > > Looking at the PR, I see agreement from Max to revert the change >>> on >>> > > the release branch, but not in master. Would you mind to restore >>> it >>> > > in master? >>> > > >>> > > Thanks >>> > > >>> > > On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <al...@google.com >>> > > <mailto:al...@google.com>> wrote: >>> > > >>> > > >>> > > >>> > > On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen < >>> c...@google.com >>> > > <mailto:c...@google.com>> wrote: >>> > > >>> > > What I mean is that a user may find that it works for >>> them >>> > > to pass "--myarg blah" and access it as "options.myarg" >>> > > without explicitly defining a "my_arg" flag due to the >>> added >>> > > logic. This is not the intended behavior and we may >>> want to >>> > > change this implementation detail in the future. >>> However, >>> > > having this logic in a released version makes it hard to >>> > > change this behavior since users may erroneously depend >>> on >>> > > this undocumented behavior. Instead, we should >>> namespace / >>> > > scope this so that it is obvious that this is meant for >>> > > runner (and not Beam user) consumption. >>> > > >>> > > On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise >>> > > <t...@apache.org <mailto:t...@apache.org>> wrote: >>> > > >>> > > Can you please elaborate more what practical problems >>> > > this introduces for users? >>> > > >>> > > I can see that this change allows a user to specify a >>> > > runner specific option, which in the future may >>> change >>> > > because we decide to scope differently. If this only >>> > > affects users of the portable Flink runner (like us), >>> > > then no need to revert, because at this early stage >>> we >>> > > prefer something that works over being blocked. >>> > > >>> > > It would also be really great if some of the core >>> Python >>> > > SDK developers could help out with the design aspects >>> > > and PR reviews of changes that affect common Python >>> > > code. Anyone who specifically wants to be tagged on >>> > > relevant JIRAs and PRs? >>> > > >>> > > >>> > > I would be happy to be tagged, and I can also help with >>> > > including other relevant folks whenever possible. In general >>> I >>> > > think Robert, Charles, myself are good candidates. >>> > > >>> > > >>> > > Thanks >>> > > >>> > > >>> > > On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay >>> > > <al...@google.com <mailto:al...@google.com>> wrote: >>> > > >>> > > >>> > > >>> > > On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen >>> > > <c...@google.com <mailto:c...@google.com>> wrote: >>> > > >>> > > For context, I made comments on >>> > > https://github.com/apache/beam/pull/6600 >>> noting >>> > > that the changes being made were not good for >>> > > Beam backwards-compatibility. The change as >>> is >>> > > allows users to use pipeline options without >>> > > explicitly defining them, which is not the >>> type >>> > > of usage we would like to encourage since we >>> > > prefer to be explicit whenever possible. If >>> > > users write pipelines with this sort of >>> pattern, >>> > > they will potentially encounter pain when >>> > > upgrading to a later version since this is an >>> > > implementation detail and not an officially >>> > > supported pattern. I agree with the comments >>> > > above that this is ultimately a scoping >>> issue. >>> > > I would not have a problem with these >>> changes if >>> > > they were explicitly scoped under either a >>> > > runner or unparsed options namespace. >>> > > >>> > > As a second note, since the 2.8.0 release is >>> > > being cut right now, because of these >>> > > backwards-compatibility concerns, I would >>> > > suggest reverting these changes, at least >>> until >>> > > 2.8.0 is cut, so we can have a discussion >>> here >>> > > before committing to and releasing any >>> API-level >>> > > changes. >>> > > >>> > > >>> > > +1 I would like to revert the changes in order >>> not >>> > > rush this into the release. Once this discussion >>> > > results in an agreement changes can be brought >>> back. >>> > > >>> > > >>> > > On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde >>> > > <hero...@google.com <mailto: >>> hero...@google.com>> >>> > > wrote: >>> > > >>> > > Agree that pipeline options lack some >>> > > mechanism for scoping. It is also not >>> always >>> > > possible distinguish options meant to be >>> > > consumed at pipeline construction time, >>> by >>> > > the runner, by the SDK harness, by the >>> user >>> > > code or any combination -- and this >>> causes >>> > > confusion every now and then. >>> > > >>> > > For Dataflow, we have been using >>> > > "experiments" for arbitrary >>> runner-specific >>> > > options. It's simply a string list >>> pipeline >>> > > option that all SDKs support and, for Go >>> at >>> > > least, is sent to portable runners. Flink >>> > > can do the same in the short term to move >>> > > forward. >>> > > >>> > > Henning >>> > > >>> > > >>> > > On Fri, Oct 12, 2018 at 8:50 AM Thomas >>> Weise >>> > > <t...@apache.org <mailto:t...@apache.org>> >>> wrote: >>> > > >>> > > [moving to the list] >>> > > >>> > > The requirement driving this part of >>> the >>> > > change was to allow a user to specify >>> > > pipeline options that a runner >>> supports >>> > > without having to declare those in >>> each >>> > > language SDK. >>> > > >>> > > In the specific scenario, we have >>> > > options that the Flink runner >>> supports >>> > > (and can validate), that are not >>> > > enumerated in the Python SDK. >>> > > >>> > > I think we have a bigger problem >>> scoping >>> > > pipeline options. For example, the >>> > > runner options are dumped into the >>> SDK >>> > > worker. There is also a possibility >>> of >>> > > name collisions. So I think this >>> would >>> > > benefit from broader feedback. >>> > > >>> > > Thanks, >>> > > Thomas >>> > > >>> > > >>> > > ---------- Forwarded message >>> --------- >>> > > From: *Charles Chen* >>> > > <notificati...@github.com >>> > > <mailto:notificati...@github.com>> >>> > > Date: Fri, Oct 12, 2018 at 8:36 AM >>> > > Subject: Re: [apache/beam] >>> [BEAM-5442] >>> > > Store duplicate unknown options in a >>> > > list argument (#6600) >>> > > To: apache/beam < >>> b...@noreply.github.com >>> > > <mailto:b...@noreply.github.com>> >>> > > Cc: Thomas Weise < >>> thomas.we...@gmail.com >>> > > <mailto:thomas.we...@gmail.com>>, >>> > > Mention <ment...@noreply.github.com >>> > > <mailto:ment...@noreply.github.com>> >>> > > >>> > > >>> > > CC: @tweise < >>> https://github.com/tweise> >>> > > >>> > > — >>> > > You are receiving this because you >>> were >>> > > mentioned. >>> > > Reply to this email directly, view >>> it on >>> > > GitHub >>> > > < >>> https://github.com/apache/beam/pull/6600#issuecomment-429367754>, >>> > > or mute the thread >>> > > < >>> https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T >>> >. >>> > > >>> > > >>> > > >>> >>