On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote: > > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <rober...@google.com> wrote: >> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote: >> > >> > I agree with the sentiment for better error checking. >> > >> > We can try to make it such that the SDK can "fetch" the set of options >> > that the runner supports by making a call to the Job API. The API could >> > return a list of option names (descriptions for --help purposes and also >> > potentially the expected format) which would remove the worry around >> > "unknown" options. Yes I understand to be able to make the Job API call, >> > we may need to parse some options from the args parameters first and then >> > parse the unknown options after they are fetched. >> >> This is an interesting idea, but seems it could get quite complicated. >> E.g. for delegating runners, one would first read the options to >> determine which runner to fetch the options from, which would then >> return a set of options that possibly depends on the values of some of >> its options... >> >> > Alternatively, we can choose an explicit format upfront. >> > To expand on the exact format for --runner_option=..., here are some >> > different ideas: >> > 1) Specified multiple times, each one is an explicit flag >> > --runner_option=--blah=bar --runner_option=--foo=baz1 >> > --runner_option=--foo=baz2 >> >> I'm -1 on this format. We should move away from the idea that options >> == flags (as that doesn't compose well with other libraries that do >> their own flags parsing). The ability to parse a set of flags into >> options is just a convenience that an author may (or may not) choose >> to use (e.g. when running pipelines a long-lived process like a >> service or a notebook, the command line flags are almost certainly not >> the right interface). >> >> > 2) specified multiple times, we drop the explicit flag >> > --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2 >> >> This or (4) is my preference. >> >> > 3) we use a string which the runner can choose to interpret however they >> > want (JSON/XML shown below) >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>' >> >> This would make validation hard. Also, I think it makes sense for some >> runner options to be "shared" (parallelism") by convention, so letting >> it be a free-form string wouldn't allow different runners to inspect >> different bits. >> >> We should consider if we should use urns for namespacing, and >> assigning semantic meaning to strings, here. >> >> > 4) we use a string which must be a specific format such as JSON (allows >> > the SDK to do simple validation): >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >> >> I like this in that at least some validation can be performed, and >> expectations of how to format richer types. On the other hand it gets >> a bit verbose, given that most (I'd imagine) options will be simple. >> As with normal options, >> >> --option1=value1 --option2=value2 >> >> is shorthand for {"option1": value1, "option2": value2}. >> > I lean to 4 the most. With 2, you run into issues of what does > --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean? > Is it an error or list of lists or concatenated. Similar issues for map types > represented via JSON object {...}
We can err to be on the safe side unless/until an argument can be made that merging is more natural. I just think this will be excessively verbose to use. >> > I would strongly suggest that we go with the "fetch" approach, since this >> > makes the set of options discoverable and helps users find errors much >> > earlier in their pipeline. >> >> This seems like an advanced feature that SDKs may want to support, but >> I wouldn't want to require this complexity for bootstrapping an SDK. >> > SDKs that are starting off wouldn't need to "fetch" options, they could > choose to not support runner options or they could choose to pass all options > through to the runner blindly. Fetching the options only provides the SDK the > ability to provide error checking upfront and useful error/help messages. But how to even pass all options through blindly is exactly the difficulty we're running into here. >> Regarding always keeping runner options separate, +1, though I'm not >> sure the line is always clear. >> >> >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <rober...@google.com> >> > wrote: >> >> >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <m...@apache.org> >> >> wrote: >> >> > >> >> > I agree that the current approach breaks the pipeline options contract >> >> > because "unknown" options get parsed in the same way as options which >> >> > have been defined by the user. >> >> >> >> FWIW, I think we're already breaking this "contract." Unknown options >> >> are silently ignored; with this change we just change how we record >> >> them. It still feels a bit hacky though. >> >> >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only allows >> >> > true/false flags. We want to pass all types of pipeline options to the >> >> > Runner. >> >> >> >> Experiments is an arbitrary set of strings, which can be of the form >> >> "param=value" if that's useful. (Dataflow does this.) There is, again, >> >> no namespacing on the param names, but we could user urns or impose >> >> some other structure here. >> >> >> >> > How to solve this? >> >> > >> >> > 1) Add all options of all Runners to each SDK >> >> > We added some of the FlinkRunner options to the Python SDK but realized >> >> > syncing is rather cumbersome in the long term. However, we want the most >> >> > important options to be validated on the client side. >> >> >> >> I don't think this is sustainable in the long run. However, thinking >> >> about this, in the worse case validation happens after construction >> >> but before execution (as with much of our other validation) so it >> >> isn't that bad. >> >> >> >> > 2) Pass "unknown" options via a separate list in the Proto which can >> >> > only be accessed internally by the Runners. This still allows passing >> >> > arbitrary options but we wouldn't leak unknown options and display them >> >> > as top-level options. >> >> >> >> I think there needs to be a way for the user to communicate values >> >> directly to the runner regardless of the SDK. My preference would be >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather >> >> than scooping up all unknown flags at command line parsing time. >> >> Perhaps an SDK that is aware of some runners could choose to lift >> >> these as top-level options, but still pass them as runner options. >> >> >> >> > On 13.10.18 02:34, Charles Chen wrote: >> >> > > The current release branch >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after >> >> > > the >> >> > > revert went in. Sent out https://github.com/apache/beam/pull/6683 as >> >> > > a >> >> > > revert of the revert. Regarding your comment above, I can help out >> >> > > with >> >> > > the design / PR reviews for common Python code as you suggest. >> >> > > >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <t...@apache.org >> >> > > <mailto:t...@apache.org>> wrote: >> >> > > >> >> > > Thanks, will tag you and looking forward to feedback so we can >> >> > > ensure that changes work for everyone. >> >> > > >> >> > > Looking at the PR, I see agreement from Max to revert the change >> >> > > on >> >> > > the release branch, but not in master. Would you mind to restore >> >> > > it >> >> > > in master? >> >> > > >> >> > > Thanks >> >> > > >> >> > > On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <al...@google.com >> >> > > <mailto:al...@google.com>> wrote: >> >> > > >> >> > > >> >> > > >> >> > > On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen >> >> > > <c...@google.com >> >> > > <mailto:c...@google.com>> wrote: >> >> > > >> >> > > What I mean is that a user may find that it works for them >> >> > > to pass "--myarg blah" and access it as "options.myarg" >> >> > > without explicitly defining a "my_arg" flag due to the >> >> > > added >> >> > > logic. This is not the intended behavior and we may want >> >> > > to >> >> > > change this implementation detail in the future. However, >> >> > > having this logic in a released version makes it hard to >> >> > > change this behavior since users may erroneously depend on >> >> > > this undocumented behavior. Instead, we should namespace >> >> > > / >> >> > > scope this so that it is obvious that this is meant for >> >> > > runner (and not Beam user) consumption. >> >> > > >> >> > > On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise >> >> > > <t...@apache.org <mailto:t...@apache.org>> wrote: >> >> > > >> >> > > Can you please elaborate more what practical problems >> >> > > this introduces for users? >> >> > > >> >> > > I can see that this change allows a user to specify a >> >> > > runner specific option, which in the future may change >> >> > > because we decide to scope differently. If this only >> >> > > affects users of the portable Flink runner (like us), >> >> > > then no need to revert, because at this early stage we >> >> > > prefer something that works over being blocked. >> >> > > >> >> > > It would also be really great if some of the core >> >> > > Python >> >> > > SDK developers could help out with the design aspects >> >> > > and PR reviews of changes that affect common Python >> >> > > code. Anyone who specifically wants to be tagged on >> >> > > relevant JIRAs and PRs? >> >> > > >> >> > > >> >> > > I would be happy to be tagged, and I can also help with >> >> > > including other relevant folks whenever possible. In general I >> >> > > think Robert, Charles, myself are good candidates. >> >> > > >> >> > > >> >> > > Thanks >> >> > > >> >> > > >> >> > > On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay >> >> > > <al...@google.com <mailto:al...@google.com>> wrote: >> >> > > >> >> > > >> >> > > >> >> > > On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen >> >> > > <c...@google.com <mailto:c...@google.com>> wrote: >> >> > > >> >> > > For context, I made comments on >> >> > > https://github.com/apache/beam/pull/6600 >> >> > > noting >> >> > > that the changes being made were not good for >> >> > > Beam backwards-compatibility. The change as >> >> > > is >> >> > > allows users to use pipeline options without >> >> > > explicitly defining them, which is not the >> >> > > type >> >> > > of usage we would like to encourage since we >> >> > > prefer to be explicit whenever possible. If >> >> > > users write pipelines with this sort of >> >> > > pattern, >> >> > > they will potentially encounter pain when >> >> > > upgrading to a later version since this is an >> >> > > implementation detail and not an officially >> >> > > supported pattern. I agree with the comments >> >> > > above that this is ultimately a scoping issue. >> >> > > I would not have a problem with these changes >> >> > > if >> >> > > they were explicitly scoped under either a >> >> > > runner or unparsed options namespace. >> >> > > >> >> > > As a second note, since the 2.8.0 release is >> >> > > being cut right now, because of these >> >> > > backwards-compatibility concerns, I would >> >> > > suggest reverting these changes, at least >> >> > > until >> >> > > 2.8.0 is cut, so we can have a discussion here >> >> > > before committing to and releasing any >> >> > > API-level >> >> > > changes. >> >> > > >> >> > > >> >> > > +1 I would like to revert the changes in order not >> >> > > rush this into the release. Once this discussion >> >> > > results in an agreement changes can be brought >> >> > > back. >> >> > > >> >> > > >> >> > > On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde >> >> > > <hero...@google.com >> >> > > <mailto:hero...@google.com>> >> >> > > wrote: >> >> > > >> >> > > Agree that pipeline options lack some >> >> > > mechanism for scoping. It is also not >> >> > > always >> >> > > possible distinguish options meant to be >> >> > > consumed at pipeline construction time, by >> >> > > the runner, by the SDK harness, by the >> >> > > user >> >> > > code or any combination -- and this causes >> >> > > confusion every now and then. >> >> > > >> >> > > For Dataflow, we have been using >> >> > > "experiments" for arbitrary >> >> > > runner-specific >> >> > > options. It's simply a string list >> >> > > pipeline >> >> > > option that all SDKs support and, for Go >> >> > > at >> >> > > least, is sent to portable runners. Flink >> >> > > can do the same in the short term to move >> >> > > forward. >> >> > > >> >> > > Henning >> >> > > >> >> > > >> >> > > On Fri, Oct 12, 2018 at 8:50 AM Thomas >> >> > > Weise >> >> > > <t...@apache.org >> >> > > <mailto:t...@apache.org>> wrote: >> >> > > >> >> > > [moving to the list] >> >> > > >> >> > > The requirement driving this part of >> >> > > the >> >> > > change was to allow a user to specify >> >> > > pipeline options that a runner >> >> > > supports >> >> > > without having to declare those in >> >> > > each >> >> > > language SDK. >> >> > > >> >> > > In the specific scenario, we have >> >> > > options that the Flink runner supports >> >> > > (and can validate), that are not >> >> > > enumerated in the Python SDK. >> >> > > >> >> > > I think we have a bigger problem >> >> > > scoping >> >> > > pipeline options. For example, the >> >> > > runner options are dumped into the SDK >> >> > > worker. There is also a possibility of >> >> > > name collisions. So I think this would >> >> > > benefit from broader feedback. >> >> > > >> >> > > Thanks, >> >> > > Thomas >> >> > > >> >> > > >> >> > > ---------- Forwarded message --------- >> >> > > From: *Charles Chen* >> >> > > <notificati...@github.com >> >> > > <mailto:notificati...@github.com>> >> >> > > Date: Fri, Oct 12, 2018 at 8:36 AM >> >> > > Subject: Re: [apache/beam] [BEAM-5442] >> >> > > Store duplicate unknown options in a >> >> > > list argument (#6600) >> >> > > To: apache/beam >> >> > > <b...@noreply.github.com >> >> > > <mailto:b...@noreply.github.com>> >> >> > > Cc: Thomas Weise >> >> > > <thomas.we...@gmail.com >> >> > > <mailto:thomas.we...@gmail.com>>, >> >> > > Mention <ment...@noreply.github.com >> >> > > <mailto:ment...@noreply.github.com>> >> >> > > >> >> > > >> >> > > CC: @tweise >> >> > > <https://github.com/tweise> >> >> > > >> >> > > — >> >> > > You are receiving this because you >> >> > > were >> >> > > mentioned. >> >> > > Reply to this email directly, view it >> >> > > on >> >> > > GitHub >> >> > > >> >> > > <https://github.com/apache/beam/pull/6600#issuecomment-429367754>, >> >> > > or mute the thread >> >> > > >> >> > > <https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T>. >> >> > > >> >> > > >> >> > >