On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <lc...@google.com> wrote: > > For all unknown options, the SDK can require that all flag values be specified explicitly as a valid JSON type. > starts with { -> object > starts with [ -> list > starts with " -> string > is null / true / false -> null / true / false > otherwise is number. > > This isn't great for strings but works well for all the other types. > > Thus for known options, the additional typing information would disambiguate whether something should be a string/boolean/number/object/list but for unknown options we would expect the user to use valid JSON explicitly and write: > --foo={"object": "value"} > --foo=["value", "value2"] > --foo="string value"
Due to shell escaping, one would have to write --foo=\"string value\" or actually, due to the space --foo='"string value"' or some other variation on that, which is really unfortunate. (The JSON list/objects would need similar quoting, but that's less surprising.) Also, does this mean we'd only have one kind of number (not integer vs. float, i.e. --parallelism=5.0 works)? I suppose that is JSON. > --foo=3.5 --foo=-4 > --foo=true --foo=false > --foo=null > This also works if the flag is repeated, so --foo=3.5 --foo=-4 is [3.5, -4] The thing that sparked this discussion was what to do when unknown foo is repeated, but only one value given. > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <t...@apache.org> wrote: >> >> Discovering options from the job server seems preferable over replicating runner options in SDKs. >> >> Runners evolve on their own, and with portability the SDK does not need to know anything about the runner. >> >> Regarding --runner-option. It is true that this looks less user friendly. On the other hand it eliminates the possibility of name collisions. >> >> But if options are discovered, the SDK can perform full validation. It would only be necessary to use explicit scoping when there is ambiguity. >> >> Thomas >> >> >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <m...@apache.org> wrote: >>> >>> Fetching options directly from the Runner's JobServer seems like the >>> ideal solution. I agree with Robert that it creates additional >>> complexity for SDK authors, so the `--runner-option` flag would be an >>> easy and explicit way to specify additional Runner options. >>> >>> The format I prefer would be: --runner_option=key1=val1 >>> --runner_option=key2=val2 >>> >>> Now, from the perspective of end users, I think it is neither convenient >>> nor reasonable to require the use of the `--runner-option` flag. To the >>> user it seems nebulous why some pipeline options live in the top-level >>> option namespace while others need to be nested within an option. This >>> is amplified by there being two Runners the user needs to be aware of, >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..). >>> >>> I feel like we would eventually replicate all options in the SDK because >>> otherwise users have to use the `--runner-option`, but at least we can >>> specify options which have not been replicated yet. >>> >>> -Max >>> >>> On 16.10.18 10:27, Robert Bradshaw wrote: >>> > Yes, we don't know how to parse and/or validate it. >>> > >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lc...@google.com >>> > <mailto:lc...@google.com>> wrote: >>> > >>> > I see, is the issue that we currently are using a JSON >>> > representation for options when being serialized and when we get >>> > some unknown option, we don't know how to convert it into its JSON form? >>> > >>> > On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw < rober...@google.com >>> > <mailto:rober...@google.com>> wrote: >>> > >>> > On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com >>> > <mailto:lc...@google.com>> wrote: >>> > > >>> > > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw >>> > <rober...@google.com <mailto:rober...@google.com>> wrote: >>> > >> >>> > >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik >>> > <lc...@google.com <mailto:lc...@google.com>> wrote: >>> > >> > >>> > >> > I agree with the sentiment for better error checking. >>> > >> > >>> > >> > We can try to make it such that the SDK can "fetch" the >>> > set of options that the runner supports by making a call to the >>> > Job API. The API could return a list of option names >>> > (descriptions for --help purposes and also potentially the >>> > expected format) which would remove the worry around "unknown" >>> > options. Yes I understand to be able to make the Job API call, >>> > we may need to parse some options from the args parameters first >>> > and then parse the unknown options after they are fetched. >>> > >> >>> > >> This is an interesting idea, but seems it could get quite >>> > complicated. >>> > >> E.g. for delegating runners, one would first read the options to >>> > >> determine which runner to fetch the options from, which >>> > would then >>> > >> return a set of options that possibly depends on the values >>> > of some of >>> > >> its options... >>> > >> >>> > >> > Alternatively, we can choose an explicit format upfront. >>> > >> > To expand on the exact format for --runner_option=..., >>> > here are some different ideas: >>> > >> > 1) Specified multiple times, each one is an explicit flag >>> > >> > --runner_option=--blah=bar --runner_option=--foo=baz1 >>> > --runner_option=--foo=baz2 >>> > >> >>> > >> I'm -1 on this format. We should move away from the idea >>> > that options >>> > >> == flags (as that doesn't compose well with other libraries >>> > that do >>> > >> their own flags parsing). The ability to parse a set of >>> > flags into >>> > >> options is just a convenience that an author may (or may >>> > not) choose >>> > >> to use (e.g. when running pipelines a long-lived process like a >>> > >> service or a notebook, the command line flags are almost >>> > certainly not >>> > >> the right interface). >>> > >> >>> > >> > 2) specified multiple times, we drop the explicit flag >>> > >> > --runner_option=blah=bar --runner_option=foo=baz1 >>> > --runner_option=foo=baz2 >>> > >> >>> > >> This or (4) is my preference. >>> > >> >>> > >> > 3) we use a string which the runner can choose to >>> > interpret however they want (JSON/XML shown below) >>> > >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >>> > >> > >>> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>' >>> > >> >>> > >> This would make validation hard. Also, I think it makes >>> > sense for some >>> > >> runner options to be "shared" (parallelism") by convention, >>> > so letting >>> > >> it be a free-form string wouldn't allow different runners to >>> > inspect >>> > >> different bits. >>> > >> >>> > >> We should consider if we should use urns for namespacing, and >>> > >> assigning semantic meaning to strings, here. >>> > >> >>> > >> > 4) we use a string which must be a specific format such as >>> > JSON (allows the SDK to do simple validation): >>> > >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}' >>> > >> >>> > >> I like this in that at least some validation can be >>> > performed, and >>> > >> expectations of how to format richer types. On the other >>> > hand it gets >>> > >> a bit verbose, given that most (I'd imagine) options will be >>> > simple. >>> > >> As with normal options, >>> > >> >>> > >> --option1=value1 --option2=value2 >>> > >> >>> > >> is shorthand for {"option1": value1, "option2": value2}. >>> > >> >>> > > I lean to 4 the most. With 2, you run into issues of what >>> > does --runner_option=foo=["a", "b"] --runner_option=foo=["c", >>> > "d"] mean? >>> > > Is it an error or list of lists or concatenated. Similar >>> > issues for map types represented via JSON object {...} >>> > >>> > We can err to be on the safe side unless/until an argument can >>> > be made >>> > that merging is more natural. I just think this will be excessively >>> > verbose to use. >>> > >>> > >> > I would strongly suggest that we go with the "fetch" >>> > approach, since this makes the set of options discoverable and >>> > helps users find errors much earlier in their pipeline. >>> > >> >>> > >> This seems like an advanced feature that SDKs may want to >>> > support, but >>> > >> I wouldn't want to require this complexity for bootstrapping >>> > an SDK. >>> > >> >>> > > SDKs that are starting off wouldn't need to "fetch" options, >>> > they could choose to not support runner options or they could >>> > choose to pass all options through to the runner blindly. >>> > Fetching the options only provides the SDK the ability to >>> > provide error checking upfront and useful error/help messages. >>> > >>> > But how to even pass all options through blindly is exactly the >>> > difficulty we're running into here. >>> > >>> > >> Regarding always keeping runner options separate, +1, though >>> > I'm not >>> > >> sure the line is always clear. >>> > >> >>> > >> >>> > >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw >>> > <rober...@google.com <mailto:rober...@google.com>> wrote: >>> > >> >> >>> > >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels >>> > <m...@apache.org <mailto:m...@apache.org>> wrote: >>> > >> >> > >>> > >> >> > I agree that the current approach breaks the pipeline >>> > options contract >>> > >> >> > because "unknown" options get parsed in the same way as >>> > options which >>> > >> >> > have been defined by the user. >>> > >> >> >>> > >> >> FWIW, I think we're already breaking this "contract." >>> > Unknown options >>> > >> >> are silently ignored; with this change we just change how >>> > we record >>> > >> >> them. It still feels a bit hacky though. >>> > >> >> >>> > >> >> > I'm not sure the `experiments` flag works for us. AFAIK >>> > it only allows >>> > >> >> > true/false flags. We want to pass all types of pipeline >>> > options to the >>> > >> >> > Runner. >>> > >> >> >>> > >> >> Experiments is an arbitrary set of strings, which can be >>> > of the form >>> > >> >> "param=value" if that's useful. (Dataflow does this.) >>> > There is, again, >>> > >> >> no namespacing on the param names, but we could user urns >>> > or impose >>> > >> >> some other structure here. >>> > >> >> >>> > >> >> > How to solve this? >>> > >> >> > >>> > >> >> > 1) Add all options of all Runners to each SDK >>> > >> >> > We added some of the FlinkRunner options to the Python >>> > SDK but realized >>> > >> >> > syncing is rather cumbersome in the long term. However, >>> > we want the most >>> > >> >> > important options to be validated on the client side. >>> > >> >> >>> > >> >> I don't think this is sustainable in the long run. >>> > However, thinking >>> > >> >> about this, in the worse case validation happens after >>> > construction >>> > >> >> but before execution (as with much of our other >>> > validation) so it >>> > >> >> isn't that bad. >>> > >> >> >>> > >> >> > 2) Pass "unknown" options via a separate list in the >>> > Proto which can >>> > >> >> > only be accessed internally by the Runners. This still >>> > allows passing >>> > >> >> > arbitrary options but we wouldn't leak unknown options >>> > and display them >>> > >> >> > as top-level options. >>> > >> >> >>> > >> >> I think there needs to be a way for the user to >>> > communicate values >>> > >> >> directly to the runner regardless of the SDK. My >>> > preference would be >>> > >> >> to make this explicit, e.g. (repeated) >>> > --runner_option=..., rather >>> > >> >> than scooping up all unknown flags at command line >>> > parsing time. >>> > >> >> Perhaps an SDK that is aware of some runners could choose >>> > to lift >>> > >> >> these as top-level options, but still pass them as runner >>> > options. >>> > >> >> >>> > >> >> > On 13.10.18 02:34, Charles Chen wrote: >>> > >> >> > > The current release branch >>> > >> >> > > >>> > (https://github.com/apache/beam/commits/release-2.8.0) was cut >>> > after the >>> > >> >> > > revert went in. Sent out >>> > https://github.com/apache/beam/pull/6683 as a >>> > >> >> > > revert of the revert. Regarding your comment above, >>> > I can help out with >>> > >> >> > > the design / PR reviews for common Python code as you >>> > suggest. >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise >>> > <t...@apache.org <mailto:t...@apache.org> >>> > >> >> > > <mailto:t...@apache.org <mailto:t...@apache.org>>> wrote: >>> > >> >> > > >>> > >> >> > > Thanks, will tag you and looking forward to >>> > feedback so we can >>> > >> >> > > ensure that changes work for everyone. >>> > >> >> > > >>> > >> >> > > Looking at the PR, I see agreement from Max to >>> > revert the change on >>> > >> >> > > the release branch, but not in master. Would you >>> > mind to restore it >>> > >> >> > > in master? >>> > >> >> > > >>> > >> >> > > Thanks >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay >>> > <al...@google.com <mailto:al...@google.com> >>> > >> >> > > <mailto:al...@google.com >>> > <mailto:al...@google.com>>> wrote: >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 11:31 AM, Charles >>> > Chen <c...@google.com <mailto:c...@google.com> >>> > >> >> > > <mailto:c...@google.com >>> > <mailto:c...@google.com>>> wrote: >>> > >> >> > > >>> > >> >> > > What I mean is that a user may find that >>> > it works for them >>> > >> >> > > to pass "--myarg blah" and access it as >>> > "options.myarg" >>> > >> >> > > without explicitly defining a "my_arg" >>> > flag due to the added >>> > >> >> > > logic. This is not the intended behavior >>> > and we may want to >>> > >> >> > > change this implementation detail in the >>> > future. However, >>> > >> >> > > having this logic in a released version >>> > makes it hard to >>> > >> >> > > change this behavior since users may >>> > erroneously depend on >>> > >> >> > > this undocumented behavior. Instead, we >>> > should namespace / >>> > >> >> > > scope this so that it is obvious that >>> > this is meant for >>> > >> >> > > runner (and not Beam user) consumption. >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise >>> > >> >> > > <t...@apache.org <mailto:t...@apache.org> >>> > <mailto:t...@apache.org <mailto:t...@apache.org>>> wrote: >>> > >> >> > > >>> > >> >> > > Can you please elaborate more what >>> > practical problems >>> > >> >> > > this introduces for users? >>> > >> >> > > >>> > >> >> > > I can see that this change allows a >>> > user to specify a >>> > >> >> > > runner specific option, which in the >>> > future may change >>> > >> >> > > because we decide to scope >>> > differently. If this only >>> > >> >> > > affects users of the portable Flink >>> > runner (like us), >>> > >> >> > > then no need to revert, because at >>> > this early stage we >>> > >> >> > > prefer something that works over >>> > being blocked. >>> > >> >> > > >>> > >> >> > > It would also be really great if some >>> > of the core Python >>> > >> >> > > SDK developers could help out with >>> > the design aspects >>> > >> >> > > and PR reviews of changes that affect >>> > common Python >>> > >> >> > > code. Anyone who specifically wants >>> > to be tagged on >>> > >> >> > > relevant JIRAs and PRs? >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > I would be happy to be tagged, and I can also >>> > help with >>> > >> >> > > including other relevant folks whenever >>> > possible. In general I >>> > >> >> > > think Robert, Charles, myself are good >>> > candidates. >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > Thanks >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 10:20 AM >>> > Ahmet Altay >>> > >> >> > > <al...@google.com >>> > <mailto:al...@google.com> <mailto:al...@google.com >>> > <mailto:al...@google.com>>> wrote: >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 10:11 AM, >>> > Charles Chen >>> > >> >> > > <c...@google.com >>> > <mailto:c...@google.com> <mailto:c...@google.com >>> > <mailto:c...@google.com>>> wrote: >>> > >> >> > > >>> > >> >> > > For context, I made comments on >>> > >> >> > > https://github.com/apache/beam/pull/6600 noting >>> > >> >> > > that the changes being made >>> > were not good for >>> > >> >> > > Beam >>> > backwards-compatibility. The change as is >>> > >> >> > > allows users to use pipeline >>> > options without >>> > >> >> > > explicitly defining them, >>> > which is not the type >>> > >> >> > > of usage we would like to >>> > encourage since we >>> > >> >> > > prefer to be explicit >>> > whenever possible. If >>> > >> >> > > users write pipelines with >>> > this sort of pattern, >>> > >> >> > > they will potentially >>> > encounter pain when >>> > >> >> > > upgrading to a later version >>> > since this is an >>> > >> >> > > implementation detail and not >>> > an officially >>> > >> >> > > supported pattern. I agree >>> > with the comments >>> > >> >> > > above that this is ultimately >>> > a scoping issue. >>> > >> >> > > I would not have a problem >>> > with these changes if >>> > >> >> > > they were explicitly scoped >>> > under either a >>> > >> >> > > runner or unparsed options >>> > namespace. >>> > >> >> > > >>> > >> >> > > As a second note, since the >>> > 2.8.0 release is >>> > >> >> > > being cut right now, because >>> > of these >>> > >> >> > > backwards-compatibility >>> > concerns, I would >>> > >> >> > > suggest reverting these >>> > changes, at least until >>> > >> >> > > 2.8.0 is cut, so we can have >>> > a discussion here >>> > >> >> > > before committing to and >>> > releasing any API-level >>> > >> >> > > changes. >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > +1 I would like to revert the >>> > changes in order not >>> > >> >> > > rush this into the release. Once >>> > this discussion >>> > >> >> > > results in an agreement changes >>> > can be brought back. >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at 9:26 >>> > AM Henning Rohde >>> > >> >> > > <hero...@google.com >>> > <mailto:hero...@google.com> <mailto:hero...@google.com >>> > <mailto:hero...@google.com>>> >>> > >> >> > > wrote: >>> > >> >> > > >>> > >> >> > > Agree that pipeline >>> > options lack some >>> > >> >> > > mechanism for scoping. It >>> > is also not always >>> > >> >> > > possible distinguish >>> > options meant to be >>> > >> >> > > consumed at pipeline >>> > construction time, by >>> > >> >> > > the runner, by the SDK >>> > harness, by the user >>> > >> >> > > code or any combination >>> > -- and this causes >>> > >> >> > > confusion every now and then. >>> > >> >> > > >>> > >> >> > > For Dataflow, we have >>> > been using >>> > >> >> > > "experiments" for >>> > arbitrary runner-specific >>> > >> >> > > options. It's simply a >>> > string list pipeline >>> > >> >> > > option that all SDKs >>> > support and, for Go at >>> > >> >> > > least, is sent to >>> > portable runners. Flink >>> > >> >> > > can do the same in the >>> > short term to move >>> > >> >> > > forward. >>> > >> >> > > >>> > >> >> > > Henning >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > On Fri, Oct 12, 2018 at >>> > 8:50 AM Thomas Weise >>> > >> >> > > <t...@apache.org >>> > <mailto:t...@apache.org> <mailto:t...@apache.org >>> > <mailto:t...@apache.org>>> wrote: >>> > >> >> > > >>> > >> >> > > [moving to the list] >>> > >> >> > > >>> > >> >> > > The requirement >>> > driving this part of the >>> > >> >> > > change was to allow a >>> > user to specify >>> > >> >> > > pipeline options that >>> > a runner supports >>> > >> >> > > without having to >>> > declare those in each >>> > >> >> > > language SDK. >>> > >> >> > > >>> > >> >> > > In the specific >>> > scenario, we have >>> > >> >> > > options that the >>> > Flink runner supports >>> > >> >> > > (and can validate), >>> > that are not >>> > >> >> > > enumerated in the >>> > Python SDK. >>> > >> >> > > >>> > >> >> > > I think we have a >>> > bigger problem scoping >>> > >> >> > > pipeline options. For >>> > example, the >>> > >> >> > > runner options are >>> > dumped into the SDK >>> > >> >> > > worker. There is also >>> > a possibility of >>> > >> >> > > name collisions. So I >>> > think this would >>> > >> >> > > benefit from broader >>> > feedback. >>> > >> >> > > >>> > >> >> > > Thanks, >>> > >> >> > > Thomas >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > ---------- Forwarded >>> > message --------- >>> > >> >> > > From: *Charles Chen* >>> > >> >> > > >>> > <notificati...@github.com <mailto:notificati...@github.com> >>> > >> >> > > >>> > <mailto:notificati...@github.com >>> > <mailto:notificati...@github.com>>> >>> > >> >> > > Date: Fri, Oct 12, >>> > 2018 at 8:36 AM >>> > >> >> > > Subject: Re: >>> > [apache/beam] [BEAM-5442] >>> > >> >> > > Store duplicate >>> > unknown options in a >>> > >> >> > > list argument (#6600) >>> > >> >> > > To: apache/beam >>> > <b...@noreply.github.com <mailto:b...@noreply.github.com> >>> > >> >> > > >>> > <mailto:b...@noreply.github.com <mailto: b...@noreply.github.com>>> >>> > >> >> > > Cc: Thomas Weise >>> > <thomas.we...@gmail.com <mailto:thomas.we...@gmail.com> >>> > >> >> > > >>> > <mailto:thomas.we...@gmail.com <mailto: thomas.we...@gmail.com>>>, >>> > >> >> > > Mention >>> > <ment...@noreply.github.com <mailto:ment...@noreply.github.com > >>> > >> >> > > >>> > <mailto:ment...@noreply.github.com >>> > <mailto:ment...@noreply.github.com>>> >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > CC: @tweise >>> > <https://github.com/tweise> >>> > >> >> > > >>> > >> >> > > — >>> > >> >> > > You are receiving >>> > this because you were >>> > >> >> > > mentioned. >>> > >> >> > > Reply to this email >>> > directly, view it on >>> > >> >> > > GitHub >>> > >> >> > > >>> > < https://github.com/apache/beam/pull/6600#issuecomment-429367754>, >>> > >> >> > > or mute the thread >>> > >> >> > > >>> > < https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T >. >>> > >> >> > > >>> > >> >> > > >>> > >> >> > > >>> >