+1 to Option 3 I'd rather have each SDK have a single point of well defined complexity to do something general, than have to make tiny but simple changes. Less toil and maintenance in the long run per SDK.
Similarly I don't have time to make it happen right now. On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <t...@apache.org> wrote: > Discovering options from the job server would be the only way to perform > full validation (and provide upfront help to the user). > > The runner cannot perform full validation, since it is not aware of the > user and SDK options (that it has to forward to the SDK worker). > > Special runner options flag to forward unknown options also wouldn't fully > solve the problem (besides being subject to change in the future). Let's > say runner understands --fancy-int-option and the user repeats that option > multiple times. Not knowing the type of option, the SDK will pass it as a > list and the runner will fail. > > Replicating SDK options is a workaround for known runners but it really > goes against the idea of portability (making assumptions outside the API > contract). We already have runners implemented outside of Beam and hope for > the ecosystem to grow. What we do for options should also work for those > runners. > > I'm with Luke here that options discovery provides the best user > experience and can address the other issues. Even the scenario of multiple > intermediate runners could be addressed by forwarding the unparsed options > with the discovery call. I don't see SDK implementation complexity as a > significant drawback so far. > > Thomas > > > On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <lc...@google.com> wrote: > >> >> On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <m...@apache.org> >> wrote: >> >>> Thank you Robert and Lukasz for your points. >>> >>> > Note that I believe that we will want to have multiple URLs to support >>> cross language pipelines since we will want to be able to ask other SDK >>> languages/versions for their "list" of supported PipelineOptions. >>> >>> Why is that? The Runner itself is the source of truth for its options. >>> >> >> Because other languages (or even different versions of the same language) >> may have their own options. For example, the Go SDK talks to a Java service >> which applies a SQL transform and returns the expanded form (this may >> require knowledge of some options like credentials for the filesystem, ...) >> and then talks to a Python service that performs another transform >> expansion. Finally the pipeline containing Go, Java and Python transforms >> is submitted to a runner and it performs its own internal >> replacements/expansions related to executing the pipeline. >> >> >>> Everything else is SDK-related and should be validated there. >>> >>> I imagined the process to go like this: >>> >>> a) Parse options to find JobServer URL >>> a) Retrieve options from JobServer >>> c) Parse all options >>> ...continue as always... >>> >>> An option is just represented by a name and a type. There is nothing >>> more to it, at least as of now. So it should be possible to parse them >>> in the SDK without much further work. >>> >>> Nevertheless, I agree with your sentiment, Robert. The "runner_option" >>> flag would prevent additional complexity. I still don't prefer it >>> because it's not nice from an end user perspective. If we were to >>> implement it, I would definitely go for the "option promotion" which you >>> mentioned. >>> >>> I hadn't thought about delegating runners, although the PortableRunner >>> is basically a delegating Runner. If that was an important feature, I >>> suppose the "runner_option" would be the preferred way. >>> >>> All in all, since there doesn't seem to be an excitement to implement >>> JobServer option retrieval and we will need the help of all SDK >>> developers, "runner_option" seems to be the more likely path. >>> >> >> I would say its a lack of time for people to improve this story over >> others but it can be revisited at some point in the future and I agree that >> using --runner_option as an interim provides value. >> >> >>> >>> -Max >>> >>> On 08.11.18 21:50, Lukasz Cwik wrote: >>> > The purpose of the spec would be to provide the names, type and >>> > descriptions of the options. We don't need anything beyond the JSON >>> > types (string, number, bool, object, list) because the only ambiguity >>> we >>> > get is how do we parse command line string into the JSON type (and >>> that >>> > ambiguity is actually only between string and non-string since all the >>> > other JSON types are unambiguous). >>> > >>> > Also, I believe the flow would be >>> > 1) Parse options >>> > a) Find the URL from args specified and/or additional methods on >>> > PipelineOptions that exposes a programmatic way to set the URL during >>> > parsing. >>> > b) Query URL for option specs >>> > c) Parse the remainder of the options >>> > 2) Construct pipeline >>> > 3) Choose runner >>> > 4) Submit job to runner >>> > >>> > Note that I believe that we will want to have multiple URLs to support >>> > cross language pipelines since we will want to be able to ask other >>> SDK >>> > languages/versions for their "list" of supported PipelineOptions. >>> > >>> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <rober...@google.com >>> > <mailto:rober...@google.com>> wrote: >>> > >>> > There's two questions here: >>> > >>> > (A) What do we do in the short term? >>> > >>> > I think adding every runner option to every SDK is not sustainable >>> > (n*m work, assuming every SDK knows about every runner), and >>> having a >>> > patchwork of options that were added as one-offs to SDKs is not >>> > desirable either. Furthermore, it seems difficult to parse unknown >>> > options as if they were valid options, so my preference here would >>> be >>> > to just use a special runner_option flag. (One could also pass a >>> set >>> > of unparsed/unvalidated runner options to the runner, even if >>> they're >>> > not distinguished for the user, and runners (or any intermediates) >>> > could run a "promote" operation that promotes any of these unknowns >>> > that they recognize to real options before further processing. The >>> > parsing would be done as repeated-string, and not be intermingled >>> with >>> > the actually validated options. This is essential a variant of >>> > option 1.) >>> > >>> > (B) What do do in the long term? While the JobServer approach >>> sounds >>> > nice, I think it introduces a lot of complexity (we have too much >>> of >>> > that already) and still doesn't completely solve the problem. In >>> > particular, it changes the flow from >>> > >>> > 1. Parse options >>> > 2. Construct pipeline >>> > 3. Choose runner >>> > 4. Submit job to runner >>> > >>> > to >>> > >>> > 1. Parse options >>> > 2. Construct pipeline >>> > 3. Choose runner >>> > 4a. Query runner for option specs >>> > 4b. Re-parse options >>> > 4c. Submit job to runner >>> > >>> > In particular, doing 4b in the SDK rather than just let the runner >>> > itself do the validation as part of (4) doesn't save much and >>> forces >>> > us to come up with a (probably incomplete) spec as to how to define >>> > options, their types, and their validations. It also means that a >>> > delegating runner must choose and interact with its downstream >>> > runner(s) synchronously, else we haven't actually solved the issue. >>> > >>> > For these reasons, I don't think we even want to go with the >>> JobServer >>> > approach in the long term, which has bearing on (A). >>> > >>> > - Robert >>> > >>> > >>> > On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <m...@apache.org >>> > <mailto:m...@apache.org>> wrote: >>> > > >>> > > +1 >>> > > >>> > > If the preferred approach is to eventually have the JobServer >>> > serve the >>> > > options, then the best intermediate solution is to replicate >>> common >>> > > options in the SDKs. >>> > > >>> > > If we went down the "--runner_option" path, we would end up with >>> > > multiple ways of specifying the same options. We would >>> eventually >>> > have >>> > > to deprecate "runner options" once we have the JobServer >>> > approach. I'd >>> > > like to avoid that. >>> > > >>> > > For the upcoming release we can revert the changes again and >>> add the >>> > > most common missing options to the SDKs. Then hopefully we >>> should >>> > have >>> > > fetching implemented for the release after. >>> > > >>> > > Do you think that is feasible? >>> > > >>> > > Thanks, >>> > > Max >>> > > >>> > > On 30.10.18 23:00, Lukasz Cwik wrote: >>> > > > I still like #3 the most, just can't devote the time to get it >>> > done. >>> > > > >>> > > > Instead of going with a fully implemented #3, we could >>> hardcode >>> > the a >>> > > > subset of options and types within each SDK until the job >>> server is >>> > > > ready to provide this information and then migrate to the >>> > "full" list. >>> > > > This would be an easy path for SDKs to take on. They could >>> > "know" of a >>> > > > few well known options, and if they want to support all >>> > options, they >>> > > > implement the integration with the job server. >>> > > > >>> > > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels >>> > <m...@apache.org <mailto:m...@apache.org> >>> > > > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote: >>> > > > >>> > > > > I would prefer we don't introduce a (quirky) way of >>> passing >>> > > > unknown options that forces users to type JSON into the >>> > command line >>> > > > (or similar acrobatics) >>> > > > Same here, the JSON approach seems technically nice but >>> too >>> > bulky >>> > > > for users. >>> > > > >>> > > > > To someone wanting to run a pipeline, all options are >>> > equally >>> > > > important, whether they are application specific, SDK >>> > specific or >>> > > > runner specific. >>> > > > >>> > > > I'm also reluctant to force users to use >>> `--runner_option=` >>> > because the >>> > > > division into "Runner" options and other options seems >>> > rather arbitrary >>> > > > to users. Most built-in options are also Runner-related. >>> > > > >>> > > > > It should be possible to *optionally* qualify/scope (to >>> > cover >>> > > > cases where there is ambiguity), but otherwise I prefer >>> the >>> > format >>> > > > we currently have. >>> > > > >>> > > > Yes, namespacing is a problem. What happens if the user >>> > defines a >>> > > > custom >>> > > > PipelineOption which clashes with one of the builtin ones? >>> > If both are >>> > > > >>> > > > set, which one is actually applied? >>> > > > >>> > > > >>> > > > Note that PipelineOptions so far has been treating name >>> > equality to mean >>> > > > option equality and the Java implementation has a bunch of >>> > strict checks >>> > > > to make sure that default values aren't used for duplicate >>> > definitions, >>> > > > they have the same type, etc... >>> > > > With 1), you fail the job if the runner can't understand your >>> > option >>> > > > because its not represented the same way. User then needs to >>> fix-up >>> > > > their declaration of the option name. >>> > > > With 2), there are no name conflicts, the SDK will need to >>> > validate that >>> > > > the option isn't set in both formats and error out if it is >>> before >>> > > > pipeline submission time. >>> > > > With 3), you can prefetch all the options and error out to >>> the user >>> > > > during argument parsing time. >>> > > > >>> > > > >>> > > > >>> > > > Here is a summary of the possible paths going forward: >>> > > > >>> > > > >>> > > > 1) Validate PipelineOptions at Runner side >>> > > > ========================================== >>> > > > >>> > > > The main issue raised here was that we want to move away >>> > from parsing >>> > > > arguments which look like options without validating them. >>> > An easy fix >>> > > > would be to actually validate them on the Runner side. >>> This >>> > could be >>> > > > done by changing the deserialization code of >>> > PipelineOptions which so >>> > > > far ignores unknown JSON options. >>> > > > >>> > > > See: PipelineOptionsTranslation.fromProto(Struct >>> protoOptions) >>> > > > >>> > > > Actually, this wouldn't work for user-defined >>> > PipelineOptions because >>> > > > they might not be known to the Runner (if they are defined >>> > in Python). >>> > > > >>> > > > >>> > > > 2) Introduce a Runner-Option Flag >>> > > > ================================= >>> > > > >>> > > > In this approach we would try to add as many pipeline >>> > options for a >>> > > > Runner to the SDK, but allow additional Runner options to >>> > be passed >>> > > > using the `--runner-option=key=val` flag. The Runner, like >>> > in 1), would >>> > > > have to ensure validation. I think this has been the most >>> > favored >>> > > > way so >>> > > > far. Going forward, that means that `--parallelism=4` and >>> > > > `--runner-option=parallelism=4` will have the same effect >>> > for the Flink >>> > > > Runner. >>> > > > >>> > > > >>> > > > 3) Implement Fetching of Options from JobServer >>> > > > =============================================== >>> > > > >>> > > > The options are retrieved from the JobServer before >>> > submitting the >>> > > > pipeline. I think this would be ideal but, as mentioned >>> > before, it >>> > > > increases the complexity for implementing new SDKs and >>> > might overall >>> > > > just not be worth the effort. >>> > > > >>> > > > >>> > > > What do you think? I'd implement 2) for the next release, >>> > unless there >>> > > > are advocates for a different approach. >>> > > > >>> > > > Cheers, >>> > > > Max >>> > >>> >>