Following up on discussion in this morning's OSS runners meeting, I have uploaded a draft PR for the full implementation (job creation + execution): https://github.com/apache/beam/pull/9408
Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]

On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw <[email protected]> wrote:
> The point of expansion services is to run at pipeline construction time so that the caller can build on top of the outputs. E.g. we're hoping to expose Beam's SQL transforms to other languages via an expansion service and *not* duplicate the logic of parsing the SQL statements to determine the type(s) of the outputs. Even for simpler IOs, we would like to take advantage of schema information (e.g. looked up at construction time) to produce results and validate (or even inform) subsequent construction.
>
> I think we're also making a mistake in talking about "the" expansion service here, as if there were only one well-defined service that all pipelines used. If we go the route of deferring some expansion to the runner, we need a way of naming expansion services. It seems like this proposal is simply isomorphic to defining new primitive transforms which some (all?) runners are just expected to understand.
>
> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <[email protected]> wrote:
>> On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <[email protected]> wrote:
>>> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <[email protected]> wrote:
>>>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <[email protected]> wrote:
>>>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>>>> (it wasn't tagged with the JIRA and therefore not linked)
>>>>>
>>>>> I think it is worthwhile to explore how we could further disentangle the client-side Python and Java dependencies.
>>>>>
>>>>> The expansion service is one more dependency to consider in a build environment. Is it really necessary to expand external transforms prior to submission to the job service?
>>>>
>>>> +1, this will make it easier to use external transforms from the already familiar client environments.
>>>
>>> The intent is to make it so that you CAN (not MUST) run an expansion service separate from a Runner. Creating a single endpoint that hosts both the Job and Expansion service is something that gRPC does very easily, since you can host multiple service definitions on a single port.
>>
>> Yes, that's fine. The point here is when the expansion occurs. I believe the runner can also invoke the expansion service, thereby eliminating the expansion service interaction from the client side.
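As an aside, the single-port hosting Lukasz mentions really is a one-liner pattern in gRPC Java. A minimal sketch, with the concrete Job and Expansion service implementations left abstract (any BindableService can be passed in):

    import java.io.IOException;
    import io.grpc.BindableService;
    import io.grpc.Server;
    import io.grpc.ServerBuilder;

    public class CombinedEndpoint {
      // Hosts any number of gRPC services on one port. gRPC routes each
      // request by fully-qualified service name, so the Job and Expansion
      // services can share a single endpoint and a single Server instance.
      public static Server serve(int port, BindableService... services) throws IOException {
        ServerBuilder<?> builder = ServerBuilder.forPort(port);
        for (BindableService service : services) {
          builder.addService(service);
        }
        return builder.build().start();
      }
    }

A runner would then call something like serve(8099, jobService, expansionService), where jobService and expansionService are placeholders for the concrete servicer implementations, and clients would reach both services at the same address.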
>>>>> Can we come up with a partially constructed proto that can be produced by just running the Python entry point? Note this would also require pushing the pipeline options parsing into the job service.
>>>>
>>>> Why would this require pushing the pipeline options parsing to the job service? Assuming that Python has enough of an idea about the external transform to know what options it will need, the necessary bits could be converted to arguments and be part of that partially constructed proto.
>>>>
>>>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <[email protected]> wrote:
>>>>>> I found the tracking ticket at BEAM-7966
>>>>>>
>>>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <[email protected]> wrote:
>>>>>>> Is this alternative still being considered? Creating a portable jar sounds like a good solution to re-use the existing runner-specific deployment mechanism (e.g. the Flink k8s operator) and in general simplify the deployment story.
>>>>>>>
>>>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>>> The expansion service is a separate service. (The Flink jar happens to bring both up.) However, there is negotiation to receive/validate the pipeline options.
>>>>>>>>
>>>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <[email protected]> wrote:
>>>>>>>>> We would also need to consider cross-language pipelines that (currently) assume the interaction with an expansion service at construction time.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <[email protected]> wrote:
>>>>>>>>>>> It might also be useful to have the option to just output the proto and artifacts, as an alternative to the jar file.
>>>>>>>>>>
>>>>>>>>>> Sure, that wouldn't be too big a change if we were to decide to go the SDK route.
>>>>>>>>>>
>>>>>>>>>>> For the Flink entry point we would need to allow for the job server to be used as a library.
>>>>>>>>>>
>>>>>>>>>> We don't need the whole job server; we only need to add a main method to FlinkPipelineRunner [1] as the entry point, which would basically just do the setup described in the doc and then call FlinkPipelineRunner::run.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>>>>>>>
>>>>>>>>>> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
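The main method Kyle describes would be small. A rough sketch of what such an entry point might look like; the class name, the resource path, and the runOnFlink hand-off are illustrative assumptions, not the actual layout or API:

    import java.io.InputStream;
    import org.apache.beam.model.pipeline.v1.RunnerApi;

    public class FlinkJarEntryPoint {
      // Assumed location of the staged pipeline inside the jar; the real
      // layout is whatever the jar-creation side and the entry point agree on.
      private static final String PIPELINE_RESOURCE = "/beam-pipeline/pipeline.pb";

      public static void main(String[] args) throws Exception {
        try (InputStream in = FlinkJarEntryPoint.class.getResourceAsStream(PIPELINE_RESOURCE)) {
          if (in == null) {
            throw new IllegalStateException("No pipeline packaged in this jar");
          }
          // Deserialize the portable pipeline that was packaged into this jar.
          RunnerApi.Pipeline pipeline = RunnerApi.Pipeline.parseFrom(in);
          runOnFlink(pipeline);
        }
      }

      // Placeholder for the setup described in the doc: construct a
      // FlinkPipelineRunner from the staged options/artifacts and call run.
      private static void runOnFlink(RunnerApi.Pipeline pipeline) {
        throw new UnsupportedOperationException("wire up FlinkPipelineRunner here");
      }
    }

Since Flink hands a main class like this the context execution environment when the jar is submitted through its REST API, no changes on the Flink side should be needed, which is Thomas's point below.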
>>>>>>>>>> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <[email protected]> wrote:
>>>>>>>>>>> Hi Kyle,
>>>>>>>>>>>
>>>>>>>>>>> It might also be useful to have the option to just output the proto and artifacts, as an alternative to the jar file.
>>>>>>>>>>>
>>>>>>>>>>> For the Flink entry point we would need to allow for the job server to be used as a library. It would probably not be too hard to have the Flink job constructed via the context execution environment, which would require no changes on the Flink side.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Thomas
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <[email protected]> wrote:
>>>>>>>>>>>> Re Javaless/serverless solution: I take it this would probably mean that we would construct the jar directly from the SDK. There are advantages to this: full separation of Python and Java environments, no need for a job server, and likely a simpler implementation, since we'd no longer have to work within the constraints of the existing job server infrastructure. The only downside I can think of is the additional cost of implementing/maintaining jar creation code in each SDK, but that cost may be acceptable if it's simple enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <[email protected]> wrote:
>>>>>>>>>>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>>>>>>>>>> Before assembling the jar, the job server runs to create the ingredients. That requires the (matching) Java environment on the Python developer's machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can run the job server and have it create the jar (and if we keep the job server running we can use it to interact with the running job). However, if the jar layout is simple enough, there's no need to even build it from Java.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We choose a standard layout of where to put the pipeline description and artifacts, and can "augment" an existing jar (one that has a runner-specific main class whose entry point knows how to read this data to kick off a pipeline as if it were a user's driver code) into one that has a portable pipeline packaged into it for submission to a cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It would be nice if the Python developer doesn't have to run anything Java at all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As we just discussed offline, this could be accomplished by including the proto that is produced by the SDK in the pre-existing jar.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And if the jar has an entry point that creates the Flink job in the prescribed manner [1], it can be directly submitted to the Flink REST API. That would allow for a Java-free client.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
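Worth noting that the "augment an existing jar" step Robert describes is plain zip manipulation, so each SDK could implement it with its standard library (e.g. Python's zipfile) without running any Java at all. A rough sketch of the idea, in Java only for consistency with the snippets above, reusing the same assumed entry name as the entry-point sketch:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import java.util.zip.ZipOutputStream;

    public class PortableJarCreator {
      // Copies a pre-built runner jar (which already contains the
      // runner-specific main class) and adds the serialized pipeline proto
      // under the layout that jar's entry point expects.
      public static void augment(Path baseJar, byte[] pipelineProto, Path outputJar)
          throws IOException {
        byte[] buf = new byte[8192];
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(baseJar));
            ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(outputJar))) {
          // Copy every entry of the existing jar unchanged.
          for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
            out.putNextEntry(new ZipEntry(entry.getName()));
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
              out.write(buf, 0, n);
            }
            out.closeEntry();
          }
          // Append the pipeline description; staged artifacts would be added
          // the same way under the agreed-upon layout.
          out.putNextEntry(new ZipEntry("beam-pipeline/pipeline.pb")); // assumed path
          out.write(pipelineProto);
          out.closeEntry();
        }
      }
    }

The resulting jar can then go through the cluster's normal submission path (e.g. the Flink REST API), which is what makes the client Java-free. This is also the piece Kyle notes each SDK would have to implement and maintain; since zip handling is in every SDK's standard library, that cost may indeed be acceptable.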
