Following up on discussion in this morning's OSS runners meeting, I have uploaded a draft PR for the full implementation (job creation + execution): https://github.com/apache/beam/pull/9408
Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]

On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw <[email protected]> wrote:
> The point of expansion services is to run at pipeline construction time so that the caller can build on top of the outputs. E.g. we're hoping to expose Beam's SQL transforms to other languages via an expansion service and *not* duplicate the logic of parsing the SQL statements to determine the type(s) of the outputs. Even for simpler IOs, we would like to take advantage of schema information (e.g. looked up at construction time) to produce results and validate (or even inform) subsequent construction.
>
> I think we're also making a mistake in talking about "the" expansion service here, as if there were only one well-defined service that all pipelines used. If we go the route of deferring some expansion to the runner, we need a way of naming expansion services. It seems like this proposal is simply isomorphic to defining new primitive transforms which some (all?) runners are just expected to understand.
>
> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <[email protected]> wrote:
>> On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <[email protected]> wrote:
>>> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <[email protected]> wrote:
>>>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <[email protected]> wrote:
>>>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>>>> (it wasn't tagged with the JIRA and therefore not linked)
>>>>>
>>>>> I think it is worthwhile to explore how we could further disentangle the client-side Python and Java dependencies.
>>>>>
>>>>> The expansion service is one more dependency to consider in a build environment. Is it really necessary to expand external transforms prior to submission to the job service?
>>>>
>>>> +1, this will make it easier to use external transforms from the already familiar client environments.
>>>
>>> The intent is to make it so that you CAN (not MUST) run an expansion service separate from a Runner. Creating a single endpoint that hosts both the Job and Expansion service is something that gRPC does very easily, since you can host multiple service definitions on a single port.
>>
>> Yes, that's fine. The point here is when the expansion occurs. I believe the runner can also invoke the expansion service, thereby eliminating the expansion service interaction from the client side.
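As an aside, the single-port hosting Lukasz mentions really is a one-liner pattern in gRPC Java. A minimal sketch, with the concrete Job and Expansion service implementations left abstract (any BindableService can be passed in):

    import java.io.IOException;
    import io.grpc.BindableService;
    import io.grpc.Server;
    import io.grpc.ServerBuilder;

    public class CombinedEndpoint {
      // Hosts any number of gRPC services on one port. gRPC routes each
      // request by fully-qualified service name, so the Job and Expansion
      // services can share a single endpoint and a single Server instance.
      public static Server serve(int port, BindableService... services) throws IOException {
        ServerBuilder<?> builder = ServerBuilder.forPort(port);
        for (BindableService service : services) {
          builder.addService(service);
        }
        return builder.build().start();
      }
    }

A runner would then call something like serve(8099, jobService, expansionService), where jobService and expansionService are placeholders for the concrete servicer implementations, and clients would reach both services at the same address.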
>>>>> Can we come up with a partially constructed proto that can be produced by just running the Python entry point? Note this would also require pushing the pipeline options parsing into the job service.
>>>>
>>>> Why would this require pushing the pipeline options parsing to the job service? Assuming that Python has enough of an idea about the external transform to know what options it will need, the necessary bits could be converted to arguments and be part of that partially constructed proto.
>>>>
>>>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <[email protected]> wrote:
>>>>>> I found the tracking ticket at BEAM-7966
>>>>>>
>>>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <[email protected]> wrote:
>>>>>>> Is this alternative still being considered? Creating a portable jar sounds like a good solution to re-use the existing runner-specific deployment mechanism (e.g. the Flink k8s operator) and in general simplify the deployment story.
>>>>>>>
>>>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>>> The expansion service is a separate service. (The Flink jar happens to bring both up.) However, there is negotiation to receive/validate the pipeline options.
>>>>>>>>
>>>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <[email protected]> wrote:
>>>>>>>>> We would also need to consider cross-language pipelines that (currently) assume the interaction with an expansion service at construction time.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <[email protected]> wrote:
>>>>>>>>>>> It might also be useful to have the option to just output the proto and artifacts, as an alternative to the jar file.
>>>>>>>>>>
>>>>>>>>>> Sure, that wouldn't be too big a change if we were to decide to go the SDK route.
>>>>>>>>>>
>>>>>>>>>>> For the Flink entry point we would need to allow for the job server to be used as a library.
>>>>>>>>>>
>>>>>>>>>> We don't need the whole job server; we only need to add a main method to FlinkPipelineRunner [1] as the entry point, which would basically just do the setup described in the doc and then call FlinkPipelineRunner::run.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>>>>>>>
>>>>>>>>>> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
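The main method Kyle describes would be small. A rough sketch of what such an entry point might look like; the class name, the resource path, and the runOnFlink hand-off are illustrative assumptions, not the actual layout or API:

    import java.io.InputStream;
    import org.apache.beam.model.pipeline.v1.RunnerApi;

    public class FlinkJarEntryPoint {
      // Assumed location of the staged pipeline inside the jar; the real
      // layout is whatever the jar-creation side and the entry point agree on.
      private static final String PIPELINE_RESOURCE = "/beam-pipeline/pipeline.pb";

      public static void main(String[] args) throws Exception {
        try (InputStream in = FlinkJarEntryPoint.class.getResourceAsStream(PIPELINE_RESOURCE)) {
          if (in == null) {
            throw new IllegalStateException("No pipeline packaged in this jar");
          }
          // Deserialize the portable pipeline that was packaged into this jar.
          RunnerApi.Pipeline pipeline = RunnerApi.Pipeline.parseFrom(in);
          runOnFlink(pipeline);
        }
      }

      // Placeholder for the setup described in the doc: construct a
      // FlinkPipelineRunner from the staged options/artifacts and call run.
      private static void runOnFlink(RunnerApi.Pipeline pipeline) {
        throw new UnsupportedOperationException("wire up FlinkPipelineRunner here");
      }
    }

Since Flink hands a main class like this the context execution environment when the jar is submitted through its REST API, no changes on the Flink side should be needed, which is Thomas's point below.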
>>>>>>>>>> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <[email protected]> wrote:
>>>>>>>>>>> Hi Kyle,
>>>>>>>>>>>
>>>>>>>>>>> It might also be useful to have the option to just output the proto and artifacts, as an alternative to the jar file.
>>>>>>>>>>>
>>>>>>>>>>> For the Flink entry point we would need to allow for the job server to be used as a library. It would probably not be too hard to have the Flink job constructed via the context execution environment, which would require no changes on the Flink side.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Thomas
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <[email protected]> wrote:
>>>>>>>>>>>> Re Javaless/serverless solution: I take it this would probably mean that we would construct the jar directly from the SDK. There are advantages to this: full separation of Python and Java environments, no need for a job server, and likely a simpler implementation, since we'd no longer have to work within the constraints of the existing job server infrastructure. The only downside I can think of is the additional cost of implementing/maintaining jar creation code in each SDK, but that cost may be acceptable if it's simple enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <[email protected]> wrote:
>>>>>>>>>>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>>>>>>>>>> Before assembling the jar, the job server runs to create the ingredients. That requires the (matching) Java environment on the Python developer's machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can run the job server and have it create the jar (and if we keep the job server running we can use it to interact with the running job). However, if the jar layout is simple enough, there's no need to even build it from Java.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We choose a standard layout of where to put the pipeline description and artifacts, and can "augment" an existing jar (one that has a runner-specific main class whose entry point knows how to read this data to kick off a pipeline as if it were a user's driver code) into one that has a portable pipeline packaged into it for submission to a cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It would be nice if the Python developer doesn't have to run anything Java at all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As we just discussed offline, this could be accomplished by including the proto that is produced by the SDK in the pre-existing jar.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And if the jar has an entry point that creates the Flink job in the prescribed manner [1], it can be directly submitted to the Flink REST API. That would allow for a Java-free client.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
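Worth noting that the "augment an existing jar" step Robert describes is plain zip manipulation, so each SDK could implement it with its standard library (e.g. Python's zipfile) without running any Java at all. A rough sketch of the idea, in Java only for consistency with the snippets above, reusing the same assumed entry name as the entry-point sketch:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import java.util.zip.ZipOutputStream;

    public class PortableJarCreator {
      // Copies a pre-built runner jar (which already contains the
      // runner-specific main class) and adds the serialized pipeline proto
      // under the layout that jar's entry point expects.
      public static void augment(Path baseJar, byte[] pipelineProto, Path outputJar)
          throws IOException {
        byte[] buf = new byte[8192];
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(baseJar));
            ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(outputJar))) {
          // Copy every entry of the existing jar unchanged.
          for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
            out.putNextEntry(new ZipEntry(entry.getName()));
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
              out.write(buf, 0, n);
            }
            out.closeEntry();
          }
          // Append the pipeline description; staged artifacts would be added
          // the same way under the agreed-upon layout.
          out.putNextEntry(new ZipEntry("beam-pipeline/pipeline.pb")); // assumed path
          out.write(pipelineProto);
          out.closeEntry();
        }
      }
    }

The resulting jar can then go through the cluster's normal submission path (e.g. the Flink REST API), which is what makes the client Java-free. This is also the piece Kyle notes each SDK would have to implement and maintain; since zip handling is in every SDK's standard library, that cost may indeed be acceptable.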
