The point of expansion services is to run at pipeline construction
time so that the caller can build on top of the outputs. For example,
we're hoping to expose Beam's SQL transforms to other languages via an
expansion service and *not* duplicate the logic of parsing SQL
statements to determine the type(s) of the outputs. Even for simpler
IOs, we would like to take advantage of schema information (e.g.
looked up at construction time) to produce results and validate (or
even inform) subsequent construction.
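
For concreteness, a rough sketch of the construction-time interaction I
have in mind, from the Python side (SqlTransform and its import path
here are illustrative of where we'd like to end up, not a settled API):

    import apache_beam as beam
    # Illustrative: assumes a Python wrapper for Java's SQL transform.
    from apache_beam.transforms.sql import SqlTransform

    with beam.Pipeline() as p:
      rows = p | beam.Create([
          beam.Row(name='alice', n=1),
          beam.Row(name='alice', n=3),
          beam.Row(name='bob', n=2),
      ])
      # At *construction* time the SDK asks the expansion service to
      # parse the SQL and return the expanded transform, including its
      # output schema, so the row.name / row.c accesses below can be
      # validated before anything is submitted to a runner.
      counts = rows | SqlTransform(
          'SELECT name, COUNT(*) AS c FROM PCOLLECTION GROUP BY name')
      counts | beam.Map(lambda row: print(row.name, row.c))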

I think we're also making a mistake in talking about "the" expansion
service here, as if there were only one well-defined service that all
pipelines used. If we go the route of deferring some expansion to the
runner, we need a way of naming expansion services. It seems like this
proposal is simply isomorphic to defining new primitive transforms
which some (all?) runners are just expected to understand.
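
To make the naming problem concrete: today the SDK addresses a specific
expansion service explicitly at construction time, roughly as below
(the URN, the payload contents, and the endpoint are all hypothetical,
for illustration only):

    import apache_beam as beam
    from apache_beam.transforms.external import (
        ExternalTransform, ImplicitSchemaPayloadBuilder)

    with beam.Pipeline() as p:
      records = p | ExternalTransform(
          'beam:external:java:myio:read:v1',  # hypothetical URN
          ImplicitSchemaPayloadBuilder({'topic': 'events'}),
          expansion_service='localhost:8097')  # which service, exactly?

If the runner performs the expansion instead, that endpoint has to
become a name the runner can resolve, and the URN/payload pair is, for
all practical purposes, a new primitive transform.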

On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <[email protected]> wrote:
>
>
>
> On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <[email protected]> wrote:
>>
>>
>>
>> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <[email protected]> wrote:
>>>
>>>
>>>
>>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <[email protected]> wrote:
>>>>
>>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>>>
>>>> (it wasn't tagged with the JIRA and therefore not linked)
>>>>
>>>> I think it is worthwhile to explore how we could further detangle the 
>>>> client side Python and Java dependencies.
>>>>
>>>> The expansion service is one more dependency to consider in a build 
>>>> environment. Is it really necessary to expand external transforms prior to 
>>>> submission to the job service?
>>>
>>>
>>> +1, this will make it easier to use external transforms from the already 
>>> familiar client environments.
>>>
>>
>>
>> The intent is to make it so that you CAN (not MUST) run an expansion service 
>> separate from a Runner. Creating a single endpoint that hosts both the Job 
>> and Expansion service is something that gRPC does very easily since you can 
>> host multiple service definitions on a single port.
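
For reference, the single-endpoint setup described above is only a few
lines with gRPC. A minimal Python sketch, using the SDK's in-process
servicers as stand-ins for a real runner's implementations:

    from concurrent import futures
    import grpc
    from apache_beam.portability.api import (
        beam_expansion_api_pb2_grpc, beam_job_api_pb2_grpc)
    from apache_beam.runners.portability import (
        expansion_service, local_job_service)

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # Two service definitions registered on the same server and port.
    beam_job_api_pb2_grpc.add_JobServiceServicer_to_server(
        local_job_service.LocalJobServicer(), server)
    beam_expansion_api_pb2_grpc.add_ExpansionServiceServicer_to_server(
        expansion_service.ExpansionServiceServicer(), server)
    server.add_insecure_port('[::]:8099')
    server.start()
    server.wait_for_termination()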
>
>
> Yes, that's fine. The point here is when the expansion occurs. I believe the 
> runner can also invoke the expansion service, thereby eliminating the 
> expansion service interaction from the client side.
>
>
>>
>>
>>>>
>>>>
>>>> Can we come up with a partially constructed proto that can be produced by 
>>>> just running the Python entry point? Note this would also require pushing 
>>>> the pipeline options parsing into the job service.
>>>
>>>
>>> Why would this require pushing the pipeline options parsing to the job
>>> service? Presumably Python will have enough of an idea about the external
>>> transform to know what options it will need. The necessary bits could be
>>> converted to arguments and become part of that partially constructed proto.
>>>
>>>>
>>>>
>>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <[email protected]> 
>>>> wrote:
>>>>>
>>>>> I found the tracking ticket at BEAM-7966
>>>>>
>>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri 
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Is this alternative still being considered? Creating a portable jar 
>>>>>> sounds like a good solution to reuse the existing runner-specific
>>>>>> deployment mechanism (e.g. Flink k8s operator) and in general simplify 
>>>>>> the deployment story.
>>>>>>
>>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <[email protected]> 
>>>>>> wrote:
>>>>>>>
>>>>>>> The expansion service is a separate service. (The flink jar happens to
>>>>>>> bring both up.) However, there is negotiation to receive/validate the
>>>>>>> pipeline options.
>>>>>>>
>>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <[email protected]> wrote:
>>>>>>> >
>>>>>>> > We would also need to consider cross-language pipelines that 
>>>>>>> > (currently) assume the interaction with an expansion service at 
>>>>>>> > construction time.
>>>>>>> >
>>>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >> > It might also be useful to have the option to just output the 
>>>>>>> >> > proto and artifacts, as alternative to the jar file.
>>>>>>> >>
>>>>>>> >> Sure, that wouldn't be too big a change if we were to decide to go 
>>>>>>> >> the SDK route.
>>>>>>> >>
>>>>>>> >> > For the Flink entry point we would need to allow for the job 
>>>>>>> >> > server to be used as a library.
>>>>>>> >>
>>>>>>> >> We don't need the whole job server, we only need to add a main 
>>>>>>> >> method to FlinkPipelineRunner [1] as the entry point, which would 
>>>>>>> >> basically just do the setup described in the doc then call 
>>>>>>> >> FlinkPipelineRunner::run.
>>>>>>> >>
>>>>>>> >> [1] 
>>>>>>> >> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>>>> >>
>>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib | 
>>>>>>> >> [email protected]
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <[email protected]> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Kyle,
>>>>>>> >>>
>>>>>>> >>> It might also be useful to have the option to just output the proto 
>>>>>>> >>> and artifacts, as alternative to the jar file.
>>>>>>> >>>
>>>>>>> >>> For the Flink entry point we would need to allow for the job server 
>>>>>>> >>> to be used as a library. It would probably not be too hard to have 
>>>>>>> >>> the Flink job constructed via the context execution environment, 
>>>>>>> >>> which would require no changes on the Flink side.
>>>>>>> >>>
>>>>>>> >>> Thanks,
>>>>>>> >>> Thomas
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <[email protected]> 
>>>>>>> >>> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Re Javaless/serverless solution:
>>>>>>> >>>> I take it this would probably mean that we would construct the jar 
>>>>>>> >>>> directly from the SDK. There are advantages to this: full 
>>>>>>> >>>> separation of Python and Java environments, no need for a job 
>>>>>>> >>>> server, and likely a simpler implementation, since we'd no longer 
>>>>>>> >>>> have to work within the constraints of the existing job server 
>>>>>>> >>>> infrastructure. The only downside I can think of is the additional 
>>>>>>> >>>> cost of implementing/maintaining jar creation code in each SDK, 
>>>>>>> >>>> but that cost may be acceptable if it's simple enough.
>>>>>>> >>>>
>>>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib | 
>>>>>>> >>>> [email protected]
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <[email protected]> 
>>>>>>> >>>> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw 
>>>>>>> >>>>> <[email protected]> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> > Before assembling the jar, the job server runs to create the 
>>>>>>> >>>>>> > ingredients. That requires the (matching) Java environment on 
>>>>>>> >>>>>> > the Python developers machine.
>>>>>>> >>>>>>
>>>>>>> >>>>>> We can run the job server and have it create the jar (and if we 
>>>>>>> >>>>>> keep
>>>>>>> >>>>>> the job server running we can use it to interact with the running
>>>>>>> >>>>>> job). However, if the jar layout is simple enough, there's no 
>>>>>>> >>>>>> need to
>>>>>>> >>>>>> even build it from Java.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService 
>>>>>>> >>>>>> API. We
>>>>>>> >>>>>> choose a standard layout of where to put the pipeline 
>>>>>>> >>>>>> description and
>>>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>>>> >>>>>> runner-specific main class whose entry point knows how to read 
>>>>>>> >>>>>> this
>>>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver code)
>>>>>>> >>>>>> into
>>>>>>> >>>>>> one that has a portable pipeline packaged into it for submission 
>>>>>>> >>>>>> to a
>>>>>>> >>>>>> cluster.
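
A sketch of what that "augmenting" could look like from the Python
side; the jar name and the in-jar layout below are made up purely for
illustration:

    import shutil
    import zipfile

    import apache_beam as beam

    p = beam.Pipeline()
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)
    pipeline_proto = p.to_runner_api()  # the portable pipeline description

    # Start from a runner-provided jar whose main class knows to look
    # for a pipeline at the agreed-upon location, and append the
    # pipeline to it, so no Java is involved on the client.
    shutil.copyfile('flink-job-server.jar', 'my-pipeline.jar')
    with zipfile.ZipFile('my-pipeline.jar', 'a') as jar:
      jar.writestr('BEAM-PIPELINE/pipeline.pb',
                   pipeline_proto.SerializeToString())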
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> It would be nice if the Python developer didn't have to run
>>>>>>> >>>>> anything Java at all.
>>>>>>> >>>>>
>>>>>>> >>>>> As we just discussed offline, this could be accomplished by
>>>>>>> >>>>> including the proto that is produced by the SDK in the
>>>>>>> >>>>> pre-existing jar.
>>>>>>> >>>>>
>>>>>>> >>>>> And if the jar has an entry point that creates the Flink job in
>>>>>>> >>>>> the prescribed manner [1], it can be submitted directly to the
>>>>>>> >>>>> Flink REST API. That would allow for a Java-free client.
>>>>>>> >>>>>
>>>>>>> >>>>> [1] 
>>>>>>> >>>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>>>> >>>>>
