[jira] [Updated] (BEAM-11077) Simplify use of the Python Portable runner for Go SDK pipelines

Robert Burke (Jira) Wed, 21 Oct 2020 08:45:10 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-11077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Burke updated BEAM-11077:
--------------------------------
    Description: 
It's possible to execute Go SDK pipelines on any portable Beam runner, using 
the "universal" runner and specifying the endpoint of the job server. However, 
this is inconvenient in some instances as it requires having a standing Job 
Management server for the runner in question.

This task is to simplify using the Python Portable Runner for arbitrary/novice
Go SDK users. While it's generally better for performance to keep a job
management server around so it can execute multiple jobs, this isn't required.

The goal would be to create a "python" runner for the Go SDK, which will start
up the Python Portable Runner job server, submit a pipeline to it in Loopback
mode for execution using the "universal" runner, and wait for the job to
finish.
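
The flow described above (start the job server, wait until it accepts
connections, submit, then shut it down) could be sketched roughly as follows.
This is a minimal illustration using only the Go standard library, not
existing Beam API: waitForEndpoint, runWithJobServer, startFunc, submitFunc,
startFakeServer, and runDemo are all hypothetical names, a plain TCP listener
stands in for the Python job server, and the submit callback stands in for the
Go SDK's "universal" runner.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForEndpoint polls addr until a TCP connection succeeds or ctx is done.
func waitForEndpoint(ctx context.Context, addr string) error {
	var d net.Dialer
	for {
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("job server at %s not ready: %w", addr, ctx.Err())
		case <-time.After(100 * time.Millisecond):
		}
	}
}

// startFunc (hypothetical) launches a job server and reports its endpoint
// plus a function that shuts it down.
type startFunc func() (endpoint string, stop func(), err error)

// submitFunc (hypothetical) submits the pipeline to endpoint and blocks
// until the job finishes.
type submitFunc func(ctx context.Context, endpoint string) error

// runWithJobServer starts the server, waits for it, submits, and stops it.
func runWithJobServer(ctx context.Context, start startFunc, submit submitFunc) error {
	endpoint, stop, err := start()
	if err != nil {
		return fmt.Errorf("starting job server: %w", err)
	}
	defer stop()
	if err := waitForEndpoint(ctx, endpoint); err != nil {
		return err
	}
	return submit(ctx, endpoint)
}

// startFakeServer stands in for launching the Python job server: it only
// opens a local TCP listener on a free port.
func startFakeServer() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// runDemo wires the pieces together and returns the endpoint that the
// pipeline would have been submitted to.
func runDemo() (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	var submittedTo string
	err := runWithJobServer(ctx, startFakeServer, func(_ context.Context, endpoint string) error {
		submittedTo = endpoint // a real runner would submit in Loopback mode here
		return nil
	})
	return submittedTo, err
}

func main() {
	endpoint, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pipeline would be submitted to", endpoint)
}
```

Passing the start and submit steps in as functions keeps the lifecycle logic
independent of how the job server is actually launched (subprocess, Docker, or
otherwise).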

 This will give Go users access to a correct runner for testing, and allow them 
to develop their pipelines confidently before moving them to distributed 
runners like Flink, Spark, or Dataflow.

Ideally, outside of some clearly indicated dependencies (and failures when they
aren't present), a user should be able to import the package, specify
--runner=python, and have their pipeline execute.
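
One way to picture "import the package and specify --runner=python" is a
registry keyed by runner name that packages populate from init(). The sketch
below is an assumption-laden illustration in plain Go: Pipeline, RunnerFunc,
Register, and runByName are hypothetical stand-ins, and it does not use the Go
SDK's actual registration API.

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
)

// Pipeline is a placeholder for the SDK's pipeline type.
type Pipeline struct{ Name string }

// RunnerFunc executes a pipeline and blocks until it finishes.
type RunnerFunc func(ctx context.Context, p *Pipeline) error

// runners maps a runner name to its implementation.
var runners = map[string]RunnerFunc{}

// Register adds a runner under name; runner packages would call it from init.
func Register(name string, fn RunnerFunc) {
	if _, dup := runners[name]; dup {
		panic("runner registered twice: " + name)
	}
	runners[name] = fn
}

// In a real layout this init would live in the hypothetical "python" runner
// package, so importing that package is what makes the name available.
func init() {
	Register("python", func(_ context.Context, p *Pipeline) error {
		fmt.Printf("python runner: would start a job server and submit %q\n", p.Name)
		return nil
	})
}

// runByName looks up the named runner and executes a placeholder pipeline.
func runByName(name string) error {
	fn, ok := runners[name]
	if !ok {
		return fmt.Errorf("unknown runner %q (is its package imported?)", name)
	}
	return fn(context.Background(), &Pipeline{Name: "example"})
}

func main() {
	runner := flag.String("runner", "python", "name of the registered runner to use")
	flag.Parse()
	if err := runByName(*runner); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

An unknown name fails with an explicit error, which matches the goal of clear
failures when a dependency is missing.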

The "long way" for using the Python Portable Runner with the Go SDK is on the
[Go Tips page of the Dev wiki|https://cwiki.apache.org/confluence/display/BEAM/Go+Tips].
The Go-side runner code is in
[https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners]

The Python Portable runner entry point is here: 
[https://github.com/apache/beam/blob/3d296c42f9d9dbb7c2234dec325f6a5255b821ee/sdks/python/apache_beam/runners/portability/portable_runner.py]
 

 

The simplest approach would probably be to require that users have Docker
installed, and for the Beam project to publish a Docker container image that
starts the Python runner job server appropriately. This keeps dependencies
minimal and startup consistent for users, and we can likely reuse the
technique for other purposes. A similar technique would also make developing
new SDKs easier, since new SDKs could use the same infrastructure from the
start.
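
As a rough sketch of the Docker-based approach, the Go side could check that
Docker is available and build the command line for the container. Everything
specific here is an assumption: no such image is published, so jobServerImage
is a placeholder name; port 8099 is only an example; and the container is
assumed to listen on the same port that is published on the host. The extra
networking that Loopback mode needs is not addressed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// jobServerImage is a hypothetical image name, not a published artifact.
const jobServerImage = "example.invalid/beam-python-job-server:latest"

// dockerArgs builds the arguments for "docker run" so the container's job
// server port is published on the same host port.
func dockerArgs(image string, port int) []string {
	p := strconv.Itoa(port)
	return []string{"run", "--rm", "--detach", "--publish", p + ":" + p, image}
}

// checkDocker fails with a clear message when Docker is not installed.
func checkDocker() error {
	if _, err := exec.LookPath("docker"); err != nil {
		return fmt.Errorf("the python runner requires Docker on PATH: %w", err)
	}
	return nil
}

func main() {
	if err := checkDocker(); err != nil {
		fmt.Println(err)
		return
	}
	// Print the command rather than running it, since the image is hypothetical.
	args := dockerArgs(jobServerImage, 8099)
	fmt.Println("would run: docker " + strings.Join(args, " "))
}
```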

Other approaches to solve the problem are of course welcome.

  was:
It's possible to execute Go SDK pipelines on any portable Beam runner, using 
the "universal" runner and specifying the endpoint of the job server. However, 
this is inconvenient in some instances as it requires having a standing Job 
Management server for the runner in question.

This task is to simplify using the Python Portable Runner for arbitrary/novice
Go SDK users. While it's generally better for performance to keep a job
management server around so it can execute multiple jobs, this isn't required.

The goal would be to create a "python" runner for the Go SDK, which will start 
up the python portable runner job server, and submit a pipeline to it in 
Loopback mode for execution, using the "universal runner", and wait for the job 
to finish.

 This will give Go users access to a correct runner for testing, and allow them 
to develop their pipelines confidently before moving them to distributed 
runners like Flink, Spark, or Dataflow.


Ideally outside of some clearly indicated dependencies (and failures when they 
aren't present), a user should be able to import the package and specify 
--runner=python, and have their pipeline execute.

The "long way" for using the Python Portable Runner with the Go SDK is on the
[Go Tips page of the Dev wiki|https://cwiki.apache.org/confluence/display/BEAM/Go+Tips].
The Go-side runner code is in
[https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners]

The Python Portable runner entry point is here: 
[https://github.com/apache/beam/blob/3d296c42f9d9dbb7c2234dec325f6a5255b821ee/sdks/python/apache_beam/runners/portability/portable_runner.py]
 

 

The simplest way for this would probably be to require users have Docker 
installed, and for the Beam project to publish a Docker Container image that 
can start up the Python Runner job server appropriately. This keeps the 
dependencies minimal, and start up consistent for users, and we likely can 
re-use the technique for other purposes.


 Other approaches to solve the problem are of course welcome.


> Simplify use of the Python Portable runner for Go SDK pipelines
> ---------------------------------------------------------------
>
>                 Key: BEAM-11077
>                 URL: https://issues.apache.org/jira/browse/BEAM-11077
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-go
>            Reporter: Robert Burke
>            Priority: P3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
