Thanks! It looks like this is exactly what I need, though mostly for the Java SDK. 
Do you know if anyone is working on this Jira issue?

> On 15 May 2020, at 18:01, Kyle Weaver <[email protected]> wrote:
> 
> > Yes, you can start Docker containers beforehand using the worker_pool option:
> 
> However, it only works for Python. Java doesn't have it yet: 
> https://issues.apache.org/jira/browse/BEAM-8137
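> 
> Once that issue is resolved, I'd expect the Java equivalent to just be the portable pipeline options. A hypothetical sketch (the option names follow the existing PortablePipelineOptions interface, but the EXTERNAL environment is exactly what BEAM-8137 is about, so none of this actually works on the Java side today):
> 
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.options.PortablePipelineOptions;
> 
> // Hypothetical sketch, e.g. inside the pipeline's main(String[] args).
> // Assumes BEAM-8137 adds EXTERNAL environment support to the Java SDK harness.
> PortablePipelineOptions options =
>     PipelineOptionsFactory.fromArgs(args).as(PortablePipelineOptions.class);
> options.setDefaultEnvironmentType("EXTERNAL");          // instead of the default DOCKER environment
> options.setDefaultEnvironmentConfig("localhost:50000"); // address of the pre-started worker pool
> Pipeline pipeline = Pipeline.create(options);
> 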
> On Fri, May 15, 2020 at 12:00 PM Kyle Weaver <[email protected]> wrote:
> > 2. Is it possible to pre-run SDK Harness containers and reuse them for 
> > every Portable Runner pipeline? I could save quite a lot of time on this for 
> > more complicated pipelines.
> 
> Yes, you can start Docker containers beforehand using the worker_pool option:
> 
> docker run -p=50000:50000 apachebeam/python3.7_sdk --worker_pool  # or some other port publishing
> 
> and then in your pipeline options set:
> 
> --environment_type=EXTERNAL --environment_config=localhost:50000
> 
> On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <[email protected]> wrote:
> Hello,
> 
> I’m trying to optimize the runtime of my pipeline when running it with the 
> Portable Runner, and I have some related questions. 
> 
> This is a cross-language pipeline, written with the Java SDK, which executes 
> some Python code through the “External.of()” transform and my custom Python 
> Expansion Service. I use the Docker-based SDK Harness for both Java and Python. 
> In a primitive form, the pipeline looks like this: 
> 
> 
> [Source (Java)] -> [MyTransform1 (Java)] -> [External (execute Python code with Python SDK)] -> [MyTransform2 (Java)]
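> 
> In code, a stripped-down version of it would be something along these lines (just a sketch: the source, the two Java transforms, the URN, the payload and the expansion service address are placeholders for the real ones):
> 
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.Create;
> import org.apache.beam.sdk.transforms.External;
> import org.apache.beam.sdk.transforms.MapElements;
> import org.apache.beam.sdk.values.PCollection;
> import org.apache.beam.sdk.values.TypeDescriptors;
> 
> public class CrossLanguagePipeline {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
> 
>     // Source + MyTransform1 (placeholders for the real Java transforms).
>     PCollection<String> javaSide =
>         p.apply("Source", Create.of("a", "b", "c"))
>          .apply("MyTransform1",
>              MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));
> 
>     // Python code executed through the custom Python Expansion Service.
>     // URN, payload and endpoint are placeholders for the real ones.
>     PCollection<String> pythonSide =
>         javaSide.apply("ExternalPythonStep",
>             External.<PCollection<String>, String>of(
>                 "my:custom:python_transform:v1", new byte[0], "localhost:8097"));
> 
>     // MyTransform2 back on the Java side.
>     pythonSide.apply("MyTransform2",
>         MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-done"));
> 
>     p.run().waitUntilFinish();
>   }
> }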
> 
> 
> 
> While running this pipeline with the Portable Spark Runner, I see that we spend 
> quite a lot of time on artifact staging (in our case, the real pipeline has quite 
> a lot of artifacts) and on launching a Docker container for every Spark stage. 
> So, my questions are the following:
> 
> 1. Is there any internal Beam functionality to pre-stage or, at least, cache 
> already-staged artifacts? Since the same pipeline will be executed many times 
> in a row, there is no reason to stage the same artifacts on every run.
> 
> 2. Is it possible to pre-run SDK Harness containers and reuse them for every 
> Portable Runner pipeline? I could save quite a lot of time on this for more 
> complicated pipelines.
> 
> 
> 
> Well, I guess I can find some workarounds for this, but I wanted to ask first 
> in case there is a better way to do it in Beam.
> 
> 
> Regards,
> Alexey
