I don't think anyone is working on it, but given that Java already supports
the LOOPBACK environment (which is a special case of EXTERNAL), it should
just be a matter of properly parsing the flags.
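
For reference, here is a rough, untested sketch of how the environment is
selected on the Java side today, and how EXTERNAL would presumably be wired up
once BEAM-8137 is resolved (the job endpoint and worker pool address below are
just placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class EnvironmentSketch {
  public static void main(String[] args) {
    // Today the LOOPBACK environment is picked via the portable options, e.g.:
    //   --runner=PortableRunner --jobEndpoint=localhost:8099 \
    //   --defaultEnvironmentType=LOOPBACK
    PortablePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(PortablePipelineOptions.class);

    // Once the EXTERNAL flags are handled, the equivalent would presumably be:
    //   --defaultEnvironmentType=EXTERNAL --defaultEnvironmentConfig=localhost:50000
    // pointing at a pre-started worker pool, or set programmatically:
    //   options.setDefaultEnvironmentType("EXTERNAL");
    //   options.setDefaultEnvironmentConfig("localhost:50000");

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}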

On Fri, May 15, 2020 at 9:52 AM Alexey Romanenko <[email protected]>
wrote:

> Thanks! It looks like this is exactly what I need, though mostly for the Java
> SDK.
> Do you know if anyone is working on this Jira?
>
> On 15 May 2020, at 18:01, Kyle Weaver <[email protected]> wrote:
>
> > Yes, you can start Docker containers beforehand using the worker_pool
> > option:
>
> However, it only works for Python. Java doesn't have it yet:
> https://issues.apache.org/jira/browse/BEAM-8137
>
> On Fri, May 15, 2020 at 12:00 PM Kyle Weaver <[email protected]> wrote:
>
>> > 2. Is it possible to pre-run SDK Harness containers and reuse them for
>> > every Portable Runner pipeline? I could save quite a lot of time on this for
>> > more complicated pipelines.
>>
>> Yes, you can start Docker containers beforehand using the worker_pool
>> option:
>>
>> docker run -p=50000:50000 apachebeam/python3.7_sdk --worker_pool # or
>> some other port publishing
>>
>> and then in your pipeline options set:
>>
>> --environment_type=EXTERNAL --environment_config=localhost:50000
>>
>> On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> I’m trying to optimize my pipeline runtime while using it with the Portable
>>> Runner, and I have some related questions.
>>>
>>> This is a cross-language pipeline, written with the Java SDK, which
>>> executes some Python code through the “External.of()” transform and my custom
>>> Python Expansion Service. I use a Docker-based SDK Harness for both Java and
>>> Python. In a primitive form, the pipeline looks like this:
>>>
>>>
>>> [Source (Java)] -> [MyTransform1 (Java)] -> [External (Execute Python
>>> code with Python SDK)] -> [MyTransform2 (Java SDK)]
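>>>
>>> Roughly, the Java side is composed like the rough (untested) sketch below;
>>> the URN, payload and expansion service address are placeholders, not the
>>> ones my expansion service actually uses:
>>>
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.options.PortablePipelineOptions;
>>> import org.apache.beam.sdk.transforms.Create;
>>> import org.apache.beam.sdk.transforms.External;
>>> import org.apache.beam.sdk.transforms.MapElements;
>>> import org.apache.beam.sdk.values.PCollection;
>>> import org.apache.beam.sdk.values.TypeDescriptors;
>>>
>>> public class CrossLanguageSketch {
>>>   public static void main(String[] args) {
>>>     Pipeline p = Pipeline.create(
>>>         PipelineOptionsFactory.fromArgs(args).as(PortablePipelineOptions.class));
>>>
>>>     PCollection<String> javaSide =
>>>         p.apply("Source (Java)", Create.of("a", "b", "c"))
>>>          .apply("MyTransform1 (Java)",
>>>              MapElements.into(TypeDescriptors.strings())
>>>                  .via((String s) -> s.toUpperCase()));
>>>
>>>     // Hand off to Python via the custom expansion service.
>>>     PCollection<String> pythonSide =
>>>         javaSide.apply("External (Python)",
>>>             External.<PCollection<String>, String>of(
>>>                 "my:custom:python_transform:v1",  // URN registered by the expansion service
>>>                 new byte[0],                      // serialized configuration payload
>>>                 "localhost:8097"));               // expansion service endpoint
>>>
>>>     pythonSide.apply("MyTransform2 (Java SDK)",
>>>         MapElements.into(TypeDescriptors.strings())
>>>             .via((String s) -> s + "-2"));
>>>
>>>     p.run().waitUntilFinish();
>>>   }
>>> }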
>>>
>>>
>>>
>>> While running this pipeline with the portable Spark runner, I see that quite
>>> a lot of time is spent on artifact staging (in our case, the real pipeline
>>> has quite a lot of artifacts) and on launching a Docker container for
>>> every Spark stage. So, my questions are the following:
>>>
>>> 1. Is there any internal Beam functionality to pre-stage or, at least,
>>> cache already-staged artifacts? Since the same pipeline will be executed
>>> many times in a row, there is no reason to stage the same artifacts on
>>> every run.
>>>
>>> 2. Is it possible to pre-run SDK Harness containers and reuse them for
>>> every Portable Runner pipeline? I could save quite a lot of time on this for
>>> more complicated pipelines.
>>>
>>>
>>>
>>> Well, I guess I can find some workarounds for this, but I wanted to ask
>>> first in case there is a better way to do it in Beam.
>>>
>>>
>>> Regards,
>>> Alexey
>>
>>
>
