On Fri, May 15, 2020 at 9:01 AM Kyle Weaver <[email protected]> wrote:

> > Yes, you can start docker containers beforehand using the worker_pool
> option:
>
> However, it only works for Python. Java doesn't have it yet:
> https://issues.apache.org/jira/browse/BEAM-8137
>
> On Fri, May 15, 2020 at 12:00 PM Kyle Weaver <[email protected]> wrote:
>
>> > 2. Is it possible to pre-run SDK Harness containers and reuse them for
>> every Portable Runner pipeline? I could save quite a lot of time on this
>> for more complicated pipelines.
>>
>> Yes, you can start docker containers beforehand using the worker_pool
>> option:
>>
>> docker run -p=50000:50000 apachebeam/python3.7_sdk --worker_pool # or
>> some other port publishing
>>
>> and then in your pipeline options set:
>>
>> --environment_type=EXTERNAL --environment_config=localhost:50000
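Since the worker pool is started out of band, it can help to fail fast if the endpoint passed via --environment_config is not actually listening. A minimal sketch (this helper is hypothetical, not part of Beam; the endpoint matches the example above):

```python
# Hedged helper (hypothetical, not part of Beam): check that a pre-started
# worker pool is listening on the --environment_config endpoint before
# submitting a pipeline, instead of finding out at job time.
import socket

def endpoint_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to 'host:port' succeeds."""
    host, _, port = endpoint.partition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. call endpoint_reachable("localhost:50000") before submitting
```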
>>
>> On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> I’m trying to optimize my pipeline runtime while using it with Portable
>>> Runner and I have some related questions.
>>>
>>> This is a cross-language pipeline, written with the Java SDK, which
>>> executes some Python code through the “External.of()” transform and my
>>> custom Python Expansion Service. I use a Docker-based SDK Harness for
>>> Java and Python. In a primitive form, the pipeline looks like this:
>>>
>>>
>>> [Source (Java)] -> [MyTransform1 (Java)] ->  [External (Execute Python
>>> code with Python SDK) ] - >  [MyTransform2 (Java SDK)]
>>>
>>>
>>>
>>> While running this pipeline with the Portable Spark Runner, I see that
>>> we spend quite a lot of time on artifact staging (our real pipeline has
>>> quite a lot of artifacts) and on launching a Docker container for every
>>> Spark stage. So, my questions are the following:
>>>
>>> 1. Is there any internal Beam functionality to pre-stage or, at least,
>>> cache already staged artifacts? Since the same pipeline will be executed
>>> many times in a row, there is no reason to stage the same artifacts on
>>> every run.
>>>
>>>
As part of artifact staging there is supposed to be deduplication, but
sometimes minor changes in the files break it: for example, the jar gets
recreated with the same contents but different timestamps, which leads to a
different hash and defeats the deduplication.
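The effect is easy to reproduce with plain zip archives (a sketch, assuming the dedup key is a content hash of the artifact file): byte-identical entries with different entry timestamps yield different archive bytes, and therefore different hashes.

```python
# Sketch: two archives with byte-identical entries but different entry
# timestamps hash differently, which is enough to defeat hash-based dedup.
import hashlib
import io
import zipfile

def build_archive(date_time):
    """Build an in-memory zip with one fixed entry and the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("Main.class", date_time=date_time)
        zf.writestr(info, b"identical contents")
    return buf.getvalue()

a = build_archive((2020, 5, 15, 9, 0, 0))
b = build_archive((2020, 5, 15, 9, 0, 2))
print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False
```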

You can always embed your artifacts in your containers and try to make it
so that you have zero artifacts to stage/retrieve.



>>> 2. Is it possible to pre-run SDK Harness containers and reuse them for
>>> every Portable Runner pipeline? I could save quite a lot of time on this
>>> for more complicated pipelines.
>>>
>>>
>>>
>>> Well, I guess I can find some workarounds for this, but I wanted to ask
>>> first in case there is a better way to do it in Beam.
>>>
>>>
>>> Regards,
>>> Alexey
>>
>>
