Note also that a worker pool should only retrieve artifacts once: https://github.com/apache/beam/pull/9398
On Fri, May 15, 2020 at 12:15 PM Luke Cwik <[email protected]> wrote:
>
> On Fri, May 15, 2020 at 9:01 AM Kyle Weaver <[email protected]> wrote:
>> > Yes, you can start docker containers beforehand using the worker_pool option:
>>
>> However, it only works for Python. Java doesn't have it yet:
>> https://issues.apache.org/jira/browse/BEAM-8137
>>
>> On Fri, May 15, 2020 at 12:00 PM Kyle Weaver <[email protected]> wrote:
>>
>>> > 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could save quite a lot of time on this for more complicated pipelines.
>>>
>>> Yes, you can start docker containers beforehand using the worker_pool option:
>>>
>>> docker run -p=50000:50000 apachebeam/python3.7_sdk --worker_pool # or some other port publishing
>>>
>>> and then in your pipeline options set:
>>>
>>> --environment_type=EXTERNAL --environment_config=localhost:50000
>>>
>>> On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I’m trying to optimize my pipeline runtime while using it with the Portable Runner, and I have some related questions.
>>>>
>>>> This is a cross-language pipeline, written in the Java SDK, which executes some Python code through the “External.of()” transform and my custom Python Expansion Service. I use a Docker-based SDK Harness for both Java and Python. In a primitive form, the pipeline looks like this:
>>>>
>>>> [Source (Java)] -> [MyTransform1 (Java)] -> [External (Execute Python code with Python SDK)] -> [MyTransform2 (Java SDK)]
>>>>
>>>> While running this pipeline with the Portable Spark Runner, I see that we spend quite a lot of time on artifact staging (in our case, the real pipeline has quite a lot of artifacts) and on launching a Docker container for every Spark stage. So, my questions are the following:
>>>>
>>>> 1. Is there any internal Beam functionality to pre-stage or, at least, cache already staged artifacts? Since the same pipeline will be executed many times in a row, there is no reason to stage the same artifacts every run.
>>>>
> Part of artifact staging is supposed to be deduplication, but sometimes minor changes to the files (e.g. a jar gets recreated with the same contents but different timestamps) lead to a different hash, breaking the deduplication.
>
> You can always embed your artifacts in your containers and try to make it so that you have zero artifacts to stage/retrieve.
>
>> 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could save quite a lot of time on this for more complicated pipelines.
>>>>
>>>> Well, I guess I can find some workarounds for that, but I wanted to ask first in case there is a better way to do that in Beam.
>>>>
>>>> Regards,
>>>> Alexey
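Luke's point about timestamps breaking deduplication is easy to demonstrate. The sketch below is not Beam's actual staging code, just a minimal stdlib illustration: it builds two in-memory zip archives ("jars") with byte-identical entry contents but different entry timestamps, and their hashes no longer match, so any content-hash-based deduplication would treat them as two distinct artifacts. Normalizing the timestamp (as reproducible-build tooling does) restores the match.

```python
import hashlib
import io
import zipfile

def build_jar(date_time):
    """Create an in-memory zip ("jar") with fixed contents but the given entry timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("Main.class", date_time=date_time)
        zf.writestr(info, b"identical bytecode")
    return buf.getvalue()

# Same contents, rebuilt an hour later -> different archive bytes -> different hash.
jar_a = build_jar((2020, 5, 15, 11, 0, 0))
jar_b = build_jar((2020, 5, 15, 12, 0, 0))
print(hashlib.sha256(jar_a).hexdigest() == hashlib.sha256(jar_b).hexdigest())  # False

# Pinning the timestamp to a fixed value makes rebuilds reproduce the same hash.
jar_c = build_jar((1980, 1, 1, 0, 0, 0))
jar_d = build_jar((1980, 1, 1, 0, 0, 0))
print(hashlib.sha256(jar_c).hexdigest() == hashlib.sha256(jar_d).hexdigest())  # True
```

This is why embedding artifacts in the container image, as Luke suggests, sidesteps the problem entirely: there is nothing left to hash and stage per run.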
