Note also that a worker pool should only retrieve artifacts once: https://github.com/apache/beam/pull/9398
On Fri, May 15, 2020 at 12:15 PM Luke Cwik <[email protected]> wrote:
>
> On Fri, May 15, 2020 at 9:01 AM Kyle Weaver <[email protected]> wrote:
>> > Yes, you can start docker containers beforehand using the worker_pool option:
>>
>> However, it only works for Python. Java doesn't have it yet:
>> https://issues.apache.org/jira/browse/BEAM-8137
>>
>> On Fri, May 15, 2020 at 12:00 PM Kyle Weaver <[email protected]> wrote:
>>
>>> > 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could save quite a lot of time on this for more complicated pipelines.
>>>
>>> Yes, you can start docker containers beforehand using the worker_pool option:
>>>
>>> docker run -p=50000:50000 apachebeam/python3.7_sdk --worker_pool # or some other port publishing
>>>
>>> and then in your pipeline options set:
>>>
>>> --environment_type=EXTERNAL --environment_config=localhost:50000
>>>
>>> On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I’m trying to optimize my pipeline runtime while using it with the Portable Runner, and I have some related questions.
>>>>
>>>> This is a cross-language pipeline, written in the Java SDK, which executes some Python code through the “External.of()” transform and my custom Python Expansion Service. I use a Docker-based SDK Harness for both Java and Python. In a primitive form, the pipeline looks like this:
>>>>
>>>> [Source (Java)] -> [MyTransform1 (Java)] -> [External (Execute Python code with Python SDK)] -> [MyTransform2 (Java SDK)]
>>>>
>>>> While running this pipeline with the Portable Spark Runner, I see that we spend quite a lot of time on artifact staging (in our case, the real pipeline has quite a lot of artifacts) and on launching a Docker container for every Spark stage. So, my questions are the following:
>>>>
>>>> 1. Is there any internal Beam functionality to pre-stage or, at least, cache already staged artifacts? Since the same pipeline will be executed many times in a row, there is no reason to stage the same artifacts every run.
>>>>
> Part of artifact staging is supposed to be deduplication, but sometimes minor changes to the files (e.g. a jar gets recreated with the same contents but different timestamps) lead to a different hash, breaking the deduplication.
>
> You can always embed your artifacts in your containers and try to make it so that you have zero artifacts to stage/retrieve.
>
>> 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could save quite a lot of time on this for more complicated pipelines.
>>>>
>>>> Well, I guess I can find some workarounds for that, but I wanted to ask first in case there is a better way to do that in Beam.
>>>>
>>>> Regards,
>>>> Alexey
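Luke's point about timestamps breaking deduplication is easy to demonstrate. The sketch below is not Beam's actual staging code, just a minimal stdlib illustration: it builds two in-memory zip archives ("jars") with byte-identical entry contents but different entry timestamps, and their hashes no longer match, so any content-hash-based deduplication would treat them as two distinct artifacts. Normalizing the timestamp (as reproducible-build tooling does) restores the match.

```python
import hashlib
import io
import zipfile

def build_jar(date_time):
    """Create an in-memory zip ("jar") with fixed contents but the given entry timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("Main.class", date_time=date_time)
        zf.writestr(info, b"identical bytecode")
    return buf.getvalue()

# Same contents, rebuilt an hour later -> different archive bytes -> different hash.
jar_a = build_jar((2020, 5, 15, 11, 0, 0))
jar_b = build_jar((2020, 5, 15, 12, 0, 0))
print(hashlib.sha256(jar_a).hexdigest() == hashlib.sha256(jar_b).hexdigest())  # False

# Pinning the timestamp to a fixed value makes rebuilds reproduce the same hash.
jar_c = build_jar((1980, 1, 1, 0, 0, 0))
jar_d = build_jar((1980, 1, 1, 0, 0, 0))
print(hashlib.sha256(jar_c).hexdigest() == hashlib.sha256(jar_d).hexdigest())  # True
```

This is why embedding artifacts in the container image, as Luke suggests, sidesteps the problem entirely: there is nothing left to hash and stage per run.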
