Hello, I'm trying to optimize the runtime of a pipeline that runs on the Portable Runner, and I have some related questions.
This is a cross-language pipeline, written with the Java SDK, which executes some Python code through the "External.of()" transform and my custom Python Expansion Service. I use the Docker-based SDK Harness for both Java and Python. In a simplified form, the pipeline looks like this (see the P.S. below for a minimal code sketch):

[Source (Java)] -> [MyTransform1 (Java)] -> [External (execute Python code with the Python SDK)] -> [MyTransform2 (Java SDK)]

While running this pipeline with the Portable Spark Runner, I see that we spend quite a lot of time on artifact staging (our real pipeline has quite a lot of artifacts) and on launching a Docker container for every Spark stage. So, my questions are the following:

1. Is there any internal Beam functionality to pre-stage artifacts or, at least, cache already staged ones? Since the same pipeline will be executed many times in a row, there is no reason to stage the same artifacts on every run.

2. Is it possible to pre-launch SDK Harness containers and reuse them across Portable Runner pipeline runs? This could save quite a lot of time for more complicated pipelines.

I can probably find some workarounds for this, but I wanted to ask first in case there is a better way to do it in Beam.

Regards,
Alexey
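
P.S. For concreteness, here is a minimal Java sketch of the pipeline shape above. The class name, URN, empty payload, expansion service address, and the two MapElements steps are placeholders standing in for my real transforms; also note that, depending on the Beam version, External may live in a different package than the one imported here.

import org.apache.beam.runners.core.construction.External;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CrossLanguageSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // [Source (Java)] -- placeholder source
    PCollection<String> source = p.apply("Source", Create.of("a", "b", "c"));

    // [MyTransform1 (Java)] -- placeholder for the real Java transform
    PCollection<String> preprocessed = source.apply("MyTransform1",
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s + ":pre"));

    // [External] -- expanded by the custom Python expansion service;
    // URN, payload, and address below are placeholders
    PCollection<String> fromPython = preprocessed.apply("ExternalPython",
        External.of(
            "my:custom:python_transform:v1", // URN registered with the expansion service
            new byte[0],                     // transform payload (empty for this sketch)
            "localhost:8097"));              // expansion service address

    // [MyTransform2 (Java SDK)] -- placeholder for the real Java transform
    fromPython.apply("MyTransform2",
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s + ":post"));

    p.run().waitUntilFinish();
  }
}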
