Hello,

I’m trying to optimize the runtime of a pipeline that I run with the Portable
Runner, and I have some related questions.

This is a cross-language pipeline, written with the Java SDK, which executes
some Python code through the “External.of()” transform and my custom Python
Expansion Service. I use the Docker-based SDK Harness for both Java and
Python. In a simplified form, the pipeline looks like this:


[Source (Java)] -> [MyTransform1 (Java)] -> [External (executes Python code
with the Python SDK)] -> [MyTransform2 (Java)]
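
For reference, here is a stripped-down sketch of what this looks like in
code. The URN, the empty payload and the expansion service address
"localhost:9097" are placeholders, and the two MapElements steps stand in
for my real Java transforms:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.External;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class XLangPipelineSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // [Source (Java)] - a stand-in source for illustration
    PCollection<String> source = p.apply("Source", Create.of("a", "b", "c"));

    // [MyTransform1 (Java)] - stand-in for my first Java transform
    PCollection<String> step1 = source.apply("MyTransform1",
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-1"));

    // [External (Python)] - expanded by my custom Python expansion service;
    // the URN, payload and endpoint here are placeholders
    PCollection<String> step2 = step1.apply("External",
        External.of("my:custom:python:transform:v1", new byte[] {}, "localhost:9097"));

    // [MyTransform2 (Java)] - stand-in for my second Java transform
    step2.apply("MyTransform2",
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s + "-2"));

    p.run().waitUntilFinish();
  }
}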



While running this pipeline with the Portable Spark Runner (configured
roughly as sketched after the questions below), I see that quite a lot of
time is spent on staging artifacts (our real pipeline has quite a lot of
them) and on launching a Docker container for every Spark stage. So, my
questions are the following:

1. Is there any internal Beam functionality to pre-stage, or at least cache,
already staged artifacts? Since the same pipeline will be executed many times
in a row, there is no reason to stage the same artifacts on every run.

2. Is it possible to pre-launch SDK Harness containers and reuse them for
every Portable Runner pipeline? I could save quite a lot of time this way for
more complicated pipelines.
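
For context, this is roughly how the pipeline is configured for the Portable
Spark Runner. The endpoint and environment type below are placeholders for my
actual setup, and a Spark job server is assumed to be running separately:

import org.apache.beam.runners.portability.PortableRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

// Point the pipeline at a separately started Spark job server;
// "localhost:8099" and "DOCKER" are placeholders for my setup.
PortablePipelineOptions options =
    PipelineOptionsFactory.create().as(PortablePipelineOptions.class);
options.setRunner(PortableRunner.class);
options.setJobEndpoint("localhost:8099");
options.setDefaultEnvironmentType("DOCKER");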



Well, I guess I can find some workarounds for this, but I wanted to ask first
in case there is a better way to do it in Beam.


Regards,
Alexey
