Worker Pool Lifecycle Management on Kubernetes

Ke Wu Fri, 14 May 2021 10:51:29 -0700

Hello All,

I came across this question while reading Beam on Flink on Kubernetes 
<https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.x9qy4wlfgc1g>
 and flink-on-k8s-operator 
<https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/0310df76d6e2128cd5d2bc51fae4e842d370c463>
 and realized that there seems to be no retry/wait logic built into either 
PortableRunner or ExternalEnvironmentFactory (correct me if I am wrong), which 
implies that:


1. The Job Server needs to be ready to accept requests before the SDK client 
can submit one.
2. The External Worker Pool Service needs to be ready to accept start/stop 
worker requests before the runner starts sending them.

This may bring some challenges on k8s, since Flink opts to use the 
multi-container pattern when bringing up a Beam portable pipeline. In addition, 
I don’t find any special lifecycle management in place to guarantee the order, 
e.g. that the External Worker Pool Service container has started and is ready 
before the task manager container starts making requests. 

I am wondering whether I missed something that guarantees the readiness of the 
dependent services, or whether we are relying on the dependent containers being 
much more lightweight, so that they should, most of the time, be ready before 
the other container starts making requests. 
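
To make the kind of retry/wait logic I have in mind concrete, here is a minimal 
sketch in Python. It is only an illustration, not something that exists in Beam 
or Flink today: wait_for_port is a hypothetical helper, and the timeout and 
backoff values are assumptions. It polls a TCP endpoint with capped exponential 
backoff until the dependent service accepts connections.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, initial_delay_s=0.1, max_delay_s=2.0):
    """Block until host:port accepts TCP connections, or raise TimeoutError.

    Hypothetical helper for illustration only; not part of Beam or Flink.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        try:
            # A successful connect means the service is at least listening.
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            # Give up if the next sleep would take us past the deadline.
            if time.monotonic() + delay > deadline:
                raise TimeoutError(f"{host}:{port} not ready after {timeout_s}s")
            time.sleep(delay)
            # Double the wait between attempts, capped at max_delay_s.
            delay = min(delay * 2, max_delay_s)


# Example (address and port are assumptions, not taken from either project):
# wait_for_port("localhost", 50000)
```

A caller such as the task manager side could run a check like this once before 
sending the first start-worker request. Note that a successful TCP connect only 
shows that the port is open, not that the service is fully initialized.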

Best,
Ke

[DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes
