Hello all,

I came across this question while reading Beam on Flink on Kubernetes <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.x9qy4wlfgc1g> and flink-on-k8s-operator <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/0310df76d6e2128cd5d2bc51fae4e842d370c463>. There seems to be no retry/wait logic built into either PortableRunner or ExternalEnvironmentFactory (correct me if I am wrong), which implies that:
1. The Job Server needs to be ready to accept requests before the SDK client can submit one.
2. The External Worker Pool Service needs to be ready to accept start/stop worker requests before the runner starts sending them.

This may bring some challenges on k8s, since Flink opts to use the multi-container pattern when bringing up a Beam portable pipeline. In addition, I don't find any special lifecycle management in place to guarantee the order, e.g. that the External Worker Pool Service container starts and is ready before the task manager container starts making requests.

I am wondering whether I missed something that guarantees the readiness of the dependent services, or whether we are relying on the dependent containers being much lighter weight, so that they should, most of the time, be ready before the other container starts making requests.

Best,
Ke
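
P.S. To make the concern concrete, here is a minimal sketch of the kind of client-side wait I have in mind. It is not taken from PortableRunner or ExternalEnvironmentFactory; `wait_for_port` is a hypothetical helper, and a successful TCP connect is only a rough proxy for the service actually being ready to serve requests.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Hypothetical helper: poll until host:port accepts a TCP connection.

    Returns True as soon as a connection succeeds, or False if the
    deadline passes first. A successful connect only shows that something
    is listening, not that the service behind it is fully initialized.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection as a liveness check.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A caller would run something like `wait_for_port("localhost", 50000)` before issuing the first request; the host and port here are example values only, not something I found in the documents above.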