Hi, I haven't tried another vendor, so I can't say whether it's specific to GKE. And no, I didn't notice anything unusual in the kubelet or kube-dns processes.
Regards

On Fri, May 3, 2019 at 03:05, Li Gao <ligao...@gmail.com> wrote:
> Hi Olivier,
>
> Is this a GKE-specific issue? Have you tried other vendors? Also, on
> the kubelet nodes, did you notice any pressure on the DNS side?
>
> Li
>
>
> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster autoscaler,
>> and sometimes while running these jobs a pretty bad thing happens: the
>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>> executor pods.
>> So far so good, but the k8s "Service" associated with the driver does not
>> seem to be propagated in terms of DNS resolution, so all the executors fail
>> with a "spark-application-......cluster.svc.local does not exist" error.
>>
>> With all executors failing, the driver should fail too, but it considers
>> this a "pending" initial allocation and stays stuck forever in a loop
>> of "Initial job has not accepted any resources, please check Cluster UI".
>>
>> Has anyone else observed this kind of behaviour?
>> We had it on 2.3.1 and I upgraded to 2.4.1, but this issue still seems to
>> exist even after the "big refactoring" of the Kubernetes cluster scheduler
>> backend.
>>
>> I can work on a fix / workaround, but I'd like to check with you the
>> proper way forward:
>>
>> - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
>>   before launching the dependent pods (that could be added to the
>>   /opt/entrypoint.sh used in the Kubernetes packaging)
>> - We can add a simple step to the init container that tries the DNS
>>   resolution and fails after 60s if it did not work
>>
>> But these steps won't change the fact that the driver will stay stuck,
>> thinking we're still in the initial allocation delay.
>>
>> Thoughts?
>>
>> --
>> *Olivier Girardot*
>> o.girar...@lateral-thoughts.com
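The second workaround in the quoted message (an init-container step that keeps retrying DNS resolution of the driver service and gives up after 60s) could look roughly like the sketch below. It is only an illustration under assumptions: the script name, passing the driver service hostname as a command-line argument, and the 2s retry interval are hypothetical and not part of Spark's Kubernetes packaging. As noted in the thread, this only makes executor pods fail fast; it does not stop the driver from waiting on the initial allocation.

```python
import socket
import sys
import time


def wait_for_dns(host, timeout=60.0, interval=2.0, resolve=socket.getaddrinfo):
    """Retry resolving `host` until it succeeds or `timeout` seconds elapse.

    Returns True as soon as the name resolves, False if the deadline passes
    first. `resolve` defaults to socket.getaddrinfo and can be swapped out
    (e.g. for testing).
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            # Name not (yet) resolvable, e.g. the Service record has not
            # propagated; fall through and retry until the deadline.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval, remaining))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage from an init container:
    #   python wait_for_dns.py <driver-service-hostname>
    # A non-zero exit makes the init container (and so the executor pod)
    # fail instead of starting with an unresolvable driver address.
    sys.exit(0 if wait_for_dns(sys.argv[1]) else 1)
```

Failing the init container with a non-zero exit status surfaces the DNS problem at pod startup rather than inside the executor JVM.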