ahipp13 commented on issue #40957: URL: https://github.com/apache/airflow/issues/40957#issuecomment-2313239833
@dirrao When I ran the describe command, everything looked normal except for the events, which showed this:

```
Normal   Created    31m                  kubelet  Created container base
Normal   Started    31m                  kubelet  Started container base
Warning  Unhealthy  30m                  kubelet  Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "0a65c5a2dfde46a83b5b1130ba705de7d780e4fcc5d90596cf06db91e1d3a449": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4939253e0a46ef39954ab8fb8ff4b7d35578767315a0fb40c0e053eb00425671": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "a75eb22e87e74157394f80e745ade7a992050d4f6035077a1558b582b3385f00": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "9465cef7fcaeb162a70125714d1e109582282c01246f8f8ca3c9ebd4a4770b0e": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "af405bb2320967255325cf2ea82cf34e07506f4b9b0b1f35e4223682ce679613": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "777cf5acd0622ab30260db58730cffd89f0ab93be8cba8767659e5f0de03027e": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "a79560a2e579ab41c38a065fcb170f416c11ebc4e5adb9bfd273b0277d05d8da": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "eb614b6e172250c3da62d401901227660798bc4b53124f0a0dbd9b1f73c93f65": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  30m                  kubelet  Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4c4bbee388a9d95721eb1ee4cd1f2dae2d000305238cd7fdb5ce8077679f6ed7": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning  Unhealthy  77s (x704 over 30m)  kubelet  (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "48ef51c25c4b02b9e2c93fd40c738d7cb36904d85bf4fa152d46eba47f879322": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
```

@databius and anyone looking at this in the future: I talked with the team that runs the Kubernetes cluster I am on, and they said there is a bug with exec probe failures:

" So we think you may be running into this bug: https://github.com/kubernetes/kubernetes/issues/122591

For troubleshooting the exec probe failure, here's a link from Google: https://cloud.google.com/kubernetes-engine/docs/troubleshooting/container-runtime#exec-probe-timeout

Here's the important piece of that: "On containerd images, probe results returned after the declared timeoutSeconds threshold are ignored."

So what we think is happening is that the container is completing because it executes the command + args supplied to it, but this bug with the liveness exec command thinks that the container is still running and ready, hence the pod not being marked as complete.
"

I am required to have liveness and readiness probes on all pods, so I have probes on my DAG containers. What I did was increase the timeout and the timing values for them, like this:

```
livenessProbe:
  exec:
    command:
      - cat
      - /etc/os-release
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 600
readinessProbe:
  exec:
    command:
      - cat
      - /etc/os-release
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 600
```

And now we rarely see one of our pods do it... If you don't have liveness or readiness probes on your DAGs, I am not sure what would help, but this is what fixed it on our end.
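If the probe settings are generated from Python rather than written by hand, the same values could be produced by a small helper along these lines. This is only a sketch under assumptions: `exec_probe` and `container_probes` are made-up names for illustration, not part of Airflow or the Kubernetes client, and the defaults are simply the numbers from the YAML above. It uses only the standard library and builds plain dicts with the Kubernetes field names, so both probes stay in sync.

```python
import json


def exec_probe(command, initial_delay=60, period=60, timeout=600):
    """Build one exec probe as a plain dict using Kubernetes field names.

    Hypothetical helper; the defaults mirror the values from the comment.
    """
    return {
        "exec": {"command": list(command)},
        "initialDelaySeconds": initial_delay,
        "periodSeconds": period,
        "timeoutSeconds": timeout,
    }


def container_probes(command=("cat", "/etc/os-release")):
    """Return matching liveness and readiness probes for one container."""
    return {
        "livenessProbe": exec_probe(command),
        "readinessProbe": exec_probe(command),
    }


if __name__ == "__main__":
    # Print the two probe blocks as JSON for inspection or further templating.
    print(json.dumps(container_probes(), indent=2))
```

The two probes are built by separate calls, so changing one dict later does not silently change the other.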