ahipp13 commented on issue #40957:
URL: https://github.com/apache/airflow/issues/40957#issuecomment-2313239833

   @dirrao When I ran the describe command, the output looked normal except for 
the events, which showed this:
   
   ```
   Normal Created 31m kubelet Created container base
   Normal Started 31m kubelet Started container base
   Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"0a65c5a2dfde46a83b5b1130ba705de7d780e4fcc5d90596cf06db91e1d3a449": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"4939253e0a46ef39954ab8fb8ff4b7d35578767315a0fb40c0e053eb00425671": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"a75eb22e87e74157394f80e745ade7a992050d4f6035077a1558b582b3385f00": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"9465cef7fcaeb162a70125714d1e109582282c01246f8f8ca3c9ebd4a4770b0e": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"af405bb2320967255325cf2ea82cf34e07506f4b9b0b1f35e4223682ce679613": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"777cf5acd0622ab30260db58730cffd89f0ab93be8cba8767659e5f0de03027e": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"a79560a2e579ab41c38a065fcb170f416c11ebc4e5adb9bfd273b0277d05d8da": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"eb614b6e172250c3da62d401901227660798bc4b53124f0a0dbd9b1f73c93f65": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = 
Unknown desc = failed to exec in container: failed to start exec 
"4c4bbee388a9d95721eb1ee4cd1f2dae2d000305238cd7fdb5ce8077679f6ed7": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   Warning Unhealthy 77s (x704 over 30m) kubelet (combined from similar 
events): Liveness probe errored: rpc error: code = Unknown desc = failed to 
exec in container: failed to start exec 
"48ef51c25c4b02b9e2c93fd40c738d7cb36904d85bf4fa152d46eba47f879322": OCI runtime 
exec failed: exec failed: cannot exec in a stopped container: unknown
   ```
   @databius and anyone finding this in the future: I talked with the team that 
runs the Kubernetes cluster I am on, and they said there is a bug with exec 
probe failures:
   
   " So we think you may be running into this bug: 
https://github.com/kubernetes/kubernetes/issues/122591
   
   For troubleshooting the exec probe failure, here's a link from Google: 
https://cloud.google.com/kubernetes-engine/docs/troubleshooting/container-runtime#exec-probe-timeout
   
   Here's the important piece of that "On containerd images, probe results 
returned after the declared timeoutSeconds threshold are ignored."
   
   So what we think is happening, is that the container is completing because 
it executes the command + args supplied to it, but this bug with liveness exec 
command thinks that the container is still running and ready, hence the pod not 
being marked as complete. "
   
   I am required to have liveness and readiness probes for all pods, so I have 
probes on my DAG containers. What I did was increase the timeout, initial delay, 
and period for both probes, like this:
   
   ```
   livenessProbe:
     exec:
       command:
         - cat
         - /etc/os-release
     initialDelaySeconds: 60
     periodSeconds: 60
     timeoutSeconds: 600
   readinessProbe:
     exec:
       command:
         - cat
         - /etc/os-release
     initialDelaySeconds: 60
     periodSeconds: 60
     timeoutSeconds: 600
   ```
   
   And now we rarely see one of our pods fail to be marked as complete. If you 
don't have liveness or readiness probes on your DAG pods, I am not sure what to 
suggest, but this is what helped fix it on our end.
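   
   For anyone unsure where these fields go, here is a rough sketch of a pod 
spec with the probes in place. It assumes the pod is defined through a pod 
template and that the container is named `base` (as in the events above). The 
pod name and image are placeholders, not values from our setup. Note that 
`timeoutSeconds` defaults to 1 second when it is omitted.
   
   ```yaml
   # Sketch only: the metadata name and image are hypothetical placeholders.
   apiVersion: v1
   kind: Pod
   metadata:
     name: placeholder-pod-name
   spec:
     containers:
       - name: base                      # container name seen in the events
         image: placeholder-image:tag    # replace with your own image
         livenessProbe:
           exec:
             command:
               - cat
               - /etc/os-release
           initialDelaySeconds: 60       # wait before the first probe
           periodSeconds: 60             # time between probes
           timeoutSeconds: 600           # raised well above the 1s default
         readinessProbe:
           exec:
             command:
               - cat
               - /etc/os-release
           initialDelaySeconds: 60
           periodSeconds: 60
           timeoutSeconds: 600
   ```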


