Prashant Sharma created SPARK-32371:
---------------------------------------

             Summary: Autodetect persistently failing executor pods and fail 
the application logging the cause.
                 Key: SPARK-32371
                 URL: https://issues.apache.org/jira/browse/SPARK-32371
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 3.1.0
            Reporter: Prashant Sharma


{code:java}
[root@kyok-test-1 ~]# kubectl get po -w
NAME                                   READY   STATUS    RESTARTS   AGE
spark-shell-a3962a736bf9e775-exec-36   1/1     Running   0          5s
spark-shell-a3962a736bf9e775-exec-37   1/1     Running   0          3s
spark-shell-a3962a736bf9e775-exec-36   0/1     Error     0          5s
spark-shell-a3962a736bf9e775-exec-38   0/1     Pending   0          1s
spark-shell-a3962a736bf9e775-exec-38   0/1     Pending   0          1s
spark-shell-a3962a736bf9e775-exec-38   0/1     ContainerCreating   0          1s
spark-shell-a3962a736bf9e775-exec-36   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-36   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-37   0/1     Error               0          5s
spark-shell-a3962a736bf9e775-exec-38   1/1     Running             0          2s
spark-shell-a3962a736bf9e775-exec-39   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-39   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-39   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-37   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-37   0/1     Terminating         0          6s
spark-shell-a3962a736bf9e775-exec-38   0/1     Error               0          4s
spark-shell-a3962a736bf9e775-exec-39   1/1     Running             0          1s
spark-shell-a3962a736bf9e775-exec-40   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-40   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-40   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-38   0/1     Terminating         0          5s
spark-shell-a3962a736bf9e775-exec-38   0/1     Terminating         0          5s
spark-shell-a3962a736bf9e775-exec-39   0/1     Error               0          3s
spark-shell-a3962a736bf9e775-exec-40   1/1     Running             0          1s
spark-shell-a3962a736bf9e775-exec-41   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-41   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-41   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-39   0/1     Terminating         0          4s
spark-shell-a3962a736bf9e775-exec-39   0/1     Terminating         0          4s
spark-shell-a3962a736bf9e775-exec-41   1/1     Running             0          2s
spark-shell-a3962a736bf9e775-exec-40   0/1     Error               0          4s
spark-shell-a3962a736bf9e775-exec-42   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-42   0/1     Pending             0          0s
spark-shell-a3962a736bf9e775-exec-42   0/1     ContainerCreating   0          0s
spark-shell-a3962a736bf9e775-exec-40   0/1     Terminating         0          4s
spark-shell-a3962a736bf9e775-exec-40   0/1     Terminating         0          4s
{code}
A cascade of pods being created and terminated within 3-4 seconds results. It is difficult to inspect the logs of pods that are constantly created and deleted this quickly. Thankfully, there is an option
{code:java}
spark.kubernetes.executor.deleteOnTermination false  {code}
to turn off the automatic deletion of executor pods, which gives us an opportunity to diagnose the problem. However, it is not enabled by default, so one may have to guess what caused the problem in the previous run, work out the steps to reproduce it, and then re-run the application with the exact same setup (and this option set) to reproduce the failure.
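For reference, one way to set this option is on the {{spark-submit}} command line; this is just a minimal illustration, and the master URL and image name below are placeholders:
{code:java}
# Keep failed executor pods around for post-mortem inspection.
# spark.kubernetes.executor.deleteOnTermination is a real Spark property;
# <api-server> and the image name are placeholders for your cluster.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  ...
{code}
With the pod retained, {{kubectl logs <failed-pod>}} and {{kubectl describe po <failed-pod>}} can then be used to see why it died.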

So it would be good if we could somehow detect this situation, where pods fail as soon as they start (or fail on a particular task), capture the error that caused each pod to terminate, and relay it back to the driver to be logged.

Alternatively, if we could auto-detect this situation, we could also stop creating further executor pods and fail the application with an appropriate error, while retaining the last failed pod for the user's further investigation.
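As a rough sketch of what such auto-detection might look like on the driver side. This is hypothetical illustration code, not existing Spark code: the class name, method names, and thresholds are all invented. The idea is to count consecutive executors that die shortly after starting, and signal an abort once the streak crosses a threshold; a long-lived executor resets the streak.
{code:java}
// Hypothetical sketch: count consecutive "fast" executor failures and
// signal when the driver should stop requesting replacement pods.
public class ExecutorFailureTracker {
    private final int maxConsecutiveFailures;
    private final long fastFailureWindowMs;
    private int consecutiveFastFailures = 0;

    public ExecutorFailureTracker(int maxConsecutiveFailures, long fastFailureWindowMs) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.fastFailureWindowMs = fastFailureWindowMs;
    }

    /** Record one executor failure; returns true when the app should abort. */
    public boolean onExecutorFailed(long uptimeMs) {
        if (uptimeMs < fastFailureWindowMs) {
            consecutiveFastFailures++; // died shortly after starting
        } else {
            consecutiveFastFailures = 0; // a long-lived executor resets the streak
        }
        return consecutiveFastFailures >= maxConsecutiveFailures;
    }

    public static void main(String[] args) {
        ExecutorFailureTracker tracker = new ExecutorFailureTracker(5, 10_000L);
        boolean abort = false;
        // Five pods that each died within ~5s of starting, like the log above.
        for (int i = 0; i < 5; i++) {
            abort = tracker.onExecutorFailed(5_000L);
        }
        System.out.println("abort=" + abort);
    }
}
{code}
Wired into the Kubernetes executor lifecycle handling, crossing the threshold would stop further pod requests and fail the application, ideally logging the last captured termination reason.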

How this can be achieved has not yet been evaluated, but the feature could be useful as Kubernetes grows into a preferred choice for deploying Spark. Logging this issue for further investigation and work.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
