[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829337#comment-16829337 ]
Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM:
-------------------------------------------------------------

Hi [~Udbhav Agrawal], the driver log shows nothing special: the first container ran successfully and exited. The second container failed because the application checks the output file path and returns an error if it already exists. What I can see from the log is that the second container starts shortly after the first one exits. I attached the driver log files. Thank you.

Below is the output of kubectl describe pod; it only contains the second container ID:

Name:           com-xxxx-cloud-mf-trainer-submit-1555666719424-driver
Namespace:      default
Node:           yq01-m12-ai2b-service02.yq01.xxxx.com/10.155.197.12
Start Time:     Fri, 19 Apr 2019 17:38:40 +0800
Labels:         DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
                spark-app-selector=spark-4343fe80572c4240bd933246efd975da
                spark-role=driver
Annotations:    <none>
Status:         Failed
IP:             10.244.12.106
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
    Image:         10.96.0.100:5000/spark:spark-2.4.0
    Image ID:      docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file /opt/spark/conf/spark.properties
      --class com.xxxx.cloud.mf.trainer.Submit
      spark-internal
      --ak 970f5e4c-7171-4c61-603e-f101b65a573b
      --tracking_server_url http://10.155.197.12:8080
      --graph hdfs://yq01-m12-ai2b-service02.yq01.xxxx.com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json
      --sk 56305f9f-b755-4b42-4218-592555f5c4a8
      --mode train
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 19 Apr 2019 17:39:57 +0800
      Finished:     Fri, 19 Apr 2019 17:40:48 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2432Mi
    Requests:
      cpu:     1
      memory:  2432Mi
    Environment:
      xxxx_KUBERNETES_LOG_ENDPOINT:         yq01-m12-ai2b-service02.yq01.xxxx.com:8070
      xxxx_KUBERNETES_LOG_FLUSH_FREQUENCY:  10s
      xxxx_KUBERNETES_LOG_PATH:             /project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver
      SPARK_DRIVER_BIND_ADDRESS:            (v1:status.podIP)
      SPARK_LOCAL_DIRS:                     /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f
      SPARK_CONF_DIR:                       /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume (rw)
      /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  spark-local-dir-1:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  spark-conf-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      com-xxxx-cloud-mf-trainer-submit-1555666719424-driver-conf-map
    Optional:  false
  default-token-q7drh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-q7drh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
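The failure mode described above (a second driver container refusing to reuse the output path created by the first) can be sketched as follows. This is an illustrative guess at the kind of check involved, not the actual trainer code; the function name and paths are hypothetical:

```python
import os
import sys
import tempfile


def ensure_fresh_output(path):
    """Refuse to run if the output path already exists.

    Hypothetical sketch of the check described above: the first driver
    container creates the output path and exits successfully; when the pod
    unexpectedly starts a second container, this check fails and the driver
    exits with code 1, so the pod phase becomes Failed.
    """
    if os.path.exists(path):
        print("output path already exists: %s" % path, file=sys.stderr)
        return 1
    os.makedirs(path)
    return 0


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "model")
    print(ensure_fresh_output(out))  # first run succeeds: 0
    print(ensure_fresh_output(out))  # second run fails: 1
```

Under this assumption the application itself is behaving correctly; the bug is that Kubernetes starts a second container at all.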
> spark on kubernetes driver pod phase changed from running to pending and
> starts another container in pod
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27574
>                 URL: https://issues.apache.org/jira/browse/SPARK-27574
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.0
>         Environment: Kubernetes version (use kubectl version): v1.10.0
>                      OS (e.g. cat /etc/os-release): CentOS-7
>                      Kernel (e.g. uname -a): 4.17.11-1.el7.elrepo.x86_64
>                      Spark-2.4.0
>            Reporter: Will Zhang
>            Priority: Major
>         Attachments: driver-pod-logs.zip
>
> I'm using spark-on-kubernetes to submit a Spark app to Kubernetes.
> Most of the time it runs smoothly, but sometimes the logs after submitting
> show the driver pod phase changing from Running back to Pending, and a
> second container starting in the pod even though the first container
> exited successfully.
> I use the standard spark-submit to Kubernetes, e.g.:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster --class xxx ...
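The watcher log that follows shows the pod phase going Pending → Running → Pending → Running → Failed. As a small illustration (not Spark or Kubernetes client code), a backwards step like that Running → Pending transition can be flagged mechanically from the reported phase sequence:

```python
def find_phase_regressions(phases):
    """Return (previous, current) pairs where a pod phase moved backwards.

    A pod normally progresses Pending -> Running -> Succeeded/Failed; a
    Running -> Pending step, as in the log below, indicates the driver
    container was restarted. Illustrative sketch only.
    """
    order = {"Pending": 0, "Running": 1, "Succeeded": 2, "Failed": 2}
    return [
        (prev, cur)
        for prev, cur in zip(phases, phases[1:])
        if order[cur] < order[prev]
    ]


if __name__ == "__main__":
    # Phase sequence taken from the watcher log below.
    seq = ["Pending", "Pending", "Pending", "Running",
           "Pending", "Running", "Failed"]
    print(find_phase_regressions(seq))  # [('Running', 'Pending')]
```

The single regression it reports corresponds to the unexpected container restart being described in this issue.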
>
> The log is below:
>
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: N/A
>   start time: N/A
>   container images: N/A
>   phase: Pending
>   status: []
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: yq01-m12-ai2b-service02.yq01.xxxx.com
>   start time: N/A
>   container images: N/A
>   phase: Pending
>   status: []
> 2019-04-25 13:37:01 INFO Client:54 - Waiting for application com.xxxx.cloud.mf.trainer.Submit to finish...
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: yq01-m12-ai2b-service02.yq01.xxxx.com
>   start time: 2019-04-25T13:37:01Z
>   container images: 10.96.0.100:5000/spark:spark-2.4.0
>   phase: Pending
>   status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:04 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: yq01-m12-ai2b-service02.yq01.xxxx.com
>   start time: 2019-04-25T13:37:01Z
>   container images: 10.96.0.100:5000/spark:spark-2.4.0
>   phase: Running
>   status: [ContainerStatus(containerID=docker://120dbf8cb11cf8ef9b26cff3354e096a979beb35279de34be64b3c06e896b991, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2019-04-25T13:37:03Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:27 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: yq01-m12-ai2b-service02.yq01.xxxx.com
>   start time: 2019-04-25T13:37:01Z
>   container images: 10.96.0.100:5000/spark:spark-2.4.0
>   phase: Pending
>   status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:29 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: yq01-m12-ai2b-service02.yq01.xxxx.com
>   start time: 2019-04-25T13:37:01Z
>   container images: 10.96.0.100:5000/spark:spark-2.4.0
>   phase: Running
>   status: [ContainerStatus(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2019-04-25T13:37:28Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:52 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
>   pod name: com-xxxx-cloud-mf-trainer-submit-1556199419847-driver
>   namespace: default
>   labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> driver
>   pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
>   creation time: 2019-04-25T13:37:01Z
>   service account name: default
>   volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
>   node name: yq01-m12-ai2b-service02.yq01.xxxx.com
>   start time: 2019-04-25T13:37:01Z
>   container images: 10.96.0.100:5000/spark:spark-2.4.0
>   phase: Failed
>   status: [ContainerStatus(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://43753f5336c41eaec8cdcdfd271b34ac465de331aad2d612fe0c7ad1c3706aac, exitCode=1, finishedAt=Time(time=2019-04-25T13:37:48Z, additionalProperties={}), message=null, reason=Error, signal=null, startedAt=Time(time=2019-04-25T13:37:28Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:52 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
>   Container name: spark-kubernetes-driver
>   Container image: 10.96.0.100:5000/spark:spark-2.4.0
>   Container state: Terminated
>   Exit code: 1
> 2019-04-25 13:37:52 INFO Client:54 - Application com.xxxx.cloud.mf.trainer.Submit finished.
> 2019-04-25 13:37:52 INFO ShutdownHookManager:54 - Shutdown hook called
> 2019-04-25 13:37:52 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-84727675-4ced-491c-8993-22e8f3539bf3
> bash-4.4#
>
> Please let me know if I missed anything.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org