I'm using Spark on Kubernetes to submit a Spark application to a Kubernetes cluster. Most of the time it runs smoothly, but sometimes the logs after submission show the driver pod phase changing from Running to Pending, and a second container is started in the pod even though the first container exited successfully.
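To narrow down why a second container is started, it may help to look at the pod's restart policy and the recorded state of the previous container. Below is a minimal sketch, assuming the pod JSON comes from `kubectl get pod <driver-pod-name> -o json`. The function name `summarize_restarts`, the sample values, and the container name in the sample are illustrative and not taken from my cluster. The fields read (`spec.restartPolicy`, `status.containerStatuses[].restartCount`, `lastState.terminated`) are standard Kubernetes pod fields.

```python
import json


def summarize_restarts(pod):
    """Collect the restart policy and per-container restart info from a pod dict."""
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    summary = {
        "phase": status.get("phase"),
        "restart_policy": spec.get("restartPolicy"),
        "containers": [],
    }
    for cs in status.get("containerStatuses", []):
        # lastState.terminated is only present if a previous container instance exited.
        last = (cs.get("lastState") or {}).get("terminated") or {}
        summary["containers"].append({
            "name": cs.get("name"),
            "restart_count": cs.get("restartCount", 0),
            "last_exit_code": last.get("exitCode"),
            "last_reason": last.get("reason"),
        })
    return summary


# Hypothetical sample shaped like `kubectl get pod -o json` output (values are made up).
sample_pod = {
    "spec": {"restartPolicy": "Never"},
    "status": {
        "phase": "Running",
        "containerStatuses": [{
            "name": "example-driver-container",
            "restartCount": 1,
            "lastState": {"terminated": {"exitCode": 0, "reason": "Completed"}},
        }],
    },
}

print(json.dumps(summarize_restarts(sample_pod), indent=2))
```

With real data, the dict passed in would be the result of `json.load` on the saved `kubectl get pod ... -o json` output. A non-zero `restart_count` together with a last exit code of 0 would match what I am seeing: the first container finished cleanly and another one was started anyway.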
The driver log shows nothing unusual: the first container ran successfully and exited. The second container failed because the job checks the output file path and returns an error if it already exists. What I can see from the log is that the second container starts shortly after the first one exits.

I use the standard spark-submit to Kubernetes, like:

    /opt/spark/bin/spark-submit \
      --conf spark.driver.bindAddress=10.244.12.106 \
      --deploy-mode client \
      --properties-file /opt/spark/conf/spark.properties \
      --class com.xxxx.cloud.mf.trainer.Submit \
      spark-internal \
      --ak 970f5e4c-7171-4c61-603e-f101b65a573b \
      --tracking_server_url http://10.155.197.12:8080 \
      --graph hdfs://yq01-m12-ai2b-service02.yq01.xxxx.com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json \
      --sk 56305f9f-b755-4b42-4218-592555f5c4a8 \
      --mode train

My environment:

Kubernetes version (kubectl version): v1.10.0
OS (cat /etc/os-release): CentOS-7
Kernel (uname -a): 4.17.11-1.el7.elrepo.x86_64
Spark: 2.4.0

I uploaded the driver logs, the kubectl describe pod output, and the spark-submit output:

driver-pod-logs.zip <http://apache-spark-user-list.1001560.n3.nabble.com/file/t10087/driver-pod-logs.zip>
describe-pod.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/t10087/describe-pod.log>
spark-submit-output.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/t10087/spark-submit-output.log>

Any help is appreciated. Thank you.