[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stavros Kontopoulos updated SPARK-27900:
----------------------------------------
    Description:
This affects Spark on K8s at least, as pods will run forever.

A spark pi job is running:

spark-pi-driver 1/1 Running 0 1h
spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:

{quote}
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "1000000"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0
{quote}

At some point the driver fails, but its process is still running, and so the pods are still running:

19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
19/05/31 13:29:25 INFO DAGScheduler: Submitting 1000000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000000 tasks
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
    at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)

Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached

$ kubectl describe pod spark-pi2-driver -n spark
Name:               spark-pi2-driver
Namespace:          spark
Priority:           0
PriorityClassName:  <none>
Node:               gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
Start Time:         Fri, 31 May 2019 16:28:59 +0300
Labels:             spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
                    spark-role=driver
                    sparkoperator.k8s.io/app-name=spark-pi2
                    sparkoperator.k8s.io/launched-by-spark-operator=true
                    sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
                    version=2.4.0
Annotations:        <none>
Status:             Running
IP:                 10.12.103.4
Controlled By:      SparkApplication/spark-pi2
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
    Image:         skonto/spark:k8s-3.0.0-sa
    Image ID:      docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      spark-internal
      1000000
    State:         Running

In the container, processes are in _interruptible sleep_:

PID  PPID USER STAT VSZ   %VSZ CPU %CPU COMMAND
15   1    185  S    2114m 7%   0   0%   /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
287  0    185  S    2344  0%   3   0%   sh
294  287  185  R    1536  0%   3   0%   top
1    0    185  S    776   0%   0   0%   /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope

Liveness checks might be a workaround, but REST APIs may still be working if threads in the JVM are still running, as in this case (I did check the Spark UI and it was there).

  was: identical, except that it lacked the opening sentence about Spark on K8s.

> Spark will not exit due to an oom error
> ---------------------------------------
>
>                 Key: SPARK-27900
>                 URL: https://issues.apache.org/jira/browse/SPARK-27900
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0, 2.4.3
>            Reporter: Stavros Kontopoulos
>            Priority: Major
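
The behaviour described in this report, where an OutOfMemoryError ends one thread while the JVM itself keeps running, can be reproduced in isolation. The following is a minimal Java sketch, not Spark's code: the class OomThreadDemo, the methods runWorker and isFatal, and the thread name are hypothetical, and the error is thrown by hand rather than by exhausting the heap.

```java
import java.util.concurrent.atomic.AtomicReference;

public class OomThreadDemo {

    // Runs body on a separate thread and returns the Throwable that
    // terminated that thread, or null if it finished normally.
    static Throwable runWorker(Runnable body) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(body, "event-loop-demo");
        // Without a handler the JVM only prints the stack trace;
        // either way, only this one thread ends.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    // Hypothetical policy: treat OutOfMemoryError as a reason to stop the process.
    static boolean isFatal(Throwable error) {
        return error instanceof OutOfMemoryError;
    }

    public static void main(String[] args) {
        Throwable error = runWorker(() -> {
            // Simulated failure; a real one would come from the allocator.
            throw new OutOfMemoryError("simulated: Java heap space");
        });
        System.out.println("worker died with: " + error);
        // The main thread is unaffected, so the process would stay up.
        System.out.println("main thread alive: " + Thread.currentThread().isAlive());
        System.out.println("would be treated as fatal: " + isFatal(error));
    }
}
```

A handler that treats such an error as fatal could call Runtime.getRuntime().halt(...) with a non-zero code so that the container exits; HotSpot's -XX:+ExitOnOutOfMemoryError flag (JDK 8u92 and later) is another option. Neither is confirmed here as the fix for this issue.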