[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stavros Kontopoulos updated SPARK-27900:

Description: 
This affects Spark on K8s at least: the pods will run forever, which also makes it impossible for tools like the Spark Operator to report the job status back.

A Spark Pi job is running:

spark-pi-driver 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
 spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

with the following setup:
{quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "100"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0{quote}
At some point the driver hits an OOM error, but its JVM keeps running, and so the pods keep running:

19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
19/05/31 13:29:25 INFO DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 100 tasks
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
 at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
 at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
 at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
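The stack trace above matches standard JVM semantics: an uncaught OutOfMemoryError raised in a background thread (here the "dag-scheduler-event-loop" thread) tears down only that thread, while the rest of the JVM keeps running, which is why the driver pod stays in Running. A minimal sketch of that behaviour, written purely for illustration and unrelated to the Spark code base:

{code:scala}
// Minimal sketch, not Spark code: an uncaught OutOfMemoryError in a background
// thread terminates only that thread. The main thread (and any other non-daemon
// thread) keeps running, so the process, and hence the pod, never exits.
object OomDoesNotKillJvm {
  def main(args: Array[String]): Unit = {
    val t = new Thread(new Runnable {
      override def run(): Unit = throw new OutOfMemoryError("simulated OOM")
    }, "dag-scheduler-event-loop")
    t.start()
    t.join()
    // Still reached: only the background thread died.
    println("JVM is still alive after the OOM")
    Thread.sleep(Long.MaxValue) // stand-in for the driver's remaining threads
  }
}
{code}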

$ kubectl describe pod spark-pi2-driver -n spark
 Name: spark-pi2-driver
 Namespace: spark
 Priority: 0
 PriorityClassName: 
 Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
 Start Time: Fri, 31 May 2019 16:28:59 +0300
 Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
 spark-role=driver
 sparkoperator.k8s.io/app-name=spark-pi2
 sparkoperator.k8s.io/launched-by-spark-operator=true
 sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
 version=2.4.0
 Annotations: 
 Status: Running
 IP: 10.12.103.4
 Controlled By: SparkApplication/spark-pi2
 Containers:
 spark-kubernetes-driver:
 Container ID: docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
 Image: skonto/spark:k8s-3.0.0-sa
 Image ID: docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 org.apache.spark.examples.SparkPi
 spark-internal
 100
 State: Running

In the container, the processes are in _interruptible sleep_:

PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
 287 0 185 S 2344 0% 3 0% sh
 294 287 185 R 1536 0% 3 0% top
 1 0 185 S 776 0% 0 0% /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope

Liveness checks might be a workaround, but REST APIs may still be responding as long as other threads in the JVM keep running, as in this case (I checked and the Spark UI was still up).
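Another possible mitigation, separate from whatever fix lands in Spark itself, is to make the JVM exit on the first OOM so that Kubernetes sees the container fail. A sketch, assuming the images run a HotSpot JVM recent enough (8u92+) to support the -XX:+ExitOnOutOfMemoryError flag:

{code}
# Possible mitigation (sketch, not the fix proposed in this ticket): have the
# JVM terminate on the first OutOfMemoryError so the driver/executor containers
# exit and the pods stop reporting Running.
--conf spark.driver.extraJavaOptions=-XX:+ExitOnOutOfMemoryError
--conf spark.executor.extraJavaOptions=-XX:+ExitOnOutOfMemoryError
{code}

This only makes the failure visible as a container exit; it does not change Spark's own error handling.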

 

 

[jira] [Updated] (SPARK-27900) Spark will not exit due to an oom error

2019-06-04 Thread Stavros Kontopoulos (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stavros Kontopoulos updated SPARK-27900:

Summary: Spark will not exit due to an oom error  (was: Spark on K8s will 
not report container failure due to an oom error)

> Spark will not exit due to an oom error
> ---
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>