[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed

2022-11-04 Thread Yilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628700#comment-17628700
 ] 

Yilun Fan commented on SPARK-33349:
---

I also met this problem in Spark 3.2.1, kubernetes-client 5.4.1.

 
{code:java}
ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is 
expected if the application is shutting down.) 
io.fabric8.kubernetes.client.WatcherException: too old resource version: 
63993943 (64057995) 
at 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103){code}
I think we have to add some retry in ExecutorPodsWatchSnapshotSource. 
Especially when we close spark.kubernetes.executor.enableApiPolling,  only this 
watcher can receive executor pod status.

Just like what spark has done in the submit client.  
[https://github.com/apache/spark/pull/29533/files] 

 

> ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
> --
>
> Key: SPARK-33349
> URL: https://issues.apache.org/jira/browse/SPARK-33349
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2, 3.1.0
>Reporter: Nicola Bova
>Priority: Critical
>
> I launch my spark application with the 
> [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator]
>  with the following yaml file:
> {code:yaml}
> apiVersion: sparkoperator.k8s.io/v1beta2
> kind: SparkApplication
> metadata:
>    name: spark-kafka-streamer-test
>    namespace: kafka2hdfs
> spec: 
>    type: Scala
>    mode: cluster
>    image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0
>    imagePullPolicy: Always
>    timeToLiveSeconds: 259200
>    mainClass: path.to.my.class.KafkaStreamer
>    mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar
>    sparkVersion: 3.0.1
>    restartPolicy:
>  type: Always
>    sparkConf:
>  "spark.kafka.consumer.cache.capacity": "8192"
>  "spark.kubernetes.memoryOverheadFactor": "0.3"
>    deps:
>    jars:
>  - my
>  - jar
>  - list
>    hadoopConfigMap: hdfs-config
>    driver:
>  cores: 4
>  memory: 12g
>  labels:
>    version: 3.0.1
>  serviceAccount: default
>  javaOptions: 
> "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
>   executor:
>  instances: 4
>     cores: 4
>     memory: 16g
>     labels:
>   version: 3.0.1
>     javaOptions: 
> "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
> {code}
>  I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart 
> the watcher when we receive a version changed from 
> k8s"|https://github.com/apache/spark/pull/29533] patch.
> This is the driver log:
> {code}
> 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> ... // my app log, it's a structured streaming app reading from kafka and 
> writing to hdfs
> 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource 
> version: 1574101276 (1574213896)
>  at 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
>  at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
>  at 
> okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
>  at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
>  at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
>  at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
>  at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
>  at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> The error above appears after roughly 50 minutes.
> After the exception above, no more logs are produced and the app hangs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39546) Respect port defininitions on K8S pod templates for both driver and executor

2022-09-05 Thread Yilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600346#comment-17600346
 ] 

Yilun Fan commented on SPARK-39546:
---

I made a PR, I think it can resolve this issue.

> Respect port defininitions on K8S pod templates for both driver and executor
> 
>
> Key: SPARK-39546
> URL: https://issues.apache.org/jira/browse/SPARK-39546
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Oliver Koeth
>Priority: Minor
>
> *Description:*
> Spark on K8S allows to open additional ports for custom purposes on the 
> driver pod via the pod template, but ignores the port specification in the 
> executor pod template. Port specifications from the pod template should be 
> preserved (and extended) for both drivers and executors.
> *Scenario:*
> I want to run functionality in the executor that exposes data on an 
> additional port. In my case, this is monitoring data exposed by Spark's JMX 
> metrics sink via the JMX prometheus exporter java agent 
> https://github.com/prometheus/jmx_exporter -- the java agent opens an extra 
> port inside the container, but for prometheus to detect and scrape the port, 
> it must be exposed in the K8S pod resource.
> (More background if desired: This seems to be the "classic" Spark 2 way to 
> expose prometheus metrics. Spark 3 introduced a native equivalent servlet for 
> the driver, but for the executor, only a rather limited set of metrics is 
> forwarded via the driver, and that also follows a completely different naming 
> scheme. So the JMX + exporter approach still turns out to be more useful for 
> me, even in Spark 3)
> Expected behavior:
> I add the following to my pod template to expose the extra port opened by the 
> JMX exporter java agent
> spec:
>   containers:
>   - ...
>     ports:
>     - containerPort: 8090
>       name: jmx-prometheus
>       protocol: TCP
> Observed behavior:
> The port is exposed for driver pods but not for executor pods
> *Corresponding code:*
> driver pod creation just adds ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala]
>  (currently line 115)
> val driverContainer = new ContainerBuilder(pod.container)
> ...
>   .addNewPort()
> ...
>   .addNewPort()
> while executor pod creation replaces the ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala]
>  (currently line 211)
> val executorContainer = new ContainerBuilder(pod.container)
> ...
>   .withPorts(requiredPorts.asJava)
> The current handling is incosistent and unnecessarily limiting. It seems that 
> the executor creation could/should just as well preserve pods from the 
> template and add extra required ports.
> *Workaround:*
> It is possible to work around this limitation by adding a full sidecar 
> container to the executor pod spec which declares the port. Sidecar 
> containers are left unchanged by pod template handling.
> As all containers in a pod share the same network, it does not matter which 
> container actually declares to expose the port.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2022-02-10 Thread Yilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490154#comment-17490154
 ] 

Yilun Fan commented on SPARK-26365:
---

Can anyone review and merge this patch? We had the same problem with exit code.

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
>Reporter: Oscar Bonilla
>Priority: Major
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a kubernetes cluster, if the Spark 
> applications fails (returns exit code = 1 for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org