[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system
[ https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416846#comment-17416846 ]

Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:25 PM:
-----------------------------------------------------------------------

[~xuzhoyin] Sorry for the late reply. In the past the local scheme meant local in the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. I am not sure of the status now. Btw, regarding the S3 prefix, if I remember correctly the idea was not to download files from a remote location locally and then store them again somewhere else, e.g. S3; this was intended for local files only. Feel free to add any other capabilities.

was (Author: skonto):
[~xuzhoyin] Sorry for the late reply. In the past the local scheme meant local in the container, i.e. it had a different meaning (https://github.com/apache/spark/pull/21378), so this was intentional. I am not sure of the status now. Btw, if I remember correctly the idea was not to download files from a remote location locally and then store them again, e.g. to S3.

> Support application dependencies in submission client's local file system
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23153
>                 URL: https://issues.apache.org/jira/browse/SPARK-23153
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Yinan Li
>            Assignee: Stavros Kontopoulos
>            Priority: Major
>             Fix For: 3.0.0
>
> Currently local dependencies are not supported with Spark on K8S, i.e. if the
> user has code or dependencies only on the client where they run
> {{spark-submit}}, then the current implementation has no way to make those
> visible to the Spark application running inside the K8S pods that get
> launched. This limits users to only running applications where the code and
> dependencies are either baked into the Docker images used or where those are
> available via some external and globally accessible file system, e.g. HDFS,
> which are not viable options for many users and environments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247759#comment-17247759 ]

Stavros Kontopoulos commented on SPARK-33737:
---------------------------------------------

In addition, the current implementation has been out for a long time and is stable, so we need to be sure that any updates will not cause issues. I can work on a PR and see how things integrate.

> Use an Informer+Lister API in the ExecutorPodWatcher
> ----------------------------------------------------
>
>                 Key: SPARK-33737
>                 URL: https://issues.apache.org/jira/browse/SPARK-33737
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Stavros Kontopoulos
>            Priority: Major
>
> The Kubernetes backend uses the Fabric8 client and a
> [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42]
> to monitor the K8s API server for pod changes. Every watcher keeps a
> websocket connection open and has no caching mechanism at that point. Caching
> at the Spark K8s resource manager does exist in other areas where we hit the
> API server for Pod CRUD ops, like
> [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49].
> In an environment where many connections are kept open due to large-scale
> jobs, this could be problematic and impose a lot of load on the API server.
> Many long-running jobs, e.g. streaming jobs, should not create enough pod
> changes to justify a continuous watching mechanism.
> Recent Fabric8 client versions have implemented a SharedInformer API plus a
> Lister; an example can be found
> [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37].
> This new API follows the implementation of the official Java K8s client and
> its Go counterpart, and it is backed by a caching mechanism which is re-synced
> after a configurable period to avoid hitting the API server all the time.
> There is also a lister that keeps track of the current status of resources.
> Using such a mechanism is commonplace when implementing a K8s controller.
> The suggestion is to update the client to v4.13.0 (which has all the updates
> with respect to that API) and use the informer+lister API where applicable.
> I think the lister could also replace part of the snapshotting/notification
> mechanism.
> /cc [~dongjoon] [~eje] [~holden] WDYT?
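The informer+lister pattern proposed in the ticket can be sketched roughly as follows. This is a hypothetical illustration against the Fabric8 4.13.x API shape, not code from the Spark repository; the class name, the "spark-namespace" value, and the resync period are made-up placeholders, and running it requires access to a live Kubernetes cluster.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.SharedInformerFactory;
import io.fabric8.kubernetes.client.informers.cache.Lister;

public class ExecutorPodInformerSketch {
  // Illustrative resync period; the real value would be configurable.
  static final long RESYNC_PERIOD_MS = 30_000L;

  public static void main(String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      SharedInformerFactory factory = client.informers();

      // The informer keeps a local, periodically re-synced cache of Pods,
      // instead of a bare watch holding one websocket per watcher.
      SharedIndexInformer<Pod> informer =
          factory.sharedIndexInformerFor(Pod.class, PodList.class, RESYNC_PERIOD_MS);

      informer.addEventHandler(new ResourceEventHandler<Pod>() {
        @Override public void onAdd(Pod pod) { /* pod created */ }
        @Override public void onUpdate(Pod oldPod, Pod newPod) { /* pod changed */ }
        @Override public void onDelete(Pod pod, boolean unknownFinalState) { /* pod gone */ }
      });
      factory.startAllRegisteredInformers();

      // The lister answers reads from the informer's cache, not the API server,
      // which is what could replace part of the snapshotting mechanism.
      Lister<Pod> podLister = new Lister<>(informer.getIndexer(), "spark-namespace");
      podLister.list().forEach(p -> System.out.println(p.getMetadata().getName()));
    }
  }
}
```

The key load-reduction point is that event handlers fire from the shared cache's update stream, and listing executor pods becomes a local cache read rather than a fresh GET against the API server.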
[jira] [Updated] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-33737: Description: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the Spark K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic and impose a lot of load against the API server. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is re-synced after a configurable period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. Using such a mechanism is common place when implementing a K8s controller. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. 
/cc [~dongjoon] [~eje] [~holden] WDYTH? was: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic and impose a lot of load against the API server. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is re-synced after a configurable period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. Using such a mechanism is common place when implementing a K8s controller. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden] WDYTH? 
> Use an Informer+Lister API in the ExecutorPodWatcher > > > Key: SPARK-33737 > URL: https://issues.apache.org/jira/browse/SPARK-33737 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.2 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes backend uses Fabric8 client and a > [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] > to monitor the K8s Api server for pod changes. Every watcher keeps a > websocket connection open and has no caching mechanism at that part. Caching > at the Spark K8s resource manager exists in other areas where we are hitting > the Api Server for Pod CRUD ops like >
[jira] [Updated] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-33737: Description: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic and impose a lot of load against the API server. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is re-synced after a configurable period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. Using such a mechanism is common place when implementing a K8s controller. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. 
/cc [~dongjoon] [~eje] [~holden] WDYTH? was: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic and impose a lot of load against the API server. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. Using such a mechanism is common place when implementing a K8s controller. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden] WDYTH? 
> Use an Informer+Lister API in the ExecutorPodWatcher > > > Key: SPARK-33737 > URL: https://issues.apache.org/jira/browse/SPARK-33737 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.2 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes backend uses Fabric8 client and a > [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] > to monitor the K8s Api server for pod changes. Every watcher keeps a > websocket connection open and has no caching mechanism at that part. Caching > at the K8s resource manager exists in other areas where we are hitting the > Api Server for Pod CRUD ops like >
[jira] [Updated] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-33737: Description: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic and impose a lot of load against the API server. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. Using such a mechanism is common place when implementing a K8s controller. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. 
/cc [~dongjoon] [~eje] [~holden] WDYTH? was: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden] WDYTH? 
> Use an Informer+Lister API in the ExecutorPodWatcher > > > Key: SPARK-33737 > URL: https://issues.apache.org/jira/browse/SPARK-33737 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.2 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes backend uses Fabric8 client and a > [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] > to monitor the K8s Api server for pod changes. Every watcher keeps a > websocket connection open and has no caching mechanism at that part. Caching > at the K8s resource manager exists in other areas where we are hitting the > Api Server for Pod CRUD ops like >
[jira] [Updated] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-33737: Description: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching at the K8s resource manager exists in other areas where we are hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden] WDYTH? 
was: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching exists in other areas where hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden] WDYTH? 
> Use an Informer+Lister API in the ExecutorPodWatcher > > > Key: SPARK-33737 > URL: https://issues.apache.org/jira/browse/SPARK-33737 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.2 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes backend uses Fabric8 client and a > [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] > to monitor the K8s Api server for pod changes. Every watcher keeps a > websocket connection open and has no caching mechanism at that part. Caching > at the K8s resource manager exists in other areas where we are hitting the > Api Server for Pod CRUD ops like > [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. > In an env where a lot of connections are kept due to large scale jobs this > could be problematic. > A lot of long running jobs should not create pod
[jira] [Updated] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
[ https://issues.apache.org/jira/browse/SPARK-33737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-33737: Description: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching exists in other areas where hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden] WDYTH? 
was: Kubernetes backend uses Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s Api server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that part. Caching exists in other areas where hitting the Api Server for Pod CRUD ops like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an env where a lot of connections are kept due to large scale jobs this could be problematic. A lot of long running jobs should not create pod changes eg. Streaming jobs to justify a continuous watching mechanism. Latest Frabric8 client versions have implemented a SharedInformer API+Lister, an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official java K8s client and the go counterpart and it is backed up by a caching mechanism which is resynced after a configurble period to avoid hitting the API server all the time. There is also a lister that keeps track of current status of resources. The suggestion is to update to v4.13.0 the client (has all updates in wrt that API) and use the informer+lister API where applicable. I think the lister could also replace part of the snapshotting/notification mechanism. /cc [~dongjoon] [~eje] [~holden]] WDYTH? 
> Use an Informer+Lister API in the ExecutorPodWatcher > > > Key: SPARK-33737 > URL: https://issues.apache.org/jira/browse/SPARK-33737 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.2 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes backend uses Fabric8 client and a > [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] > to monitor the K8s Api server for pod changes. Every watcher keeps a > websocket connection open and has no caching mechanism at that part. Caching > exists in other areas where hitting the Api Server for Pod CRUD ops like > [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. > In an env where a lot of connections are kept due to large scale jobs this > could be problematic. > A lot of long running jobs should not create pod changes eg. Streaming jobs > to justify a continuous watching mechanism.
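The informer+lister pattern proposed above can be sketched without the Fabric8 dependency: event callbacks (mirroring an informer event handler's onAdd/onUpdate/onDelete) keep a local cache in sync, and a "lister" serves reads from that cache rather than from the API server. This is an illustrative sketch only; the PodCache class and its method names are hypothetical and not part of the Fabric8 or Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the informer+lister caching pattern.
// A real informer (e.g. Fabric8's SharedInformer) would feed these
// callbacks from a single watch and periodically resync the cache
// against the API server.
class PodCache {
    private final Map<String, String> phaseByPodName = new ConcurrentHashMap<>();

    // Event-handler side: callbacks keep the local cache in sync.
    void onAdd(String podName, String phase)    { phaseByPodName.put(podName, phase); }
    void onUpdate(String podName, String phase) { phaseByPodName.put(podName, phase); }
    void onDelete(String podName)               { phaseByPodName.remove(podName); }

    // Lister side: reads are served from the cache, not the API server.
    List<String> listPodNames() { return new ArrayList<>(phaseByPodName.keySet()); }
    String phaseOf(String podName) { return phaseByPodName.get(podName); }
}
```

The point of the suggestion is visible here: status queries like listPodNames() never touch the API server, so only the informer's single, resynced connection does, instead of one open websocket watch per driver.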
[jira] [Created] (SPARK-33737) Use an Informer+Lister API in the ExecutorPodWatcher
Stavros Kontopoulos created SPARK-33737: --- Summary: Use an Informer+Lister API in the ExecutorPodWatcher Key: SPARK-33737 URL: https://issues.apache.org/jira/browse/SPARK-33737 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.2 Reporter: Stavros Kontopoulos The Kubernetes backend uses the Fabric8 client and a [watch|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L42] to monitor the K8s API server for pod changes. Every watcher keeps a websocket connection open and has no caching mechanism at that layer. Caching exists in other areas that hit the API server for pod CRUD ops, like [here|https://github.com/apache/spark/blob/b8ccd755244d3cd8a81a9f4a1eafa2a4e48759d2/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala#L49]. In an environment where many connections are kept open due to large-scale jobs, this could be problematic. Many long-running jobs (e.g. streaming jobs) should not create enough pod changes to justify a continuous watching mechanism. The latest Fabric8 client versions have implemented a SharedInformer API plus a Lister; an example can be found [here|https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/InformerExample.java#L37]. This new API follows the implementation of the official Java K8s client and its Go counterpart, and it is backed by a caching mechanism which is resynced after a configurable period. There is also a lister that keeps track of the current status of resources. The suggestion is to update the client to v4.13.0 (which has all the updates for that API) and use the informer+lister API where applicable. I think the lister could replace part of the snapshotting/notification mechanism. /cc [~erikerlandson] [~dongjoon] WDYT? 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-27936) Support local dependency uploading from --py-files
[ https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27936: Comment: was deleted (was: Will create a PR shortly.) > Support local dependency uploading from --py-files > -- > > Key: SPARK-27936 > URL: https://issues.apache.org/jira/browse/SPARK-27936 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > Support python dependency uploads, as in SPARK-23153
[jira] [Commented] (SPARK-27936) Support local dependency uploading from --py-files
[ https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934227#comment-16934227 ] Stavros Kontopoulos commented on SPARK-27936: - Will create a PR shortly. > Support local dependency uploading from --py-files > -- > > Key: SPARK-27936 > URL: https://issues.apache.org/jira/browse/SPARK-27936 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > Support python dependency uploads, as in SPARK-23153
[jira] [Comment Edited] (SPARK-28953) Integration tests fail due to malformed URL
[ https://issues.apache.org/jira/browse/SPARK-28953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922547#comment-16922547 ] Stavros Kontopoulos edited comment on SPARK-28953 at 9/4/19 3:43 PM: - [~srowen] thanks, I will have a look; I need to test whether it removes the extra text. The question is why the command run via Java differs from the bash one. [~holden.ka...@gmail.com] fyi. was (Author: skonto): [~srowen] thanks I will have a look need to test if ti removes the extra text, the question is why the command via java is different compared to the bash one [~holden.ka...@gmail.com] fyi. > Integration tests fail due to malformed URL > --- > > Key: SPARK-28953 > URL: https://issues.apache.org/jira/browse/SPARK-28953 > Project: Spark > Issue Type: Bug > Components: jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Tests failed on Ubuntu, verified on two different machines: > KubernetesSuite: > - Launcher client dependencies *** FAILED *** > java.net.MalformedURLException: no protocol: * http://172.31.46.91:30706 > at java.net.URL.(URL.java:600) > at java.net.URL.(URL.java:497) > at java.net.URL.(URL.java:446) > at > org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) > Type in expressions to have them evaluated. > Type :help for more information. 
> > scala> val pb = new ProcessBuilder().command("bash", "-c", "minikube service > ceph-nano-s3 -n spark --url") > pb: ProcessBuilder = java.lang.ProcessBuilder@46092840 > scala> pb.redirectErrorStream(true) > res0: ProcessBuilder = java.lang.ProcessBuilder@46092840 > scala> val proc = pb.start() > proc: Process = java.lang.UNIXProcess@5e9650d3 > scala> val r = org.apache.commons.io.IOUtils.toString(proc.getInputStream()) > r: String = > "* http://172.31.46.91:30706 > " > Although (no asterisk): > $ minikube service ceph-nano-s3 -n spark --url > [http://172.31.46.91:30706|http://172.31.46.91:30706/] > > This is weird because it fails at the java level, where does the asterisk > come from? > $ minikube version > minikube version: v1.3.1 > commit: ca60a424ce69a4d79f502650199ca2b52f29e631 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
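The MalformedURLException quoted above comes from passing the raw captured output ("* http://172.31.46.91:30706") straight to java.net.URL, which rejects the leading asterisk as "no protocol". A defensive fix is to strip anything before the URL scheme before parsing. The UrlSanitizer class below is a hypothetical sketch of that workaround, not code from the test suite:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper: drop any prefix before the scheme (e.g. the
// stray "* " that minikube's output carried here) and surrounding
// whitespace, so the cleaned string parses as a URL.
class UrlSanitizer {
    static String sanitizeUrl(String raw) {
        String trimmed = raw.trim();
        int schemeStart = trimmed.indexOf("http");
        return schemeStart >= 0 ? trimmed.substring(schemeStart) : trimmed;
    }

    static URL parseServiceUrl(String raw) throws MalformedURLException {
        // Would throw on the raw "* http://..." string; succeeds once sanitized.
        return new URL(sanitizeUrl(raw));
    }
}
```

This only hides the symptom; the underlying question in the report, why the output captured via ProcessBuilder carries the asterisk while the interactive bash invocation does not, still stands.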
[jira] [Commented] (SPARK-28953) Integration tests fail due to malformed URL
[ https://issues.apache.org/jira/browse/SPARK-28953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922292#comment-16922292 ] Stavros Kontopoulos commented on SPARK-28953: - [~shaneknapp] I can attach the build log, but this fails in the internal CI and on two other machines: my local machine and an Ubuntu AWS instance. What version of minikube do we use on the test machines? Btw, this is failing constantly.
[jira] [Comment Edited] (SPARK-28895) Spark client process is unable to upload jars to hdfs while using ConfigMap not HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-28895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921394#comment-16921394 ] Stavros Kontopoulos edited comment on SPARK-28895 at 9/3/19 1:24 PM: - I changed the version to Spark 3.0.0 as this does not exist in 2.4.3. was (Author: skonto): I changed the version to Spark 3.0.0 as this does not exist in 2.4.3. I havent used spark.kubernetes.hadoop.configMapName before so it is good that you have reported this. We can enhance the feature. I would mark this as a Improvement btw. > Spark client process is unable to upload jars to hdfs while using ConfigMap > not HADOOP_CONF_DIR > --- > > Key: SPARK-28895 > URL: https://issues.apache.org/jira/browse/SPARK-28895 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > The *BasicDriverFeatureStep* for Spark on Kubernetes will upload the > files/jars specified by --files/–jars to a hadoop compatible file system > configured by spark.kubernetes.file.upload.path. While using HADOOP_CONF_DIR, > the spark-submit process can recognize the file system, but when using > spark.kubernetes.hadoop.configMapName which only will be mount on the Pods > not applied back to our client process. 
> > ||Heading 1||Heading 2|| > |HADOOP_CONF_DIR=/path/to/etc/hadoop|OK| > |spark.kubernetes.hadoop.configMapName=hz10-hadoop-dir |FAILED| > > {code:java} > Kent@KentsMacBookPro > ~/Documents/spark-on-k8s/spark-3.0.0-SNAPSHOT-bin-2.7.3 bin/spark-submit > --conf spark.kubernetes.file.upload.path=hdfs://hz-cluster10/user/kyuubi/udf > --jars > /Users/Kent/Documents/spark-on-k8s/spark-3.0.0-SNAPSHOT-bin-2.7.3/hadoop-lzo-0.4.20-SNAPSHOT.jar > --conf spark.kerberos.keytab=/Users/Kent/Downloads/kyuubi.keytab --conf > spark.kerberos.principal=kyuubi/d...@hadoop.hz.netease.com --conf > spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf --name hehe --deploy-mode > cluster --class org.apache.spark.examples.HdfsTest > local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar > hdfs://hz-cluster10/user/kyuubi/hive_db/kyuubi.db/hive_tbl > Listening for transport dt_socket at address: 50014 > # spark.master=k8s://https://10.120.238.100:7443 > 19/08/27 17:21:06 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 19/08/27 17:21:07 INFO SparkKubernetesClientFactory: Auto-configuring K8S > client using current context from users K8S config file > Listening for transport dt_socket at address: 50014 > Exception in thread "main" org.apache.spark.SparkException: Uploading file > /Users/Kent/Documents/spark-on-k8s/spark-3.0.0-SNAPSHOT-bin-2.7.3/hadoop-lzo-0.4.20-SNAPSHOT.jar > failed... 
> at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:287) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatur# > spark.master=k8s://https://10.120.238.100:7443 > eStep.scala:165) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:89) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58) > at >
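The root cause reported above is that spark.kubernetes.hadoop.configMapName only mounts the Hadoop configuration into the driver and executor pods, while the upload of --jars/--files happens in the spark-submit process itself, which therefore cannot resolve hdfs://. A hypothetical client-side fail-fast check (requireResolvableScheme is not a Spark API, just an illustration of the constraint) would make the failure mode explicit:

```java
import java.net.URI;
import java.util.Set;

public class UploadPathCheck {
    // Hypothetical helper (not a Spark API): fail fast when the submission
    // client cannot resolve the scheme of spark.kubernetes.file.upload.path.
    public static String requireResolvableScheme(String uploadPath, Set<String> clientSchemes) {
        String scheme = URI.create(uploadPath).getScheme();
        if (scheme == null || !clientSchemes.contains(scheme)) {
            throw new IllegalStateException(
                "Scheme '" + scheme + "' of " + uploadPath + " is not configured on the client; "
              + "spark.kubernetes.hadoop.configMapName is only mounted into the pods, "
              + "so the spark-submit process also needs HADOOP_CONF_DIR");
        }
        return scheme;
    }

    public static void main(String[] args) {
        // With HADOOP_CONF_DIR loaded, the client would know about hdfs://.
        System.out.println(requireResolvableScheme(
            "hdfs://hz-cluster10/user/kyuubi/udf", Set.of("hdfs", "s3a", "file")));
    }
}
```

This is why the table in the report shows HADOOP_CONF_DIR succeeding and the config-map-only setup failing: the pods are configured either way, but only HADOOP_CONF_DIR configures the client.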
[jira] [Commented] (SPARK-28896) Spark client process is unable to upload jars to hdfs while using ConfigMap not HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-28896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921397#comment-16921397 ] Stavros Kontopoulos commented on SPARK-28896: - Changed to Spark 3.0.0. I will review the PR. > Spark client process is unable to upload jars to hdfs while using ConfigMap > not HADOOP_CONF_DIR > --- > > Key: SPARK-28896 > URL: https://issues.apache.org/jira/browse/SPARK-28896 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major
[jira] [Updated] (SPARK-28896) Spark client process is unable to upload jars to hdfs while using ConfigMap not HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-28896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28896: Affects Version/s: (was: 2.4.3) 3.0.0 > Spark client process is unable to upload jars to hdfs while using ConfigMap > not HADOOP_CONF_DIR > --- > > Key: SPARK-28896 > URL: https://issues.apache.org/jira/browse/SPARK-28896 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major
[jira] [Updated] (SPARK-28895) Spark client process is unable to upload jars to hdfs while using ConfigMap not HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-28895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28895: Affects Version/s: (was: 2.4.3) 3.0.0 > Spark client process is unable to upload jars to hdfs while using ConfigMap > not HADOOP_CONF_DIR > --- > > Key: SPARK-28895 > URL: https://issues.apache.org/jira/browse/SPARK-28895 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major
[jira] [Comment Edited] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921315#comment-16921315 ] Stavros Kontopoulos edited comment on SPARK-28921 at 9/3/19 10:25 AM: -- [~andygrove] could you please clarify what do you mean when you say? "jobs like Spark-Pi that do not launch executors run without a problem" I run a pi job and it creates executors fine: spark-pi-03afbd6cf6a72622-driver 1/1 Running 0 15s spark-pi-03afbd6cf6a72622-exec-1 1/1 Running 0 7s spark-pi-03afbd6cf6a72622-exec-2 1/1 Running 0 7s was (Author: skonto): [~andygrove] could you please clarify what do you mean when you say? "jobs like Spark-Pi that do not launch executors run without a problem" I run a pi job and it creates executors fine. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
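The stack trace shows where the failure surfaces: an executor-pod watch is a WebSocket upgrade, and after the CVE-2019-14809 hardening the API server answers 403 Forbidden instead of 101 Switching Protocols for the old client's request. A minimal sketch of the check that raises the error (modeled on, not copied from, okhttp's RealWebSocket.checkResponse):

```java
import java.net.ProtocolException;

public class UpgradeCheck {
    // A Kubernetes watch is a WebSocket upgrade, so any status other than
    // HTTP 101 (here the API server's 403) aborts the connection with the
    // ProtocolException seen in the log above.
    public static void checkUpgradeResponse(int code, String statusLine) throws ProtocolException {
        if (code != 101) {
            throw new ProtocolException("Expected HTTP 101 response but was '" + statusLine + "'");
        }
    }

    public static void main(String[] args) {
        try {
            checkUpgradeResponse(403, "403 Forbidden");
        } catch (ProtocolException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Upgrading kubernetes-client to 4.4.2, as the report suggests, changes what the client sends so the server accepts the upgrade; the check itself is correct behavior.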
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921315#comment-16921315 ] Stavros Kontopoulos commented on SPARK-28921: - [~andygrove] could you please clarify what do you mean when you say? "jobs like Spark-Pi that do not launch executors run without a problem" I run a pi job and it creates executors fine. > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
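The upgrade suggested above is a one-line dependency change. As a sketch (illustrative placement only; the actual coordinates are the io.fabric8 ones Spark's Kubernetes module already uses, and the version to pin should follow the client's compatibility matrix):

```xml
<!-- resource-managers/kubernetes/core/pom.xml (illustrative placement) -->
<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>kubernetes-client</artifactId>
  <!-- 4.4.2 is the version cited in the comment above -->
  <version>4.4.2</version>
</dependency>
```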
[jira] [Comment Edited] (SPARK-28953) Integration tests fail due to malformed URL
[ https://issues.apache.org/jira/browse/SPARK-28953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921087#comment-16921087 ] Stavros Kontopoulos edited comment on SPARK-28953 at 9/3/19 12:04 AM: -- [~shaneknapp] [~eje] I can fix this since I am working on SPARK-27936, but I'm wondering about the root cause. was (Author: skonto): [~shaneknapp] [~eje] I can fix this since I am working on SPARK-27936, but I'm wondering of the root cause. > Integration tests fail due to malformed URL > --- > > Key: SPARK-28953 > URL: https://issues.apache.org/jira/browse/SPARK-28953 > Project: Spark > Issue Type: Bug > Components: jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Tests failed on Ubuntu, verified on two different machines: > KubernetesSuite: > - Launcher client dependencies *** FAILED *** > java.net.MalformedURLException: no protocol: * http://172.31.46.91:30706 > at java.net.URL.(URL.java:600) > at java.net.URL.(URL.java:497) > at java.net.URL.(URL.java:446) > at > org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) > Type in expressions to have them evaluated. > Type :help for more information. 
> > scala> val pb = new ProcessBuilder().command("bash", "-c", "minikube service > ceph-nano-s3 -n spark --url") > pb: ProcessBuilder = java.lang.ProcessBuilder@46092840 > scala> pb.redirectErrorStream(true) > res0: ProcessBuilder = java.lang.ProcessBuilder@46092840 > scala> val proc = pb.start() > proc: Process = java.lang.UNIXProcess@5e9650d3 > scala> val r = org.apache.commons.io.IOUtils.toString(proc.getInputStream()) > r: String = > "* http://172.31.46.91:30706 > " > Although (no asterisk): > $ minikube service ceph-nano-s3 -n spark --url > [http://172.31.46.91:30706|http://172.31.46.91:30706/] > > This is weird because it fails at the java level, where does the asterisk > come from? > $ minikube version > minikube version: v1.3.1 > commit: ca60a424ce69a4d79f502650199ca2b52f29e631 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
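One defensive way around the stray asterisk, as a sketch (a hypothetical helper, not the actual DepsTestsSuite change): scan the captured process output and keep the first whitespace-separated token that `java.net.URL` accepts, which tolerates decorations such as the leading `*`.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ServiceUrl {
    // Return the first whitespace-separated token that parses as a URL,
    // or null if none does. Skips decorations such as the leading "*"
    // seen in the minikube output captured above.
    static String firstUrl(String raw) {
        for (String token : raw.trim().split("\\s+")) {
            try {
                new URL(token); // throws "no protocol: *" on the stray marker
                return token;
            } catch (MalformedURLException e) {
                // not a URL; keep scanning
            }
        }
        return null;
    }
}
```

For the captured string `"* http://172.31.46.91:30706\n"` this would yield `http://172.31.46.91:30706`, regardless of where the asterisk comes from.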
[jira] [Created] (SPARK-28953) Integration tests fail due to malformed URL
Stavros Kontopoulos created SPARK-28953: --- Summary: Integration tests fail due to malformed URL Key: SPARK-28953 URL: https://issues.apache.org/jira/browse/SPARK-28953 Project: Spark Issue Type: Bug Components: jenkins, Kubernetes Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos Tests failed on Ubuntu, verified on two different machines: KubernetesSuite: - Launcher client dependencies *** FAILED *** java.net.MalformedURLException: no protocol: * http://172.31.46.91:30706 at java.net.URL.(URL.java:600) at java.net.URL.(URL.java:497) at java.net.URL.(URL.java:446) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> val pb = new ProcessBuilder().command("bash", "-c", "minikube service ceph-nano-s3 -n spark --url") pb: ProcessBuilder = java.lang.ProcessBuilder@46092840 scala> pb.redirectErrorStream(true) res0: ProcessBuilder = java.lang.ProcessBuilder@46092840 scala> val proc = pb.start() proc: Process = java.lang.UNIXProcess@5e9650d3 scala> val r = org.apache.commons.io.IOUtils.toString(proc.getInputStream()) r: String = "* http://172.31.46.91:30706 " Although (no asterisk): $ minikube service ceph-nano-s3 -n spark --url [http://172.31.46.91:30706|http://172.31.46.91:30706/] This is weird because it fails at the java level, where does the asterisk come from? 
$ minikube version minikube version: v1.3.1 commit: ca60a424ce69a4d79f502650199ca2b52f29e631 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919583#comment-16919583 ] Stavros Kontopoulos commented on SPARK-28025: - Thanks I will have a look :) > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
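To make the leak concrete: a checksum file in Hadoop's local filesystem is named `.<file>.crc` and sits beside the file it covers, so a leaked entry is simply a `.crc` name whose base file no longer exists. A hypothetical audit helper (not Spark code) that spots such orphans in a directory listing:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrcAudit {
    // Hadoop's ChecksumFileSystem names a checksum file ".<file>.crc" and
    // places it next to the file it covers. A ".crc" entry whose base file
    // is gone was left behind by a rename/delete and can be reclaimed.
    static List<String> orphanedCrcs(Collection<String> names) {
        Set<String> present = new HashSet<>(names);
        List<String> orphans = new ArrayList<>();
        for (String n : names) {
            if (n.startsWith(".") && n.endsWith(".crc")) {
                // strip the leading "." and trailing ".crc" to get the base name
                String base = n.substring(1, n.length() - 4);
                if (!present.contains(base)) {
                    orphans.add(n);
                }
            }
        }
        return orphans;
    }
}
```

On a listing like `["1.delta", ".1.delta.crc", ".2.delta.crc"]` this flags `.2.delta.crc` as leaked; at the scale reported above (418k `.crc` files out of 431k total), almost every entry would be an orphan.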
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919577#comment-16919577 ] Stavros Kontopoulos edited comment on SPARK-28025 at 8/30/19 2:15 PM: -- [~kabhwan] cool, I'll have a look. was (Author: skonto): [~kabhwan] which PR? > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. 
Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919471#comment-16919471 ] Stavros Kontopoulos edited comment on SPARK-28025 at 8/30/19 11:54 AM: --- @[~dongjoon] [~zsxwing] this needs to be re-opened. When using the workaround we recently hit this issue: [https://github.com/broadinstitute/gatk/issues/1389] which can be fixed easily with a derived class like in this PR: [https://github.com/broadinstitute/gatk/pull/1421/files] but this is a bit inconvenient. However, I believe as well that this should be fixed in Spark (fewer surprises) otherwise we need to document it as [~kabhwan] said above. was (Author: skonto): @[~dongjoon] [~zsxwing] this needs to be re-opened. When using the workaround we recently hit this issue: [https://github.com/broadinstitute/gatk/issues/1389] which can be fixed easily with a derived class like in this PR: [https://github.com/broadinstitute/gatk/pull/1421/files] However, I believe as well that this should be fixed in Spark (fewer surprises) otherwise we need to document it as [~kabhwan] said above. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. 
It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27936) Support local dependency uploading from --py-files
[ https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900411#comment-16900411 ] Stavros Kontopoulos edited comment on SPARK-27936 at 8/5/19 9:06 PM: - [~eje] I have started working on this; maybe we need another ticket for R as well. I will be off from 7-26, so it will take some time; if someone else wants to do it, let me know, otherwise I will do a PR when I get back. was (Author: skonto): [~eje] I have started working on this; maybe we need another ticket for R as well. I will be off from 7-26, so it will take some time. > Support local dependency uploading from --py-files > -- > > Key: SPARK-27936 > URL: https://issues.apache.org/jira/browse/SPARK-27936 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > Support python dependency uploads, as in SPARK-23153 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27936) Support local dependency uploading from --py-files
[ https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900411#comment-16900411 ] Stavros Kontopoulos commented on SPARK-27936: - [~eje] I have started working on this, maybe we need another ticket for R as well, but will be off from 7-26 so will take some time. > Support local dependency uploading from --py-files > -- > > Key: SPARK-27936 > URL: https://issues.apache.org/jira/browse/SPARK-27936 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > Support python dependency uploads, as in SPARK-23153 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28465) K8s integration tests fail due to missing ceph-nano image
[ https://issues.apache.org/jira/browse/SPARK-28465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28465: Summary: K8s integration tests fail due to missing ceph-nano image (was: K8s integration tests fail due to non existent image) > K8s integration tests fail due to missing ceph-nano image > - > > Key: SPARK-28465 > URL: https://issues.apache.org/jira/browse/SPARK-28465 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Image added here: > [https://github.com/lightbend/spark/blob/72c80ee81ca4c3c9569749b54e2db0ec91b128a5/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L66] > needs to be updated to the latest as it was removed from dockerhub. > {quote}docker pull ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 > Error response from daemon: manifest for > ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 not found > {quote} > Also we need to apply this fix: > [https://github.com/ceph/cn/issues/115#issuecomment-497384369] > I will create a PR shortly. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28465) K8s integration tests fail due to non existent image
[ https://issues.apache.org/jira/browse/SPARK-28465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28465: Description: Image added here: [https://github.com/lightbend/spark/blob/72c80ee81ca4c3c9569749b54e2db0ec91b128a5/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L66] needs to be updated to the latest as it was removed from dockerhub. {quote}docker pull ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 Error response from daemon: manifest for ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 not found {quote} Also we need to apply this fix: [https://github.com/ceph/cn/issues/115#issuecomment-497384369] I will create a PR shortly. was: Image added here: [https://github.com/lightbend/spark/blob/72c80ee81ca4c3c9569749b54e2db0ec91b128a5/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L66] needs to be updated to the latest as it was removed from dockerhub. {quote}docker pull ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 Error response from daemon: manifest for ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 not found {quote} Also we need to apply this fix: [https://github.com/ceph/cn/issues/115#issuecomment-497384369] > K8s integration tests fail due to non existent image > > > Key: SPARK-28465 > URL: https://issues.apache.org/jira/browse/SPARK-28465 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Image added here: > [https://github.com/lightbend/spark/blob/72c80ee81ca4c3c9569749b54e2db0ec91b128a5/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L66] > needs to be updated to the latest as it was removed from dockerhub. 
> {quote}docker pull ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 > Error response from daemon: manifest for > ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64 not found > {quote} > Also we need to apply this fix: > [https://github.com/ceph/cn/issues/115#issuecomment-497384369] > I will create a PR shortly. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28465) K8s integration tests fail due to non existent image
Stavros Kontopoulos created SPARK-28465: --- Summary: K8s integration tests fail due to non existent image Key: SPARK-28465 URL: https://issues.apache.org/jira/browse/SPARK-28465 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos Image added here: [https://github.com/lightbend/spark/blob/72c80ee81ca4c3c9569749b54e2db0ec91b128a5/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L66] needs to be updated to the latest as it was removed from dockerhub. Also we need to apply this fix: https://github.com/ceph/cn/issues/115#issuecomment-497384369 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 3:26 PM: -- Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Afaik this is the same version for 2.4.3. Something else is not right. was (Author: skonto): Right now on master we have 4.1.2 [https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]. Afaik this is the same version for 2.4.3. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|[https://github.com/fabric8io/kubernetes-client#compatibility-matrix]]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888970#comment-16888970 ] Stavros Kontopoulos commented on SPARK-28444: - Right now on master we have [4.1.2 |[https://github.com/apache/spark/blob/453cbf3dd8df5ec4da844c93eb6000610b551541/resource-managers/kubernetes/core/pom.xml#L32]] > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|[https://github.com/fabric8io/kubernetes-client#compatibility-matrix]]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 1:46 PM: -- I am not sure this is a k8s client version issue; it seems more like a credentials issue. But let's find out. Have you tried to update the k8s client? Can you verify you can/can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions? Does it work with minikube 1.14? was (Author: skonto): I am not sure this is a k8s client version issue; it is more like a credentials issue. Have you tried to update the k8s client? Can you verify you can/can't create pods with a simple app (outside Spark) using the fabric8io k8s client in different versions? Does it work with minikube 1.14? > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Stavros Kontopoulos commented on SPARK-28444: - Am I not sure this is a k8s client version issue, it is more like a credentials issue. Have you tried to update the k8s client? > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|[https://github.com/fabric8io/kubernetes-client#compatibility-matrix]]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1639#comment-1639 ] Stavros Kontopoulos commented on SPARK-28444: - Probably you are hitting this one: https://issues.apache.org/jira/browse/SPARK-26833 > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|[https://github.com/fabric8io/kubernetes-client#compatibility-matrix]]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used
[ https://issues.apache.org/jira/browse/SPARK-28445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28445: Summary: Inconsistency between Scala and Python/Panda udfs when groupby with udf() is used (was: Inconsistency between Scala and Python/Panda udfs when groupby udef() is used) > Inconsistency between Scala and Python/Panda udfs when groupby with udf() is > used > - > > Key: SPARK-28445 > URL: https://issues.apache.org/jira/browse/SPARK-28445 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Python: > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("int", PandasUDFType.SCALAR) > def noop(x): > return x > spark.udf.register("udf", noop) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is > neither present in the group by, nor is it an aggregate function. 
Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, > udf(count(b#1)) AS udf(count(b))#12] > +- SubqueryAlias `testdata` > +- Project [a#0, b#1] > +- SubqueryAlias `testData` > +- LocalRelation [a#0, b#1] > Scala: > spark.udf.register("udf", (input: Int) => input) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show()
> +------------+-------------+
> |udf((a + 1))|udf(count(b))|
> +------------+-------------+
> |        null|            1|
> |           3|            2|
> |           4|            2|
> |           2|            2|
> +------------+-------------+ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby udef() is used
[ https://issues.apache.org/jira/browse/SPARK-28445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-28445: Component/s: PySpark > Inconsistency between Scala and Python/Panda udfs when groupby udef() is used > - > > Key: SPARK-28445 > URL: https://issues.apache.org/jira/browse/SPARK-28445 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Python: > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("int", PandasUDFType.SCALAR) > def noop(x): > return x > spark.udf.register("udf", noop) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show() > : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is > neither present in the group by, nor is it an aggregate function. 
Add to > group by or wrap in first() (or first_value) if you don't care which value > you get.;; > Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, > udf(count(b#1)) AS udf(count(b))#12] > +- SubqueryAlias `testdata` > +- Project [a#0, b#1] > +- SubqueryAlias `testData` > +- LocalRelation [a#0, b#1] > Scala: > spark.udf.register("udf", (input: Int) => input) > sql(""" > CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES > (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, > null) > AS testData(a, b)""") > sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + > 1)""").show()
> +------------+-------------+
> |udf((a + 1))|udf(count(b))|
> +------------+-------------+
> |        null|            1|
> |           3|            2|
> |           4|            2|
> |           2|            2|
> +------------+-------------+ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28445) Inconsistency between Scala and Python/Panda udfs when groupby udef() is used
Stavros Kontopoulos created SPARK-28445: --- Summary: Inconsistency between Scala and Python/Panda udfs when groupby udef() is used Key: SPARK-28445 URL: https://issues.apache.org/jira/browse/SPARK-28445 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos Python: from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("int", PandasUDFType.SCALAR) def noop(x): return x spark.udf.register("udf", noop) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show() : org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Aggregate [udf((a#0 + 1))], [udf((a#0 + 1)) AS udf((a + 1))#10, udf(count(b#1)) AS udf(count(b))#12] +- SubqueryAlias `testdata` +- Project [a#0, b#1] +- SubqueryAlias `testData` +- LocalRelation [a#0, b#1] Scala: spark.udf.register("udf", (input: Int) => input) sql(""" CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) AS testData(a, b)""") sql("""SELECT udf(a + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)""").show()
+------------+-------------+
|udf((a + 1))|udf(count(b))|
+------------+-------------+
|        null|            1|
|           3|            2|
|           4|            2|
|           2|            2|
+------------+-------------+ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
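The Scala output above can be sanity-checked without Spark at all. The following plain-Python sketch (illustrative only; `group_by_a_plus_1` is a made-up helper, not a Spark API) reproduces the semantics of `SELECT a + 1, COUNT(b) ... GROUP BY a + 1` over the ticket's test data, including NULL propagating through `a + 1` and `COUNT(b)` skipping NULLs:

```python
from collections import defaultdict

# Sample data from the ticket: (a, b) pairs, with None standing in for SQL NULL.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2),
        (None, 1), (3, None), (None, None)]

def group_by_a_plus_1(rows):
    """Group rows by the value of (a + 1) and count non-null b per group,
    mirroring SELECT a + 1, COUNT(b) FROM testData GROUP BY a + 1."""
    counts = defaultdict(int)
    for a, b in rows:
        key = None if a is None else a + 1  # NULL propagates through a + 1
        if b is not None:                   # COUNT(b) ignores NULL values of b
            counts[key] += 1
        else:
            counts.setdefault(key, 0)       # group still exists, count stays 0+
    return dict(counts)

# Matches the Scala result table: null -> 1, 2 -> 2, 3 -> 2, 4 -> 2.
print(group_by_a_plus_1(rows))
```

The Python/Pandas path fails analysis instead of producing this result, which is the inconsistency the ticket reports: the analyzer does not recognize `udf(a + 1)` in the SELECT list as matching the GROUP BY expression.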
[jira] [Commented] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1608#comment-1608 ] Stavros Kontopoulos commented on SPARK-28444: - Hi [~patrick-winter-swisscard]. On our CI we are using v1.15 and tests pass; could you add some log output showing why pods are not created? We need to be compliant with the compatibility matrix, but we still don't have a good answer to the problem of catching up with k8s; it moves fast. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
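The compatibility-matrix constraint discussed in this thread is, at its core, a lookup from client version to the set of server versions it supports. A minimal sketch of that check — the version entries below are placeholders for illustration, not the real fabric8io matrix, which should be read from the project's README:

```python
# Hypothetical excerpt of a client -> supported-k8s-minor-versions table.
# These sets are ASSUMED for the example; consult the fabric8io
# kubernetes-client README for the actual support windows.
SUPPORTED = {
    "4.1.2": {"1.9", "1.10", "1.11", "1.12", "1.13"},
    "4.3.0": {"1.12", "1.13", "1.14", "1.15"},
}

def client_supports(client_version: str, k8s_version: str) -> bool:
    """True if the given client version lists the k8s minor version as supported."""
    return k8s_version in SUPPORTED.get(client_version, set())

# Under these assumed entries, 4.1.2 would not cover k8s 1.14, while 4.3.0 would.
print(client_supports("4.1.2", "1.14"))
print(client_supports("4.3.0", "1.14"))
```

Note that "not listed in the matrix" does not always mean "broken in practice" — as the comment above points out, the CI runs against v1.15 and the tests pass — which is why log output is needed before concluding the client version is the cause.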
[jira] [Comment Edited] (SPARK-28444) Bump Kubernetes Client Version to 4.3.0
[ https://issues.apache.org/jira/browse/SPARK-28444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1608#comment-1608 ] Stavros Kontopoulos edited comment on SPARK-28444 at 7/19/19 11:44 AM: --- Hi [~patrick-winter-swisscard]. On our CI we are using v1.15 and tests pass; could you add some log output showing why pods are not created? We need to be compliant with the compatibility matrix, but we still don't have a good answer to the problem of catching up with k8s; it moves fast. was (Author: skonto): Hi [~patrick-winter-swisscard]. On our ci we are using v1.15 and tests pass, could you add some log output showing why pods are not created. We need to be compliant with the compatibility matrix but still we dotn have a good answer to the problem of catching up with k8s, it moves fast. > Bump Kubernetes Client Version to 4.3.0 > --- > > Key: SPARK-28444 > URL: https://issues.apache.org/jira/browse/SPARK-28444 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Patrick Winter >Priority: Major > > Spark is currently using the Kubernetes client version 4.1.2. This client > does not support the current Kubernetes version 1.14, as can be seen on the > [compatibility > matrix|https://github.com/fabric8io/kubernetes-client#compatibility-matrix]. > Therefore the Kubernetes client should be bumped up to version 4.3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885234#comment-16885234 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/15/19 1:43 PM: -- np :). There will be a 2.4.4 release, so there is a chance to fix it there. As for the policy of maintenance releases, you are probably right; I am not sure what falls into that category though. On one hand you are targeting K8s releases that are way ahead, and on the other you use an old client that does not support them (check fabric8io's compatibility matrix). We had a long discussion about which k8s versions to support; it is a project with high velocity that does not match Spark release planning. So, for good or bad, there is a bug introduced and we need a fix; I am not sure if there is a workaround like the one with the ping interval. For the user, the temporary workaround right now is to stop their session. The jackson-core dependency is another important upgrade, but also a hard one. Actually, a customer asked about this because it didn't pass security checks. That means 2.4.x is not acceptable for some people. Personally I was not aware of the daemon thread issue. I hope Spark 3.0.0 will solve these two issues once and for all. was (Author: skonto): np :). There will be a 2.4.4 release, so there is a chance to fix it there. As for the policy of maintenance releases, you are probably right; I am not sure what falls into that category though. On one hand you are targeting K8s releases that are way ahead, and on the other you use an old client that does not support them (check fabric8io's compatibility matrix). We had a long discussion about which k8s versions to support; it is a project with high velocity that does not match Spark release planning. So, for good or bad, there is a bug introduced and we need a fix; I am not sure if there is a workaround like the one with the ping interval. The jackson-core dependency is another important upgrade, but also a hard one. Actually, a customer asked about this because it didn't pass security checks. That means 2.4.x is not acceptable for some people. For the user, the temporary workaround right now is to stop their session. Personally I was not aware of the daemon thread issue. I hope Spark 3.0.0 will solve these two issues once and for all. > driver pod hangs with pyspark 2.4.3 and master on kubernetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executor are just hanging. We > see the output of this python script.
> {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
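The shutdown-hook hang described in this ticket comes down to thread lifecycle: the JVM only runs shutdown hooks after every non-daemon thread has exited, so a single lingering non-daemon thread (such as a client's keep-alive thread) can keep the driver pod alive indefinitely. The same mechanism can be illustrated in plain Python's `threading` module — an analogy to the JVM behavior, not Spark code:

```python
import threading

def worker(stop_event: threading.Event) -> None:
    # Blocks until told to stop -- stands in for a client ping/keep-alive loop.
    stop_event.wait()

stop = threading.Event()
# daemon=False: like a JVM non-daemon thread, this keeps the process alive
# (and delays exit-time cleanup) until it finishes on its own.
blocker = threading.Thread(target=worker, args=(stop,), daemon=False)
blocker.start()

# The fix mirrors the workaround in the thread above (stopping the session):
# explicitly release the blocking thread so shutdown can proceed.
stop.set()
blocker.join(timeout=5)
```

Marking such housekeeping threads as daemon threads (or shutting the client down explicitly) lets the process exit and the shutdown hooks run.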
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885234#comment-16885234 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/15/19 1:42 PM:
--

np :). There will be a 2.4.4 release, so there is a chance to fix it there. As for the policy of maintenance releases, you are probably right; I am not sure what falls into that category, though. On one hand you are targeting K8s releases that are way ahead, and on the other you use an old client that does not support them (check fabric8io's compatibility matrix). We had a long discussion about which K8s versions to support; it is a project with a high velocity that does not match Spark's release planning. So, for good or bad, there is a bug introduced and we need a fix; I am not sure if there is a workaround like the one with the ping interval. The jackson-core issue is another important one, but also hard to upgrade. Actually, a customer asked for this because it didn't pass security checks. That means 2.4.x is not acceptable for some people. For the user, the temporary workaround right now is to stop his session. Personally, I was not aware of the daemon thread issue. I hope Spark 3.0.0 will solve these two issues once and for all.

was (Author: skonto):
np :). There will be a 2.4.4 release, so there is a chance to fix it there. As for the policy of maintenance releases, you are probably right; I am not sure what falls into that category, though. On one hand you are targeting K8s releases that are way ahead, and on the other you use an old client that does not support them (check fabric8io's compatibility matrix). We had a long discussion about which K8s versions to support; it is a project with a high velocity that does not match Spark's release planning. So, for good or bad, there is a bug introduced and we need a fix; I am not sure if there is a workaround like the one with the ping interval. The jackson-core issue is another important one, but also hard to upgrade. Actually, a customer asked for this because it didn't pass security checks. That means 2.4.x is not acceptable for some people. Personally, I was not aware of the daemon thread issue. I hope Spark 3.0.0 will solve these two issues once and for all.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> -----------------------------------------------------------
>
>                 Key: SPARK-27927
>                 URL: https://issues.apache.org/jira/browse/SPARK-27927
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, PySpark
>    Affects Versions: 3.0.0, 2.4.3
>         Environment: k8s 1.11.9
>                      spark 2.4.3 and master branch.
>            Reporter: Edwin Biemond
>            Priority: Major
>         Attachments: driver_threads.log, executor_threads.log
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
>     str(spark.sparkContext),
>     spark.sparkContext.defaultParallelism,
>     spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We
> see the output of this python script.
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world>
> parallelism=2 python version=3.6{noformat}
> What works:
> * a simple python with a print works fine on 2.4.3 and 3.0.0
> * same setup on 2.4.0
> * 2.4.3 spark-submit with the above pyspark

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
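The daemon-thread issue mentioned above can be illustrated outside of Spark. The sketch below is hypothetical demo code (plain Python, not Spark internals): it shows that a process whose main function has returned still cannot exit while a non-daemon background thread is alive, which is analogous to how the k8s client's non-daemon OkHttp threads keep the driver JVM hanging.

```python
# Hypothetical demo (not Spark code): a non-daemon background thread keeps a
# process alive after the main function returns, analogous to the OkHttp
# threads of the k8s client keeping the Spark driver JVM from exiting.
import subprocess
import sys
import textwrap

CHILD = textwrap.dedent("""
    import threading, time
    # Background thread that outlives the main function by far.
    t = threading.Thread(target=time.sleep, args=(60,), daemon={daemon})
    t.start()
    print("main done")
""")

def exits_promptly(daemon: bool, timeout: float = 5.0) -> bool:
    """Run the child script; True if the process exits within `timeout` seconds."""
    try:
        subprocess.run([sys.executable, "-c", CHILD.format(daemon=daemon)],
                       timeout=timeout, check=True)
        return True
    except subprocess.TimeoutExpired:
        return False  # the non-daemon thread kept the process alive

if __name__ == "__main__":
    print(exits_promptly(daemon=True))   # daemon thread: process exits promptly
    print(exits_promptly(daemon=False))  # non-daemon thread: process hangs
```

With daemon=True the child exits as soon as "main done" is printed; with daemon=False the interpreter waits on the sleeping thread, mirroring the hanging driver pod.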
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884791#comment-16884791 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/15/19 12:41 AM:
---

I was able to reproduce it easily with 2.4.3 and this similar code:

from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

I also commented out this part in EventLoop to make it stay in a runnable state (may not be required):

// val event = eventQueue.take()
// try {
//   onReceive(event)
// } catch {
//   case NonFatal(e) =>
//     try {
//       onError(e)
//     } catch {
//       case NonFatal(e) => logError("Unexpected error in " + name, e)
//     }
// }

I used this tool [https://github.com/jglick/jkillthread] to kill the event loop successfully and then tried to kill the other OkHttp thread:

19/07/15 00:12:06 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Killing "OkHttp https://kubernetes.default.svc/..."
Did not find "OkHttp https://kubernetes.default.svc/..."
Killing "dag-scheduler-event-loop"
Killing "OkHttp WebSocket https://kubernetes.default.svc/..."
Exception in thread "OkHttp WebSocket https://kubernetes.default.svc/..." java.lang.IllegalMonitorStateException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.signal(AbstractQueuedSynchronizer.java:1939)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1103)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Killing "OkHttp WebSocket https://kubernetes.default.svc/..."
Exception in thread "OkHttp WebSocket https://kubernetes.default.svc/..." java.lang.IllegalMonitorStateException

Unfortunately I can't kill the latter, as another one is created. Anyway, that means that this is just another case of https://issues.apache.org/jira/browse/SPARK-27812. spark.stop() obviously stops the k8s client and everything finishes as expected.

was (Author: skonto):
I was able to reproduce it easily with 2.4.3 and this similar code:

from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

I used this tool [https://github.com/jglick/jkillthread] to kill the event loop successfully and then tried to kill the other OkHttp thread:

19/07/15 00:12:06 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Killing "OkHttp https://kubernetes.default.svc/..."
Did not find "OkHttp https://kubernetes.default.svc/..."
Killing "dag-scheduler-event-loop"
Killing "OkHttp WebSocket https://kubernetes.default.svc/..."
Exception in thread "OkHttp WebSocket https://kubernetes.default.svc/..." java.lang.IllegalMonitorStateException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.signal(AbstractQueuedSynchronizer.java:1939)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1103)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Killing "OkHttp WebSocket https://kubernetes.default.svc/..."
Exception in thread "OkHttp WebSocket https://kubernetes.default.svc/..." java.lang.IllegalMonitorStateException

Unfortunately I can't kill the latter, as another one is created. Anyway, that means that this is just another case of https://issues.apache.org/jira/browse/SPARK-27812. spark.stop() obviously stops the k8s client and everything finishes as expected.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> -----------------------------------------------------------
>
>                 Key: SPARK-27927
>                 URL: https://issues.apache.org/jira/browse/SPARK-27927
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, PySpark
>    Affects Versions: 3.0.0,
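Since spark.stop() is what finally lets everything shut down, a defensive pattern for users hitting this is to guarantee the stop call even when the job fails. The sketch below is a hedged illustration only: `FakeSession` and `session_scope` are hypothetical names, with `FakeSession` standing in for pyspark's SparkSession; with real pyspark one would pass a factory such as `SparkSession.builder.getOrCreate`.

```python
# Hedged sketch of the spark.stop() workaround: ensure the session (and thus
# the k8s client's OkHttp threads) is always shut down, even on error.
# `FakeSession` is a hypothetical stand-in for pyspark's SparkSession.
from contextlib import contextmanager

class FakeSession:
    """Stand-in for SparkSession: records whether stop() was called."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

@contextmanager
def session_scope(factory):
    """Yield a session and guarantee session.stop() runs on exit."""
    session = factory()
    try:
        yield session
    finally:
        # Without this, non-daemon client threads keep the driver alive.
        session.stop()

if __name__ == "__main__":
    with session_scope(FakeSession) as spark:
        pass  # job logic goes here
```

The finally block runs on both the normal and the error path, so the client's threads cannot keep the driver pod hanging after the job body completes or raises.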
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884791#comment-16884791 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/15/19 12:38 AM: --- I was able to reproduce it easily with 2.4.3 and this similar code: from __future__ import print_function import sys from random import random from operator import add from pyspark.sql import SparkSession if __name__ == "__main__":( """ Usage: pi [partitions] """ spark = SparkSession\ .builder\ .appName("PythonPi")\ .getOrCreate() I used this tool [https://github.com/jglick/jkillthread] to kill eventloop and then the other okhttp thread: 19/07/15 00:12:06 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint Killing "OkHttp [https://kubernetes.default.svc/]...; Did not find "OkHttp [https://kubernetes.default.svc/]...; Killing "dag-scheduler-event-loop" Killing "OkHttp WebSocket [https://kubernetes.default.svc/]...; Exception in thread "OkHttp WebSocket [https://kubernetes.default.svc/]...; java.lang.IllegalMonitorStateException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.signal(AbstractQueuedSynchronizer.java:1939) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1103) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Killing "OkHttp WebSocket [https://kubernetes.default.svc/]...; Exception in thread "OkHttp WebSocket [https://kubernetes.default.svc/]...; java.lang.IllegalMonitorStateException Unfortunately I cant the kill the latter as another one is created. 
Anyway, that means this is just another case of https://issues.apache.org/jira/browse/SPARK-27812 . spark.stop() obviously stops the k8s client and everything finishes as expected. was (Author: skonto): I was able to reproduce it easily with 2.4.3 and this similar code: from __future__ import print_function import sys from random import random from operator import add from pyspark.sql import SparkSession if __name__ == "__main__": """ Usage: pi [partitions] """ spark = SparkSession\ .builder\ .appName("PythonPi")\ .getOrCreate() I used this tool [https://github.com/jglick/jkillthread] to kill the event-loop thread and then the other OkHttp thread: 19/07/15 00:12:06 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint Killing "OkHttp [https://kubernetes.default.svc/]...; Did not find "OkHttp [https://kubernetes.default.svc/]...; Killing "dag-scheduler-event-loop" Killing "OkHttp WebSocket [https://kubernetes.default.svc/]...; Exception in thread "OkHttp WebSocket [https://kubernetes.default.svc/]...; java.lang.IllegalMonitorStateException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.signal(AbstractQueuedSynchronizer.java:1939) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1103) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Killing "OkHttp WebSocket [https://kubernetes.default.svc/]...; Exception in thread "OkHttp WebSocket [https://kubernetes.default.svc/]...; java.lang.IllegalMonitorStateException Unfortunately I can't kill the latter, as another one is created. 
Anyway that means that this is just another case of https://issues.apache.org/jira/browse/SPARK-27812 . spark.stop() obviously stops the k8s client and everything finishes as expected. > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python >
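The root cause sketched in the comment above — a lingering non-daemon thread (SPARK-27812) keeps the JVM alive after main returns, while calling spark.stop() releases it — has a direct analogue in plain Python. The sketch below is purely illustrative (not Spark code; the worker threads stand in for Spark's OkHttp/event-loop threads):

```python
import subprocess
import sys
import textwrap

# Child whose main function returns while a NON-daemon thread is still
# running: the interpreter waits for that thread, so the process hangs --
# the analogue of the stuck driver pod.
HANG = textwrap.dedent("""
    import threading, time
    threading.Thread(target=lambda: time.sleep(60)).start()
    print("main done")
""")

# Same program with an explicit shutdown of the worker before main
# returns -- the analogue of calling spark.stop() at the end of the job.
CLEAN = textwrap.dedent("""
    import threading
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait)
    worker.start()
    print("main done")
    stop.set()    # release the worker, as spark.stop() releases Spark's threads
    worker.join()
""")

def exits_within(source, seconds):
    """Run `source` in a fresh interpreter; True if it exits in time."""
    try:
        subprocess.run([sys.executable, "-c", source],
                       stdout=subprocess.DEVNULL, timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        return False  # subprocess.run kills the child on timeout

hang_exits = exits_within(HANG, 3)    # lingering non-daemon thread blocks exit
clean_exits = exits_within(CLEAN, 3)  # explicit stop lets the process finish
print(hang_exits, clean_exits)
```

The same rule explains the driver behaviour: the process only terminates once every non-daemon thread has finished, which is exactly what spark.stop() arranges.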
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884791#comment-16884791 ] Stavros Kontopoulos commented on SPARK-27927: - I was able to reproduce it easily with 2.4.3. I used this tool [https://github.com/jglick/jkillthread] to kill the event-loop thread and then the other OkHttp thread: 19/07/15 00:12:06 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint Killing "OkHttp https://kubernetes.default.svc/...; Did not find "OkHttp https://kubernetes.default.svc/...; Killing "dag-scheduler-event-loop" Killing "OkHttp WebSocket https://kubernetes.default.svc/...; Exception in thread "OkHttp WebSocket https://kubernetes.default.svc/...; java.lang.IllegalMonitorStateException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.signal(AbstractQueuedSynchronizer.java:1939) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1103) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Killing "OkHttp WebSocket https://kubernetes.default.svc/...; Exception in thread "OkHttp WebSocket https://kubernetes.default.svc/...; java.lang.IllegalMonitorStateException Unfortunately I can't kill the latter, as another one is created. 
Anyway that means that this is just another case of https://issues.apache.org/jira/browse/SPARK-27812 > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
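The java.lang.IllegalMonitorStateException in the trace above comes from ConditionObject.signal() being invoked without its monitor held, after jkillthread forcibly broke the worker out of DelayedWorkQueue.take(). As an illustrative analogue (CPython, not JVM code), threading.Condition enforces the same rule and raises RuntimeError:

```python
import threading

cond = threading.Condition()

def notify_without_lock(c):
    """Signal a condition variable without holding its lock.

    Python raises RuntimeError here; the JVM analogue is the
    IllegalMonitorStateException thrown from
    AbstractQueuedSynchronizer$ConditionObject.signal() in the trace above.
    """
    try:
        c.notify()
        return None
    except RuntimeError as exc:
        return type(exc).__name__

result = notify_without_lock(cond)
print(result)  # RuntimeError

# Holding the lock first is the legal pattern:
with cond:
    cond.notify()  # no exception
```

In other words, the exception is a symptom of the thread being killed in an inconsistent lock state, not the underlying hang itself.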
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884675#comment-16884675 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/14/19 1:50 PM: -- That call, among others, creates a SparkContext if it does not exist. The SparkContext will start the DAG scheduler thread, which starts this EventLoop thread. We have the following facts: a) a non-daemon thread is running due to https://issues.apache.org/jira/browse/SPARK-27812 b) a daemon thread is blocked, which could cause issues ([https://meteatamel.wordpress.com/2012/05/22/when-a-daemon-thread-is-not-so-daemon/]) c) no shutdown hook was run although main has exited, as the JVM cannot exit. I would start by commenting out [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47], rebuilding, and re-running the job. Since this is a dummy job with no actions it does not matter. If the JVM still does not exit then the only explanation is that https://issues.apache.org/jira/browse/SPARK-27812 stops us from that. If it exits then it could mean that for some reason in 2.4.0 EventLoop will not have the time to block as things move faster (we can show that by adding logging). was (Author: skonto): That call, among others, creates a SparkContext if it does not exist. The SparkContext will start the DAG scheduler thread, which starts this EventLoop thread. We have the following facts: a) a non-daemon thread is running due to https://issues.apache.org/jira/browse/SPARK-27812 b) a daemon thread is blocked, which could cause issues ([https://meteatamel.wordpress.com/2012/05/22/when-a-daemon-thread-is-not-so-daemon/]) c) no shutdown hook was run although main has exited, as the JVM cannot exit. I would start by commenting out [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47], rebuilding, and re-running the job. Since this is a dummy job with no actions it does not matter. 
If jvm still does not exit then the only explanation is that https://issues.apache.org/jira/browse/SPARK-27812 stops us from that. If it exits then it could mean that for some reason in 2.4.0 EventLoop will not have the time to block as things move faster. > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
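The debugging step proposed above boils down to finding out which non-daemon threads are still alive when main exits — what the attached driver_threads.log thread dump shows for the JVM. A minimal Python analogue of that scan (the thread name below is made up for illustration):

```python
import threading

def lingering_non_daemon_threads():
    """Names of live non-daemon threads other than the main thread.

    In the JVM case, these are the threads a thread dump would show
    keeping the process alive and preventing shutdown hooks from running.
    """
    main = threading.main_thread()
    return [t.name for t in threading.enumerate()
            if t.is_alive() and not t.daemon and t is not main]

# Start a non-daemon worker (named after the stuck Spark thread purely
# for illustration) and it shows up in the scan:
stop = threading.Event()
worker = threading.Thread(name="dag-scheduler-event-loop", target=stop.wait)
worker.start()
found = lingering_non_daemon_threads()
print(found)  # includes 'dag-scheduler-event-loop'

# Once it is released and joined, the worker drops out of the scan:
stop.set()
worker.join()
print(lingering_non_daemon_threads())
```

Running such a scan just before main returns would immediately distinguish the EventLoop hypothesis from the SPARK-27812 OkHttp thread.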
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884675#comment-16884675 ] Stavros Kontopoulos commented on SPARK-27927: - That call, among others, creates a SparkContext if one does not exist. The SparkContext starts the DAG scheduler thread, which starts this EventLoop thread. We have these facts: a) a non-daemon thread is running due to https://issues.apache.org/jira/browse/SPARK-27812, b) a daemon thread is blocked, which could cause issues ([https://meteatamel.wordpress.com/2012/05/22/when-a-daemon-thread-is-not-so-daemon/]), c) no shutdown hook was run although main has exited, because the JVM cannot exit. I would start by commenting out [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47], rebuilding, and re-running the job. Since this is a dummy job with no actions it does not matter. If the JVM still does not exit, the only explanation is that https://issues.apache.org/jira/browse/SPARK-27812 prevents it.
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884675#comment-16884675 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/14/19 1:21 PM: -- That call, among others, creates a SparkContext if one does not exist. The SparkContext starts the DAG scheduler thread, which starts this EventLoop thread. We have the following facts: a) a non-daemon thread is running due to https://issues.apache.org/jira/browse/SPARK-27812, b) a daemon thread is blocked, which could cause issues ([https://meteatamel.wordpress.com/2012/05/22/when-a-daemon-thread-is-not-so-daemon/]), c) no shutdown hook was run although main has exited, because the JVM cannot exit. I would start by commenting out [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47], rebuilding, and re-running the job. Since this is a dummy job with no actions it does not matter. If the JVM still does not exit, the only explanation is that https://issues.apache.org/jira/browse/SPARK-27812 prevents it. was (Author: skonto): That call, among others, creates a SparkContext if one does not exist. The SparkContext starts the DAG scheduler thread, which starts this EventLoop thread. We have these facts: a) a non-daemon thread is running due to https://issues.apache.org/jira/browse/SPARK-27812, b) a daemon thread is blocked, which could cause issues ([https://meteatamel.wordpress.com/2012/05/22/when-a-daemon-thread-is-not-so-daemon/]), c) no shutdown hook was run although main has exited, because the JVM cannot exit. I would start by commenting out [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47], rebuilding, and re-running the job. Since this is a dummy job with no actions it does not matter. If the JVM still does not exit, the only explanation is that https://issues.apache.org/jira/browse/SPARK-27812 prevents it.
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/14/19 1:23 PM: -- Yes, this needs debugging (building Spark with extra log statements is one way to do it), but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. Another question is why there is no PythonRunner thread; has that exited? was (Author: skonto): Yes, this needs debugging (build Spark with extra log statements), but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. Another question is why there is no PythonRunner thread; has that exited?
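The stop path being discussed (EventLoop.scala#L78: set a stopped flag, interrupt the event thread, then join it) can be sketched in Python. Python threads cannot be interrupted, so a sentinel message stands in for Thread.interrupt here; the structure (flag, wake-up, join) is otherwise the same, and it shows why a missing wake-up would leave the joining thread blocked forever. This is an illustrative analogue of the pattern, not the Spark implementation:

```python
import queue
import threading

_SENTINEL = object()  # stands in for Thread.interrupt in the Scala code

class EventLoopSketch:
    """Loose Python analogue of org.apache.spark.util.EventLoop."""

    def __init__(self):
        self._queue = queue.Queue()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stopped.is_set():
            event = self._queue.get()   # blocks, like eventQueue.take()
            if event is _SENTINEL:      # the "interrupt" that lets us exit
                return
            self.on_receive(event)

    def on_receive(self, event):
        pass  # subclasses would handle events here

    def start(self):
        self._thread.start()

    def stop(self):
        # Mirrors EventLoop.stop(): mark stopped, wake the blocked thread,
        # then join it. Without the wake-up, the thread never leaves its
        # blocking get() and join() blocks forever, which is the hang
        # hypothesised in this comment thread.
        self._stopped.set()
        self._queue.put(_SENTINEL)
        self._thread.join(timeout=5)

loop = EventLoopSketch()
loop.start()
loop.stop()
```

Commenting out the `self._queue.put(_SENTINEL)` line reproduces the failure mode: `stop()` then waits on a thread that never wakes up.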
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 2:20 PM: -- Yes, this needs debugging (build Spark with extra log statements), but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. Another question is why there is no PythonRunner thread; has that exited? was (Author: skonto): Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. Another question is why there is no PythonRunner thread; has that exited?
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 2:13 PM: -- Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. Another question is why there is no PythonRunner thread; has that exited? was (Author: skonto): Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case.
By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017.
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:56 PM: -- Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. was (Author: skonto): Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, since DestroyJavaVM appears as a thread in your dump, the shutdown process has started but is blocked.
By the way, are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017.
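The shutdown-hook observation in these comments has a direct analogue with Python's atexit: exit handlers run only once the interpreter actually begins shutting down, and a lingering non-daemon thread postpones that indefinitely. A small illustrative experiment (not Spark code) that runs a child interpreter and inspects its output:

```python
import subprocess
import sys
import textwrap

# Child script: registers an exit hook, then leaves a blocked daemon thread
# behind. Because the thread is a daemon, shutdown proceeds and the hook runs.
# With daemon=False the child would hang and the hook would never run, which
# is the situation described in this issue.
child = textwrap.dedent("""
    import atexit, queue, threading

    atexit.register(lambda: print("shutdown hook ran"))
    threading.Thread(target=queue.Queue().get, daemon=True).start()
    print("main exited")
""")

out = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True, text=True, timeout=30,
).stdout
print(out)
```

The hook message appears after "main exited": main finishing is not enough, the process must actually be able to shut down.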
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:53 PM: -- Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, since DestroyJavaVM appears as a thread in your dump, the shutdown process has started but is blocked. Are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. was (Author: skonto): Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, since DestroyJavaVM appears as a thread in your dump, the shutdown process has started but is blocked. Are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017.
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:52 PM: -- Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, since DestroyJavaVM appears as a thread in your dump, the shutdown process has started but is blocked. Are you using the same JDK? We need to make sure behavior has not changed, as in https://bugs.openjdk.java.net/browse/JDK-8154017. was (Author: skonto): Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, since DestroyJavaVM appears as a thread in your dump, the shutdown process has started but is blocked. Are you using the same JDK?
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:51 PM: -- Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case. By the way, since DestroyJavaVM appears as a thread in your dump, the shutdown process has started but is blocked. Are you using the same JDK? was (Author: skonto): Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt the EventLoop thread cannot exit. Does this ever happen? (Logging would help, but I suspect it never does.) Stop is called there by the shutdown hook when the SparkContext is stopped. So the diff with the working version comes down to why the shutdown happens in the working case.
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884374#comment-16884374 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:46 PM: -- Yes, this needs debugging, but if you check the code there, there is an interrupt call by the other thread that joins the EventLoop one: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78] Without the interrupt, the EventLoop thread cannot exit. Does this ever happen (logging would help, but I suspect it never does)? Stop is called there by the shutdown hook when the SparkContext is stopped. So the difference from the working version will be why the shutdown actually happens in the working case. > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark job on spark 2.4.3 or 3.0.0, the driver pod hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes, the driver and executor are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works: > * a simple python script with a print works fine on 2.4.3 and 3.0.0 > * the same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
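The interrupt discussed above is what unblocks the event-loop thread from its blocking queue take; the hang is what you get when it never fires. Here is a minimal Python sketch of that pattern (hypothetical names; the real EventLoop.scala blocks on LinkedBlockingDeque.take() and stop() uses Thread.interrupt(), so a stop sentinel stands in for the interrupt here):

```python
import queue
import threading

class EventLoop:
    """Sketch of Spark's EventLoop pattern (illustrative, not Spark's API).

    The worker blocks on get(), like the Scala thread blocks on take().
    stop() enqueues a sentinel, playing the role of the interrupt call:
    without it, the worker would park on get() forever and join() below
    would hang -- the behaviour described in this issue.
    """
    _STOP = object()  # sentinel standing in for Thread.interrupt()

    def __init__(self):
        self._events = queue.Queue()
        self._seen = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            event = self._events.get()  # blocks, like deque.take()
            if event is self._STOP:
                return                  # only exit path while parked
            self._seen.append(event)

    def start(self):
        self._thread.start()

    def post(self, event):
        self._events.put(event)

    def stop(self):
        self._events.put(self._STOP)    # analogous to the interrupt
        self._thread.join(timeout=5)    # would hang without the sentinel

loop = EventLoop()
loop.start()
loop.post("job-1")
loop.post("job-2")
loop.stop()
print(loop._seen)  # ['job-1', 'job-2']
```

Commenting out the `put(self._STOP)` line reproduces the symptom in miniature: the joining thread waits on a worker that has no way to leave its blocking call.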
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:37 AM: --- It is on 2.4.0: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47] Not sure if it is the k8s client in this case, because if you check my thread dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b] in [https://github.com/apache/spark/pull/24796] (the report is recent only because I didn't file it earlier; this failing pi job has been there for at least a year, but I didn't have time...), these k8s threads still exist, but they were not the root cause in the case with the exception. In any case we need to spot the root cause, because we don't know how we ended up with different results anyway. So my question is why that thread is blocked there, and we should debug the execution sequence in both cases, e.g. by adding logging. If it were the K8s threads, I would expect to see only those threads blocked, but the event loop is blocked as well; my $0.02.
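The thread dump referenced above is the key diagnostic: it shows each thread's name and where it is parked. The same kind of jstack-style snapshot can be taken from Python's standard library; a small sketch (the thread name is borrowed from the linked gist purely for illustration):

```python
import sys
import threading
import time
import traceback

def dump_threads():
    """Return a jstack-like snapshot: a stack trace per live thread, so a
    parked thread (like dag-scheduler-event-loop in the linked gist) can
    be spotted without attaching a debugger."""
    frames = sys._current_frames()
    out = []
    for t in threading.enumerate():
        out.append('"%s" daemon=%s\n' % (t.name, t.daemon))
        frame = frames.get(t.ident)
        if frame is not None:
            out.extend(traceback.format_stack(frame))
    return "".join(out)

# A worker parked on an Event, mimicking a thread blocked on a queue take.
gate = threading.Event()
worker = threading.Thread(target=gate.wait,
                          name="dag-scheduler-event-loop", daemon=True)
worker.start()
time.sleep(0.1)              # give the worker time to park
snapshot = dump_threads()
gate.set()                   # unblock so the worker can exit
print('"dag-scheduler-event-loop"' in snapshot)  # True
```

In a real investigation you would trigger such a dump from a signal handler (or use `faulthandler.dump_traceback`) at the moment the driver appears hung, then compare the parked frames between the working and failing runs, which is exactly the debugging-by-logging approach suggested in the comment.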
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:41 PM: -- I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} The code is here: [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue), and although it's a daemon thread it cannot move forward. I don't know exactly why it happens, but it looks similar to [https://github.com/apache/spark/pull/24796] (although there is no exception here). [~zsxwing], thoughts?
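One nuance in the dump above: a daemon thread parked on a blocking call does not by itself keep a process alive; the hang needs something non-daemon, or something joining the parked thread without interrupting it (as the EventLoop stop path does). A stdlib sketch of that distinction, run in a subprocess so the blocked thread cannot outlive the demo:

```python
import subprocess
import sys
import textwrap

# A thread parked forever on Queue.get() -- but because it is a daemon,
# interpreter shutdown does not wait for it, and the process still exits.
# To reproduce the hang in this issue, the parked thread would have to be
# non-daemon, or be join()ed without an interrupt.
script = textwrap.dedent("""
    import queue, threading
    q = queue.Queue()
    t = threading.Thread(target=q.get, daemon=True)  # parks forever
    t.start()
    print("main exiting")
""")
result = subprocess.run([sys.executable, "-c", script],
                        capture_output=True, text=True, timeout=30)
print(result.stdout.strip())  # main exiting
```

Changing `daemon=True` to `daemon=False` in the sketch makes the subprocess hang until the timeout, which mirrors the parked `dag-scheduler-event-loop` thread being waited on at shutdown.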
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:40 PM: -- I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} Code is here :[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread blocked there (blocking queue) and although its a daemon thread it cannot move forward. Why it happens I dont know exactly but looks similar to [https://github.com/apache/spark/pull/24796], [~zsxwing] thoughts? 
was (Author: skonto):
I think the issue is here:
{quote}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
	at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
Code is [here|https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue) and, although it's a daemon thread, it cannot move forward. I don't know exactly why it happens, but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing], thoughts?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, PySpark
> Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
> Reporter: Edwin Biemond
> Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
> When we run a simple pyspark script on Spark 2.4.3 or 3.0.0, the driver pod hangs
> and never calls the shutdown hook.
> {code:python}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
>     str(spark.sparkContext),
>     spark.sparkContext.defaultParallelism,
>     spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We
> see the output of this python script.
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world>
> parallelism=2 python version=3.6
> {noformat}
> What works:
> * a simple Python script with a print works fine on 2.4.3 and 3.0.0
> * the same setup on 2.4.0
> * 2.4.3 spark-submit with the above pyspark script
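The attached driver_threads.log and executor_threads.log are JVM thread dumps (the kind jstack produces), which is what reveals the parked dag-scheduler-event-loop thread. Purely as an illustration of what such a dump shows — a live daemon thread parked inside a blocking queue take — here is a Python stdlib analogue; the sleep, temp-file handling, and thread setup are ad hoc for the example and not anything from Spark:

```python
# Dump the stacks of all Python threads, jstack-style, and observe a
# daemon thread parked in a blocking queue.get() on an empty queue.
import faulthandler
import os
import queue
import tempfile
import threading
import time

q = queue.Queue()
t = threading.Thread(target=q.get, daemon=True)  # parks forever: queue stays empty
t.start()
time.sleep(0.5)  # give the thread time to park inside q.get()

# faulthandler writes at the file-descriptor level, so it needs a real file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
with open(path) as f:
    dump = f.read()
os.unlink(path)

print(t.is_alive())     # the thread is parked, not finished
print("queue" in dump)  # its stack shows a frame inside queue.get
```

The analogous JVM dump of the driver is what shows the EventLoop thread WAITING in LinkedBlockingDeque.take, as quoted above.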
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ]

Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:40 PM:
--

I think the issue is here:
{quote}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
	at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
The code is here: [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue) and, although it's a daemon thread, it cannot move forward. I don't know exactly why it happens, but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing], thoughts?
was (Author: skonto):
I think the issue is here:
{quote}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
	at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
The code is here: [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue) and, although it's a daemon thread, it cannot move forward. I don't know exactly why it happens, but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing], thoughts?
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ]

Stavros Kontopoulos commented on SPARK-27927:
-

I think the issue is here:
{noformat}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
	at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{noformat}
Code is [here|https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue) and, although it's a daemon thread, it cannot move forward. I don't know exactly why it happens, but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing], thoughts?