Hi Dongjoon,

Thanks for replying and clarifying.
Below are the errors in Spark 3.2 on K8s, which occurred because of a timeout:

io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [ConfigMap] with name: [null] in namespace: [xyz] failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:80)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:103)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)

io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [xyz] failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)

Both of the above errors occur because of a timeout:
Caused by: java.net.SocketTimeoutException: timeout
        at okio.Okio$4.newTimeoutException(Okio.java:232)
        at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)

Setting spark.kubernetes.driver.connectionTimeout and spark.kubernetes.submission.connectionTimeout to higher values made this work. Since spark.network.timeout was already set, I was wondering why this timeout had to be set separately, but your explanation helps me understand things better. As you have suggested, IMO adding a better error message in the case of a K8s timeout would improve user debuggability.

On Tue, Aug 2, 2022 at 3:55 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Hi, Pralabh.
>
> Could you elaborate on your situation more? I'm interested in your needs.
>
> Currently, the default value of spark.network.timeout, 120s, is quite
> a bit bigger than the default value of
> spark.kubernetes.driver.connectionTimeout, 10s. It would be a breaking
> change if we increased `spark.kubernetes.driver.connectionTimeout` to
> `120s` blindly in the next release.
>
> In addition, I don't think it's a good idea to adjust
> `spark.network.timeout` for K8s control plane timeout issues.
> `spark.network.timeout` already has many other side-effects in Spark
> operation itself. I'd recommend having more directional error messages
> to guide those novice users in that situation instead.
>
> Lastly, the most expensive API call is polling the executor status. To
> reduce the overhead on the K8s server side and mitigate the root cause,
> Apache Spark 3.3.0 allows K8s API server-side caching via SPARK-36334.
> You may want to try the following configuration if you have very
> limited control plane resources:
>
> spark.kubernetes.executor.enablePollingWithResourceVersion=true
>
> Dongjoon.
>
> On Mon, Aug 1, 2022 at 7:52 AM Pralabh Kumar <pralabhku...@gmail.com> wrote:
> >
> > Hi Dev team
> >
> > Since spark.network.timeout is the default for all network transactions,
> > shouldn't spark.kubernetes.driver.connectionTimeout and
> > spark.kubernetes.submission.connectionTimeout default to
> > spark.network.timeout?
> >
> > Users migrating from YARN to K8s are familiar with spark.network.timeout,
> > and if a timeout occurs on K8s, they need to explicitly set the above two
> > properties. If those properties defaulted to spark.network.timeout, users
> > wouldn't need to set them explicitly, and things would work with
> > spark.network.timeout alone.
> >
> > Please let me know if my understanding is correct.
> >
> > Regards
> >
> > Pralabh Kumar
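For anyone hitting the same errors, the workaround discussed in this thread can be sketched as a spark-submit invocation. This is a minimal sketch, not a definitive recommendation: the API server address, namespace, jar path, and the 120000 ms value are illustrative assumptions (both connection-timeout properties take milliseconds and default to 10000); only the property names come from the thread.

```shell
# Sketch: raise the fabric8 K8s client connection timeouts that caused
# the SocketTimeoutException above. Values/paths below are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=xyz \
  --conf spark.kubernetes.driver.connectionTimeout=120000 \
  --conf spark.kubernetes.submission.connectionTimeout=120000 \
  local:///path/to/app.jar
```

On Spark 3.3.0+, `--conf spark.kubernetes.executor.enablePollingWithResourceVersion=true` (SPARK-36334, mentioned above) can additionally reduce executor-status polling load on the control plane.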