Hi Dongjoon,

Thanks for replying and clarifying.

Below are the errors in Spark 3.2 on K8s, which occurred because of a
timeout.

io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [ConfigMap] with name: [null] in namespace: [xyz] failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:80)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:103)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)

io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [xyz] failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)

Both of the above errors occur because of a timeout:

Caused by: java.net.SocketTimeoutException: timeout
        at okio.Okio$4.newTimeoutException(Okio.java:232)
        at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)

Setting *spark.kubernetes.driver.connectionTimeout* and
*spark.kubernetes.submission.connectionTimeout* to higher values made this
work. Since spark.network.timeout was already set, I was wondering why this
timeout needs to be set separately, but your explanation helps me
understand things better.
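
For reference, this is what we ended up setting (120000 ms is just the
value that worked for us; both properties take values in milliseconds and
default to 10000):

    spark.kubernetes.driver.connectionTimeout=120000
    spark.kubernetes.submission.connectionTimeout=120000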



As you have suggested, *IMO adding a better error message in the case of a
K8s timeout would improve debuggability for users.*
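
Just to sketch the idea (illustrative code only, not the actual Spark
internals; the K8sErrorHints/withTimeoutHint names are made up), something
like this around the fabric8 create() calls would point users straight at
the fix:

    import java.net.SocketTimeoutException

    import io.fabric8.kubernetes.client.KubernetesClientException

    // Hypothetical helper, NOT actual Spark code: wraps a fabric8 API call
    // and, when it fails with a client-side timeout, rethrows with a hint
    // about the configs that actually govern this timeout.
    object K8sErrorHints {
      def withTimeoutHint[T](operation: String)(call: => T): T = {
        try {
          call
        } catch {
          case e: KubernetesClientException
              if e.getCause.isInstanceOf[SocketTimeoutException] =>
            throw new KubernetesClientException(
              s"$operation timed out waiting for the K8s API server. " +
                "Consider increasing spark.kubernetes.driver.connectionTimeout " +
                "and spark.kubernetes.submission.connectionTimeout (both in " +
                "milliseconds); spark.network.timeout does not apply here.",
              e)
        }
      }
    }

e.g. wrapping the ConfigMap creation in setUpExecutorConfigMap this way
would turn the opaque trace above into an actionable message.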


On Tue, Aug 2, 2022 at 3:55 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> Hi, Pralabh.
>
> Could you elaborate on your situation more? I'm interested in your needs.
>
> Currently, the default value of spark.network.timeout, 120s, is quite a
> bit bigger than the default value of
> spark.kubernetes.driver.connectionTimeout, 10s. It would be a breaking
> change if we increased `spark.kubernetes.driver.connectionTimeout` to
> `120s` blindly in the next release.
>
> In addition, I don't think it's a good idea to adjust
> `spark.network.timeout` for K8s control plane timeout issues.
> `spark.network.timeout` has already many other side-effects in Spark
> operation itself. I'd recommend having more directional error messages
> to guide those novice users in that situation instead.
>
> Lastly, the most expensive API call is polling the executor status. To
> reduce the overhead of K8s server side and mitigate the root cause,
> Apache Spark 3.3.0 allows K8s API server-side caching via SPARK-36334.
> You may want to try the following configuration if you have very
> limited control plane resources.
>
>     spark.kubernetes.executor.enablePollingWithResourceVersion=true
>
> Dongjoon.
>
> On Mon, Aug 1, 2022 at 7:52 AM Pralabh Kumar <pralabhku...@gmail.com>
> wrote:
> >
> > Hi Dev team
> >
> >
> >
> > Since spark.network.timeout is the default for all network transactions,
> > shouldn't spark.kubernetes.driver.connectionTimeout and
> > spark.kubernetes.submission.connectionTimeout default to
> > spark.network.timeout?
> >
> > Users migrating from YARN to K8s are familiar with spark.network.timeout,
> > and if a timeout occurs on K8s, they need to explicitly set the above two
> > properties. If those properties defaulted to spark.network.timeout, users
> > wouldn't need to set them explicitly, and things would just work with
> > spark.network.timeout.
> >
> >
> >
> > Please let me know if my understanding is correct.
> >
> >
> >
> > Regards
> >
> > Pralabh Kumar
>
