I'm running into an issue with my resource staging server that produces a stack trace like the one in Issue 342 <https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't think it's for the same reason. Every time I start up a resource staging server, the first job submitted that uses it fails with a java.net.SocketTimeoutException: timeout, and then every subsequent job runs perfectly, including jobs with different jars and different users. It's only ever the first job that fails, and it always fails.

I know I'm also running into Issue 577 <https://github.com/apache-spark-on-k8s/spark/issues/577>, in that it takes about three minutes before the resource staging server is accessible, but the first job still fails even when I wait over ten minutes, or in one case overnight, before submitting. And I'm just using the examples jar, so it's not a very large jar like the one in Issue 342.
This isn't great for our CI process. Has anyone seen anything like this before, or does anyone know how to increase the timeout if it just takes a while on initial contact? Setting spark.network.timeout has no effect.

[jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1   Running   0   13m
[jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
jhoole-spark-resource-staging-service   NodePort   10.96.143.55   <none>   10000:30622/TCP   13m
[jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
2018-03-27 12:30:13 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-27 12:30:13 INFO UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
2018-03-27 12:30:14 INFO HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls to: jhoole
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls to: jhoole
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls groups to:
2018-03-27 12:30:14 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jhoole); groups with view permissions: Set(); users with modify permissions: Set(jhoole); groups with modify permissions: Set()
Exception in thread "main" java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
    at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
    at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
    at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
    at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:204)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at okio.Okio$2.read(Okio.java:139)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    ... 47 more
2018-03-27 12:30:24 INFO ShutdownHookManager:54 - Shutdown hook called
2018-03-27 12:30:24 INFO ShutdownHookManager:54 - Deleting directory /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099

Thanks,
Jenna