Unfortunately, the other Kubernetes cluster I was using was rebuilt from scratch yesterday, but the resource staging server I have up today has pretty uninteresting logs:
[root@nid00006 ~]# kubectl logs default-spark-resource-staging-server-7669dd57d7-xkvp6
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.rest.k8s.ResourceStagingServer /etc/spark-resource-staging-server/resource-staging-server.properties
2018-03-29 18:44:03 INFO log:192 - Logging initialized @23503ms
2018-03-29 18:44:07 WARN ContextHandler:1444 - o.s.j.s.ServletContextHandler@7a55af6b{/,null,null} contextPath ends with /
2018-03-29 18:44:17 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing view acls to: root
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing modify acls to: root
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-29 18:44:21 INFO SecurityManager:54 - Changing modify acls groups to:
2018-03-29 18:44:21 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-03-29 18:44:22 INFO Server:345 - jetty-9.3.z-SNAPSHOT
2018-03-29 18:44:47 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7a55af6b{/api,null,AVAILABLE}
2018-03-29 18:44:48 INFO AbstractConnector:270 - Started ServerConnector@4f8b4bd0{HTTP/1.1,[http/1.1]}{0.0.0.0:10000}
2018-03-29 18:44:48 INFO Server:403 - Started @68600ms
2018-03-29 18:44:48 INFO ResourceStagingServer:54 - Resource staging server started on port 10000.

-Jenna

On Thu, Mar 29, 2018 at 1:26 PM, Matt Cheah <mch...@palantir.com> wrote:
> Hello Jenna,
>
> Are there any logs from the resource staging server pod? They might show something interesting.
> Unfortunately, we haven't been maintaining the resource staging server because we've moved all of our effort to the main repository instead of the fork. When we consider the submission of local files in the official release, we should probably create a mechanism that's more resilient. Using a single HTTP server isn't ideal; we would ideally like something that's highly available, replicated, etc.
>
> -Matt Cheah
>
> *From:* Jenna Hoole <jenna.ho...@gmail.com>
> *Date:* Thursday, March 29, 2018 at 10:37 AM
> *To:* "user@spark.apache.org" <user@spark.apache.org>
> *Subject:* Re: Spark on K8s resource staging server timeout
>
> I added overkill high timeouts to the OkHttpClient.Builder() in RetrofitClientFactory.scala and I don't seem to be timing out anymore.
>
>     val okHttpClientBuilder = new OkHttpClient.Builder()
>       .dispatcher(dispatcher)
>       .proxy(resolvedProxy)
>       .connectTimeout(120, TimeUnit.SECONDS)
>       .writeTimeout(120, TimeUnit.SECONDS)
>       .readTimeout(120, TimeUnit.SECONDS)
>
> -Jenna
>
> On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com> wrote:
>
> So I'm running into an issue with my resource staging server that's producing a stacktrace like Issue 342 <https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't think for the same reasons. What's happening is that every time after I start up a resource staging server, the first job submitted that uses it fails with a java.net.SocketTimeoutException: timeout, and then every subsequent job runs perfectly, including with different jars and different users. It's only ever the first job that fails, and it always fails. I know I'm also running into Issue 577 <https://github.com/apache-spark-on-k8s/spark/issues/577> in that it takes about three minutes before the resource staging server is accessible, but I'm still failing after waiting over ten minutes, or in one case overnight. And I'm just using the examples jar, so it's not a super large jar like in Issue 342.
>
> This isn't great for our CI process, so has anyone seen anything like this before, or does anyone know how to increase the timeout if it just takes a while on initial contact? Using spark.network.timeout has no effect.
> [jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
> jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1   Running   13m
>
> [jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
> jhoole-spark-resource-staging-service   NodePort   10.96.143.55   <none>   10000:30622/TCP   13m
>
> [jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
> 2018-03-27 12:30:13 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-03-27 12:30:13 INFO UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
> 2018-03-27 12:30:14 INFO HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
> 2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls to: jhoole
> 2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls to: jhoole
> 2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls groups to:
> 2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls groups to:
> 2018-03-27 12:30:14 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jhoole); groups with view permissions: Set(); users with modify permissions: Set(jhoole); groups with modify permissions: Set()
> Exception in thread "main" java.net.SocketTimeoutException: timeout
>     at okio.Okio$4.newTimeoutException(Okio.java:230)
>     at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
>     at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
>     at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
>     at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
>     at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
>     at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
>     at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>     at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>     at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>     at okhttp3.RealCall.execute(RealCall.java:69)
>     at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
>     at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
>     at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
>     at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
>     at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
>     at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
>     at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
>     at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
>     at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.SocketException: Socket closed
>     at java.net.SocketInputStream.read(SocketInputStream.java:204)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at okio.Okio$2.read(Okio.java:139)
>     at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
>     ... 47 more
> 2018-03-27 12:30:24 INFO ShutdownHookManager:54 - Shutdown hook called
> 2018-03-27 12:30:24 INFO ShutdownHookManager:54 - Deleting directory /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099
>
> Thanks,
> Jenna
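[Editor's note] Two details in the thread suggest a CI-side workaround while the root cause is open: the staging server takes about three minutes to become reachable (Issue 577), and the first submission after startup reliably times out. A pipeline could poll the staging server URI until it answers, then wrap the submit in a short retry loop to absorb that first failure. The sketch below is not from the Spark codebase; it assumes only a POSIX shell and curl, and the spark-submit invocation it would wrap is the one shown earlier in the thread.

```shell
#!/bin/sh
# Sketch of a CI gate for the resource staging server. Nothing here is
# part of Spark itself; names and timeouts are illustrative.

# Poll a URI until something answers. curl exits non-zero while the
# connection is refused; any HTTP response (even a 404) exits 0, which
# is enough to know the server is finally listening.
wait_for_server() {
  uri=$1; timeout_secs=${2:-600}; waited=0
  until curl -s -o /dev/null --max-time 5 "$uri"; do
    [ "$waited" -ge "$timeout_secs" ] && {
      echo "gave up on $uri after ${timeout_secs}s" >&2; return 1; }
    sleep 5; waited=$((waited + 5))
  done
}

# Run a command up to N times, absorbing the deterministic first failure
# described in the thread.
retry_submit() {
  max_attempts=$1; shift; attempt=1
  while ! "$@"; do
    [ "$attempt" -ge "$max_attempts" ] && {
      echo "all $max_attempts attempts failed" >&2; return 1; }
    attempt=$((attempt + 1)); sleep 5
  done
}

# Hypothetical CI usage with the NodePort URI from the thread:
# wait_for_server http://192.168.0.1:30622 600 &&
#   retry_submit 2 bin/spark-submit --class org.apache.spark.examples.SparkPi \
#     --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 \
#     ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
```

This only papers over the symptom; bumping the OkHttp timeouts in RetrofitClientFactory.scala, as Jenna did above, addresses the client side directly.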