Unfortunately the other Kubernetes cluster I was using was rebuilt from
scratch yesterday, but the resource staging server (RSS) I have up today
has pretty uninteresting logs.

[root@nid00006 ~]# kubectl logs default-spark-resource-staging-server-7669dd57d7-xkvp6

++ id -u

+ myuid=0

++ id -g

+ mygid=0

++ getent passwd 0

+ uidentry=root:x:0:0:root:/root:/bin/ash

+ '[' -z root:x:0:0:root:/root:/bin/ash ']'

+ /sbin/tini -s -- /opt/spark/bin/spark-class
org.apache.spark.deploy.rest.k8s.ResourceStagingServer
/etc/spark-resource-staging-server/resource-staging-server.properties

2018-03-29 18:44:03 INFO  log:192 - Logging initialized @23503ms

2018-03-29 18:44:07 WARN  ContextHandler:1444 -
o.s.j.s.ServletContextHandler@7a55af6b{/,null,null} contextPath ends with /

2018-03-29 18:44:17 WARN  NativeCodeLoader:62 - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable

2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing view acls to: root

2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing modify acls to: root

2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing view acls groups to:


2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing modify acls groups
to:

2018-03-29 18:44:21 INFO  SecurityManager:54 - SecurityManager:
authentication disabled; ui acls disabled; users  with view permissions:
Set(root); groups with view permissions: Set(); users  with modify
permissions: Set(root); groups with modify permissions: Set()

2018-03-29 18:44:22 INFO  Server:345 - jetty-9.3.z-SNAPSHOT

2018-03-29 18:44:47 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@7a55af6b{/api,null,AVAILABLE}

2018-03-29 18:44:48 INFO  AbstractConnector:270 - Started
ServerConnector@4f8b4bd0{HTTP/1.1,[http/1.1]}{0.0.0.0:10000}

2018-03-29 18:44:48 INFO  Server:403 - Started @68600ms

2018-03-29 18:44:48 INFO  ResourceStagingServer:54 - Resource staging
server started on port 10000.


-Jenna

On Thu, Mar 29, 2018 at 1:26 PM, Matt Cheah <mch...@palantir.com> wrote:

> Hello Jenna,
>
>
>
> Are there any logs from the resource staging server pod? They might show
> something interesting.
>
>
>
> Unfortunately, we haven’t been maintaining the resource staging server
> because we’ve moved all of our effort to the main repository instead of the
> fork. When we consider the submission of local files in the official
> release we should probably create a mechanism that’s more resilient. Using
> a single HTTP server isn’t ideal – would ideally like something that’s
> highly available, replicated, etc.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Jenna Hoole <jenna.ho...@gmail.com>
> *Date: *Thursday, March 29, 2018 at 10:37 AM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: Spark on K8s resource staging server timeout
>
>
>
> I added overkill high timeouts to the OkHttpClient.Builder() in
> RetrofitClientFactory.scala and I don't seem to be timing out anymore.
>
>
>
> val okHttpClientBuilder = new OkHttpClient.Builder()
>   .dispatcher(dispatcher)
>   .proxy(resolvedProxy)
>   .connectTimeout(120, TimeUnit.SECONDS)
>   .writeTimeout(120, TimeUnit.SECONDS)
>   .readTimeout(120, TimeUnit.SECONDS)
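>
> (For reference, the same timeout settings in a self-contained sketch; the
> dispatcher and proxy wiring that RetrofitClientFactory.scala sets up is
> left out here, so this only illustrates the added timeouts:)
>
> import java.util.concurrent.TimeUnit
> import okhttp3.OkHttpClient
>
> // Standalone sketch: build an OkHttp client with the same 120-second
> // connect/write/read timeouts. Dispatcher and proxy setup omitted.
> val patchedClient: OkHttpClient = new OkHttpClient.Builder()
>   .connectTimeout(120, TimeUnit.SECONDS)
>   .writeTimeout(120, TimeUnit.SECONDS)
>   .readTimeout(120, TimeUnit.SECONDS)
>   .build()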
>
>
>
> -Jenna
>
>
>
> On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com>
> wrote:
>
> So I'm running into an issue with my resource staging server that's
> producing a stack trace like Issue 342
> (https://github.com/apache-spark-on-k8s/spark/issues/342), but I don't
> think for the same reasons. What's happening is that every time after I
> start up a resource staging server, the first job submitted that uses it
> will fail with a java.net.SocketTimeoutException: timeout, and then every
> subsequent job will run perfectly, including with different jars and
> different users. It's only ever the first job that fails, and it always
> fails. I know I'm also running into Issue 577
> (https://github.com/apache-spark-on-k8s/spark/issues/577) in that it
> takes about three minutes before the resource staging server is
> accessible, but I'm still failing after waiting over ten minutes, or in
> one case overnight. And I'm just using the examples jar, so it's not a
> super large jar like in Issue 342.
>
>
>
> This isn't great for our CI process, so has anyone seen anything like this
> before or know how to increase the timeout if it just takes a while on
> initial contact? Using spark.network.timeout has no effect.
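>
> (One possible stopgap for CI would be to poll the staging server port and
> only run spark-submit once it answers; a rough sketch, assuming only that
> the NodePort responds to a plain HTTP request once the server is up:)
>
> import java.net.{HttpURLConnection, URL}
>
> // Rough sketch: keep trying the resource staging server URI until any
> // HTTP response comes back, or give up after maxWaitMs. The URI and the
> // timings are illustrative, not taken from this thread.
> def waitForStagingServer(uri: String, maxWaitMs: Long = 10L * 60 * 1000): Boolean = {
>   val deadline = System.currentTimeMillis() + maxWaitMs
>   while (System.currentTimeMillis() < deadline) {
>     try {
>       val conn = new URL(uri).openConnection().asInstanceOf[HttpURLConnection]
>       conn.setConnectTimeout(5000)
>       conn.setReadTimeout(5000)
>       conn.getResponseCode  // any status code means the server is listening
>       conn.disconnect()
>       return true
>     } catch {
>       case _: java.io.IOException => Thread.sleep(5000)  // not reachable yet; retry
>     }
>   }
>   false
> }
>
> // e.g. waitForStagingServer("http://192.168.0.1:30622") before calling spark-submit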
>
>
>
> [jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
>
> jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1       Running               13m
>
> [jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
>
> jhoole-spark-resource-staging-service               NodePort    10.96.143.55   <none>        10000:30622/TCP     13m
>
> [jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
>
> 2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
>
> 2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for
> user jhoole@local using keytab file /security/secrets/jhoole.keytab
>
> 2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf
> directory: /etc/hadoop/conf
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to:
> jhoole
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to:
> jhoole
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups
> to:
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls
> groups to:
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager:
> authentication disabled; ui acls disabled; users  with view permissions:
> Set(jhoole); groups with view permissions: Set(); users  with modify
> permissions: Set(jhoole); groups with modify permissions: Set()
>
> Exception in thread "main" java.net.SocketTimeoutException: timeout
>
> at okio.Okio$4.newTimeoutException(Okio.java:230)
> at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
> at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
> at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
> at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
> at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
> at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
> at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
> at okhttp3.RealCall.execute(RealCall.java:69)
> at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
> at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
> at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
> at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
> at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
> at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
> at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
> at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
> at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
> at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
> at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
> at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
> at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
> at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.net.SocketException: Socket closed
>
> at java.net.SocketInputStream.read(SocketInputStream.java:204)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at okio.Okio$2.read(Okio.java:139)
> at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
> ... 47 more
>
> 2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Shutdown hook called
>
> 2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Deleting directory
> /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099
>
>
>
> Thanks,
>
> Jenna
>
>
>
