I'm running into an issue with my resource staging server that produces a
stack trace like the one in Issue 342
<https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't
think it has the same cause. Every time I start up a resource staging
server, the first job submitted that uses it fails with a
java.net.SocketTimeoutException: timeout, and every subsequent job runs
perfectly, even with different jars and different users. It's only ever
the first job that fails, and it always fails. I know I'm also running into
Issue 577 <https://github.com/apache-spark-on-k8s/spark/issues/577>, in
that it takes about three minutes before the resource staging server is
accessible, but the first job still fails even after waiting over ten
minutes or, in one case, overnight. I'm also just using the examples jar,
so it's not a very large jar like the one in Issue 342.

This isn't great for our CI process, so has anyone seen anything like this
before, or does anyone know how to increase the timeout if it just takes a
while on initial contact? Setting spark.network.timeout has no effect.
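
Since only the first submission after the server starts fails and later
ones succeed, one possible CI stopgap is to retry the submit command once.
The following is an untested sketch, not a fix for the underlying timeout.
The helper name submit_with_retry and its parameters are hypothetical, and
it assumes spark-submit exits with a non-zero status when the upload times
out.

```python
import subprocess
import time
from typing import Callable, Sequence


def submit_with_retry(
    cmd: Sequence[str],
    attempts: int = 2,
    delay_seconds: float = 5.0,
    run: Callable[[Sequence[str]], int] = lambda c: subprocess.run(list(c)).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0 or attempts run out; return the last exit code.

    Hypothetical helper: `run` and `sleep` are injectable so the retry
    logic can be exercised without launching a real spark-submit.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 1
    for attempt in range(1, attempts + 1):
        code = run(cmd)
        if code == 0:
            return 0
        # Pause before the next try, but not after the final failure.
        if attempt < attempts:
            sleep(delay_seconds)
    return code
```

In CI this would wrap the same bin/spark-submit invocation shown below,
passed as a list of arguments, and the job would fail only if the returned
exit code is non-zero after the final attempt.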

[jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1       Running   0          13m

[jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
jhoole-spark-resource-staging-service               NodePort    10.96.143.55   <none>        10000:30622/TCP     13m

[jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar

2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to: jhoole
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to: jhoole
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups to:
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls groups to:
2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jhoole); groups with view permissions: Set(); users  with modify permissions: Set(jhoole); groups with modify permissions: Set()

Exception in thread "main" java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
    at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
    at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
    at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
    at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:204)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at okio.Okio$2.read(Okio.java:139)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    ... 47 more

2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Deleting directory /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099

Thanks,
Jenna
