Re: Spark on K8s resource staging server timeout

2018-03-29 Thread Jenna Hoole
Unfortunately the other Kubernetes cluster I was using was rebuilt from
scratch yesterday, but the RSS I have up today has pretty uninteresting
logs.

[root@nid6 ~]# kubectl logs
default-spark-resource-staging-server-7669dd57d7-xkvp6

++ id -u

+ myuid=0

++ id -g

+ mygid=0

++ getent passwd 0

+ uidentry=root:x:0:0:root:/root:/bin/ash

+ '[' -z root:x:0:0:root:/root:/bin/ash ']'

+ /sbin/tini -s -- /opt/spark/bin/spark-class
org.apache.spark.deploy.rest.k8s.ResourceStagingServer
/etc/spark-resource-staging-server/resource-staging-server.properties

2018-03-29 18:44:03 INFO  log:192 - Logging initialized @23503ms

2018-03-29 18:44:07 WARN  ContextHandler:1444 -
o.s.j.s.ServletContextHandler@7a55af6b{/,null,null} contextPath ends with /

2018-03-29 18:44:17 WARN  NativeCodeLoader:62 - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable

2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing view acls to: root

2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing modify acls to: root

2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing view acls groups to:


2018-03-29 18:44:21 INFO  SecurityManager:54 - Changing modify acls groups
to:

2018-03-29 18:44:21 INFO  SecurityManager:54 - SecurityManager:
authentication disabled; ui acls disabled; users  with view permissions:
Set(root); groups with view permissions: Set(); users  with modify
permissions: Set(root); groups with modify permissions: Set()

2018-03-29 18:44:22 INFO  Server:345 - jetty-9.3.z-SNAPSHOT

2018-03-29 18:44:47 INFO  ContextHandler:781 - Started
o.s.j.s.ServletContextHandler@7a55af6b{/api,null,AVAILABLE}

2018-03-29 18:44:48 INFO  AbstractConnector:270 - Started
ServerConnector@4f8b4bd0{HTTP/1.1,[http/1.1]}{0.0.0.0:1}

2018-03-29 18:44:48 INFO  Server:403 - Started @68600ms

2018-03-29 18:44:48 INFO  ResourceStagingServer:54 - Resource staging
server started on port 1.


-Jenna

On Thu, Mar 29, 2018 at 1:26 PM, Matt Cheah <mch...@palantir.com> wrote:

> Hello Jenna,
>
>
>
> Are there any logs from the resource staging server pod? They might show
> something interesting.
>
>
>
> Unfortunately, we haven’t been maintaining the resource staging server
> because we’ve moved all of our effort to the main repository instead of the
> fork. When we consider the submission of local files in the official
> release we should probably create a mechanism that’s more resilient. Using
> a single HTTP server isn’t ideal – we’d want something that’s highly
> available, replicated, etc.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Jenna Hoole <jenna.ho...@gmail.com>
> *Date: *Thursday, March 29, 2018 at 10:37 AM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: Spark on K8s resource staging server timeout
>
>
>
> I added overkill high timeouts to the OkHttpClient.Builder() in
> RetrofitClientFactory.scala and I don't seem to be timing out anymore.
>
>
>
> val okHttpClientBuilder = new OkHttpClient.Builder()
>   .dispatcher(dispatcher)
>   .proxy(resolvedProxy)
>   .connectTimeout(120, TimeUnit.SECONDS)
>   .writeTimeout(120, TimeUnit.SECONDS)
>   .readTimeout(120, TimeUnit.SECONDS)
>
>
>
> -Jenna
>
>
>
> On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com>
> wrote:
>
> So I'm running into an issue with my resource staging server that's
> producing a stacktrace like Issue 342
> <https://github.com/apache-spark-on-k8s/spark/issues/342>,
> but I don't think for the same reasons. What's happening is that every time
> after I start up a resource staging server, the first job submitted that
> uses it will fail with a java.net [java.net]
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__java.net=DwMFaQ=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs=GgV-jRNHAD0wSF_AnyITsBbfvbkaMumFMb_8JXHvzeY=qfNJryW4ENvnUZQ1ZB4J7q-OA5TAY9S7-dVeh1sT8qs=>
> .SocketTimeoutException: timeout, and then every subsequent job will run
> perfectly. Including with different jars and different users. It's only
> ever the first job that fails and it always fails. I know I'm also running
> into Issue 577
> <https://github.com/apache-spark-on-k8s/spark/issues/577>

Re: Spark on K8s resource staging server timeout

2018-03-29 Thread Jenna Hoole
I added overkill high timeouts to the OkHttpClient.Builder() in
RetrofitClientFactory.scala and I don't seem to be timing out anymore.

val okHttpClientBuilder = new OkHttpClient.Builder()
  .dispatcher(dispatcher)
  .proxy(resolvedProxy)
  .connectTimeout(120, TimeUnit.SECONDS)
  .writeTimeout(120, TimeUnit.SECONDS)
  .readTimeout(120, TimeUnit.SECONDS)

-Jenna

On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com> wrote:

> So I'm running into an issue with my resource staging server that's
> producing a stacktrace like Issue 342
> <https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't
> think for the same reasons. What's happening is that every time after I
> start up a resource staging server, the first job submitted that uses it
> will fail with a java.net.SocketTimeoutException: timeout, and then every
> subsequent job will run perfectly. Including with different jars and
> different users. It's only ever the first job that fails and it always
> fails. I know I'm also running into Issue 577
> <https://github.com/apache-spark-on-k8s/spark/issues/577> in that it
> takes about three minutes before the resource staging server is accessible,
> but I'm still failing waiting over ten minutes or in one case overnight.
> And I'm just using the examples jar, so it's not a super large jar like in
> Issue 342.
>
> This isn't great for our CI process, so has anyone seen anything like this
> before or know how to increase the timeout if it just takes a while on
> initial contact? Using spark.network.timeout has no effect.
>
> [jhoole@nid6 spark]$ kubectl get pods | grep jhoole-spark
>
> jhoole-spark-resource-staging-server-6475c8-w5cdm   1/1   Running
> 0  13m
>
> [jhoole@nid6 spark]$ kubectl get svc | grep jhoole-spark
>
> jhoole-spark-resource-staging-service   NodePort10.96.143.55
>   1:30622/TCP 13m
>
> [jhoole@nid6 spark]$ bin/spark-submit --class
> org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf
> spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622
> ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
>
> 2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
>
> 2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for
> user jhoole@local using keytab file /security/secrets/jhoole.keytab
>
> 2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf
> directory: /etc/hadoop/conf
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to:
> jhoole
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to:
> jhoole
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups
> to:
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls
> groups to:
>
> 2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager:
> authentication disabled; ui acls disabled; users  with view permissions:
> Set(jhoole); groups with view permissions: Set(); users  with modify
> permissions: Set(jhoole); groups with modify permissions: Set()
>
> Exception in thread "main" java.net.SocketTimeoutException: timeout
>
> at okio.Okio$4.newTimeoutException(Okio.java:230)
>
> at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
>
> at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
>
> at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
>
> at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
>
> at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
>
> at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
> at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
> at okhttp3.internal.http.RealInterceptorChain

Spark on K8s resource staging server timeout

2018-03-27 Thread Jenna Hoole
So I'm running into an issue with my resource staging server that's
producing a stacktrace like Issue 342
<https://github.com/apache-spark-on-k8s/spark/issues/342>, but I don't
think for the same reasons. What's happening is that every time after I
start up a resource staging server, the first job submitted that uses it
will fail with a java.net.SocketTimeoutException: timeout, and then every
subsequent job will run perfectly. Including with different jars and
different users. It's only ever the first job that fails and it always
fails. I know I'm also running into Issue 577
<https://github.com/apache-spark-on-k8s/spark/issues/577> in that it takes
about three minutes before the resource staging server is accessible, but
I'm still failing waiting over ten minutes or in one case overnight. And
I'm just using the examples jar, so it's not a super large jar like in
Issue 342.

This isn't great for our CI process, so has anyone seen anything like this
before or know how to increase the timeout if it just takes a while on
initial contact? Using spark.network.timeout has no effect.
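One CI-side workaround sketch (this is not something proposed in the thread; it assumes the timeout only bites on first contact, and reuses the service URL shown below) is to poll the staging server until it answers before submitting the first job:

```shell
# Sketch: block until the resource staging server responds at all, so the
# first real submission doesn't race its slow startup. The URL and the
# polling intervals are assumptions, not values confirmed in the thread.
STAGING_URI=http://192.168.0.1:30622
# curl exits 0 on any HTTP response, non-zero on refused/timed-out connects.
until curl -s -o /dev/null --max-time 5 "$STAGING_URI"; do
  echo "waiting for resource staging server..."
  sleep 10
done
bin/spark-submit --conf spark.kubernetes.resourceStagingServer.uri="$STAGING_URI" ...
```

This only papers over the slow first contact; the source-level fix Jenna describes above (raising the OkHttp timeouts) addresses it more directly.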

[jhoole@nid6 spark]$ kubectl get pods | grep jhoole-spark

jhoole-spark-resource-staging-server-6475c8-w5cdm   1/1   Running
0  13m

[jhoole@nid6 spark]$ kubectl get svc | grep jhoole-spark

jhoole-spark-resource-staging-service   NodePort10.96.143.55
  1:30622/TCP 13m

[jhoole@nid6 spark]$ bin/spark-submit --class
org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf
spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622
./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar

2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable

2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for
user jhoole@local using keytab file /security/secrets/jhoole.keytab

2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf
directory: /etc/hadoop/conf

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to: jhoole

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to:
jhoole

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups to:


2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls groups
to:

2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager:
authentication disabled; ui acls disabled; users  with view permissions:
Set(jhoole); groups with view permissions: Set(); users  with modify
permissions: Set(jhoole); groups with modify permissions: Set()

Exception in thread "main" java.net.SocketTimeoutException: timeout

at okio.Okio$4.newTimeoutException(Okio.java:230)
at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
at okhttp3.RealCall.execute(RealCall.java:69)
at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)

Re: Spark on K8s - using files fetched by init-container?

2018-02-26 Thread Jenna Hoole
Oh, duh. I completely forgot that file:// is a prefix I can use. Up and
running now :)

Thank you so much!
Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li <liyinan...@gmail.com> wrote:

> OK, it looks like you will need to use 
> `file:///var/spark-data/spark-files/flights.csv`
> instead. The 'file://' scheme must be explicitly used as it seems it
> defaults to 'hdfs' in your setup.
>
> On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole <jenna.ho...@gmail.com>
> wrote:
>
>> Thank you for the quick response! However, I'm still having problems.
>>
>> When I try to look for /var/spark-data/spark-files/flights.csv I get
>> told:
>>
>> Error: Error in loadDF : analysis error - Path does not exist: hdfs://
>> 192.168.0.1:8020/var/spark-data/spark-files/flights.csv;
>>
>> Execution halted
>>
>> Exception in thread "main" org.apache.spark.SparkUserAppException: User
>> application exited with 1
>>
>> at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
>>
>> at org.apache.spark.deploy.RRunner.main(RRunner.scala)
>>
>> And when I try to look for local:///var/spark-data/spark-files/flights.csv,
>> I get:
>>
>> Error in file(file, "rt") : cannot open the connection
>>
>> Calls: read.csv -> read.table -> file
>>
>> In addition: Warning message:
>>
>> In file(file, "rt") :
>>
>>   cannot open file 'local:///var/spark-data/spark-files/flights.csv': No
>> such file or directory
>>
>> Execution halted
>>
>> Exception in thread "main" org.apache.spark.SparkUserAppException: User
>> application exited with 1
>>
>> at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
>>
>> at org.apache.spark.deploy.RRunner.main(RRunner.scala)
>>
>> I can see from a kubectl describe that the directory is getting mounted.
>>
>> Mounts:
>>
>>   /etc/hadoop/conf from hadoop-properties (rw)
>>
>>   /var/run/secrets/kubernetes.io/serviceaccount from
>> spark-token-pxz79 (ro)
>>
>>   /var/spark-data/spark-files from download-files (rw)
>>
>>   /var/spark-data/spark-jars from download-jars-volume (rw)
>>
>>   /var/spark/tmp from spark-local-dir-0-tmp (rw)
>>
>> Is there something else I need to be doing in my set up?
>>
>> Thanks,
>> Jenna
>>
>> On Mon, Feb 26, 2018 at 12:02 PM, Yinan Li <liyinan...@gmail.com> wrote:
>>
>>> The files specified through --files are localized by the init-container
>>> to /var/spark-data/spark-files by default. So in your case, the file should
>>> be located at /var/spark-data/spark-files/flights.csv locally in the
>>> container.
>>>
>>> On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole <jenna.ho...@gmail.com>
>>> wrote:
>>>
>>>> This is probably stupid user error, but I can't for the life of me
>>>> figure out how to access the files that are staged by the init-container.
>>>>
>>>> I'm trying to run the SparkR example data-manipulation.R which requires
>>>> the path to its datafile. I supply the hdfs location via --files and then
>>>> the full hdfs path.
>>>>
>>>>
>>>> --files hdfs://192.168.0.1:8020/user/jhoole/flights.csv
>>>> local:///opt/spark/examples/src/main/r/data-manipulation.R hdfs://
>>>> 192.168.0.1:8020/user/jhoole/flights.csv
>>>>
>>>> The init-container seems to load my file.
>>>>
>>>> 18/02/26 18:29:09 INFO spark.SparkContext: Added file hdfs://
>>>> 192.168.0.1:8020/user/jhoole/flights.csv at hdfs://
>>>> 192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519
>>>>
>>>> 18/02/26 18:29:09 INFO util.Utils: Fetching hdfs://
>>>> 192.168.0.1:8020/user/jhoole/flights.csv to
>>>> /var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/us
>>>> erFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp78
>>>> 72615076522023165.tmp
>>>>
>>>> However, I get an error that my file does not exist.
>>>>
>>>> Error in file(file, "rt") : cannot open the connection
>>>>
>>>> Calls: read.csv -> read.table -> file
>>>>
>>>> In addition: Warning message:
>>>>
>>>> In file(file, "rt") :
>>>>
>>>>   cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv':
>>>> No such file or directory

Spark on K8s - using files fetched by init-container?

2018-02-26 Thread Jenna Hoole
This is probably stupid user error, but I can't for the life of me figure
out how to access the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R which requires the
path to its datafile. I supply the hdfs location via --files and then the
full hdfs path.


--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv
local:///opt/spark/examples/src/main/r/data-manipulation.R hdfs://
192.168.0.1:8020/user/jhoole/flights.csv

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file hdfs://
192.168.0.1:8020/user/jhoole/flights.csv at hdfs://
192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519

18/02/26 18:29:09 INFO util.Utils: Fetching hdfs://
192.168.0.1:8020/user/jhoole/flights.csv to
/var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv': No
such file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv
local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv

Error: Error in loadDF : analysis error - Path does not exist: hdfs://
192.168.0.1:8020/user/root/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If the path /user/root/flights.csv does exist and I only supply
"flights.csv" as the file path, it runs to completion successfully.
However, if I provide the file path as "hdfs://
192.168.0.1:8020/user/root/flights.csv," I get the same "No such file or
directory" error as I do initially.

Since I obviously can't put all my hdfs files under /user/root, how do I
get it to use the file that the init-container is fetching?

Thanks,
Jenna
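Per the replies archived above, the resolution was an explicit file:// scheme on the application argument. A sketch of the corrected invocation, with the paths as given in the thread:

```shell
# Sketch of the working invocation per the thread's resolution:
# --files still points at HDFS, but the application argument references the
# init-container's download directory with an explicit file:// scheme, since
# an unprefixed path is resolved against the default (hdfs) filesystem.
bin/spark-submit \
  --files hdfs://192.168.0.1:8020/user/jhoole/flights.csv \
  local:///opt/spark/examples/src/main/r/data-manipulation.R \
  file:///var/spark-data/spark-files/flights.csv
```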


Spark on K8s with Romana

2018-02-12 Thread Jenna Hoole
So, we've run into something interesting. In our case, we've got some
proprietary networking HW which is very feature limited in the TCP/IP
space, so using Romana, executors can't seem to find the driver using the
hostname lookup method it's attempting. Is there any way to make it use IP?

Thanks,
Jenna
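No answer is archived for this one. One avenue (an assumption based on stock Spark configuration keys, not something confirmed to work with Romana in this thread) is to bind and advertise the driver by IP rather than hostname:

```shell
# Sketch, not a confirmed fix: advertise the driver to executors by IP so
# they never perform the failing hostname lookup.
# spark.driver.bindAddress and spark.driver.host are standard Spark confs;
# the address shown is a placeholder for the driver pod's IP.
DRIVER_IP=192.168.0.1
bin/spark-submit \
  --conf spark.driver.bindAddress="$DRIVER_IP" \
  --conf spark.driver.host="$DRIVER_IP" \
  ...
```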