Re: Spark on K8s - using files fetched by init-container?

2018-02-27 Thread Felix Cheung
Yes, you were pointing to HDFS on a private cluster address...
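
For anyone hitting the same thing: paths without a scheme are resolved against 
the default filesystem from the Hadoop config mounted into the container. A 
quick way to confirm what that is (a sketch assuming the stock Hadoop CLI and 
the /etc/hadoop/conf mount shown later in this thread):

    hdfs getconf -confKey fs.defaultFS
    # or inspect the mounted config directly:
    grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml

If that prints hdfs://192.168.0.1:8020, a bare path like 
/var/spark-data/spark-files/flights.csv gets resolved against HDFS, which is 
exactly the error below.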


From: Jenna Hoole <jenna.ho...@gmail.com>
Sent: Monday, February 26, 2018 1:11:35 PM
To: Yinan Li; user@spark.apache.org
Subject: Re: Spark on K8s - using files fetched by init-container?

Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running 
now :)

Thank you so much!
Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li <liyinan...@gmail.com> wrote:
OK, it looks like you will need to use 
`file:///var/spark-data/spark-files/flights.csv` instead. The 'file://' scheme 
must be used explicitly, since paths without a scheme seem to default to 
'hdfs' in your setup.
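
In the R script that would look something like the following (an untested 
sketch; read.df and its csv options are the standard SparkR 2.x API, not taken 
from data-manipulation.R):

    library(SparkR)
    sparkR.session(appName = "flights")
    # The explicit file:// scheme keeps Spark from resolving the path
    # against the default (HDFS) filesystem.
    flights <- read.df("file:///var/spark-data/spark-files/flights.csv",
                       source = "csv", header = "true", inferSchema = "true")
    head(flights)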

On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole <jenna.ho...@gmail.com> wrote:
Thank you for the quick response! However, I'm still having problems.

When I try to look for /var/spark-data/spark-files/flights.csv I get told:

Error: Error in loadDF : analysis error - Path does not exist: 
hdfs://192.168.0.1:8020/var/spark-data/spark-files/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

And when I try to look for local:///var/spark-data/spark-files/flights.csv, I 
get:

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'local:///var/spark-data/spark-files/flights.csv': No such 
file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

I can see from a kubectl describe that the directory is getting mounted.

Mounts:

  /etc/hadoop/conf from hadoop-properties (rw)

  /var/run/secrets/kubernetes.io/serviceaccount from spark-token-pxz79 (ro)

  /var/spark-data/spark-files from download-files (rw)

  /var/spark-data/spark-jars from download-jars-volume (rw)

  /var/spark/tmp from spark-local-dir-0-tmp (rw)
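
For reference, I pulled that mount list with something like:

    kubectl get pods                        # find the driver pod name
    kubectl describe pod <driver-pod-name>  # see the "Mounts:" section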

Is there something else I need to be doing in my setup?

Thanks,
Jenna

On Mon, Feb 26, 2018 at 12:02 PM, Yinan Li <liyinan...@gmail.com> wrote:
The files specified through --files are localized by the init-container to 
/var/spark-data/spark-files by default. So in your case, the file should be 
located at /var/spark-data/spark-files/flights.csv locally in the container.
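
Since at that point it is an ordinary local file inside the container, plain 
(non-Spark) R I/O should also be able to open it by that path, e.g. (an 
untested sketch):

    # Base R read of the localized copy; no file:// scheme needed here
    # because base R does not go through Hadoop path resolution.
    flights <- read.csv("/var/spark-data/spark-files/flights.csv",
                        header = TRUE, stringsAsFactors = FALSE)
    str(flights)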

On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole <jenna.ho...@gmail.com> wrote:
This is probably stupid user error, but I can't for the life of me figure out 
how to access the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R, which requires the 
path to its data file. I supply the HDFS location via --files and then pass 
the full HDFS path as the script argument.


--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv 
local:///opt/spark/examples/src/main/r/data-manipulation.R 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv
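
(For context, the full submit command is roughly the following sketch; the 
master URL and namespace are placeholders for my cluster's values, and I've 
left out the image-related --conf flags since those differ between 
Spark-on-K8s builds:)

    bin/spark-submit \
      --master k8s://https://<api-server>:<port> \
      --deploy-mode cluster \
      --conf spark.kubernetes.namespace=<namespace> \
      --files hdfs://192.168.0.1:8020/user/jhoole/flights.csv \
      local:///opt/spark/examples/src/main/r/data-manipulation.R \
      hdfs://192.168.0.1:8020/user/jhoole/flights.csv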

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv at 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519

18/02/26 18:29:09 INFO util.Utils: Fetching 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv to 
/var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv': No such 
file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error:

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv 
local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv

Error: Error in loadDF : analysis error - Path does not exist: 
hdfs://192.168.0.1:8020/user/root/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If the path /user/root/flights.csv does exist and I supply only "flights.csv" 
as the file path, it runs to completion successfully. However, if I provide 
the file path as "hdfs://192.168.0.1:8020/user/root/flights.csv", I get the 
same "No such file or directory" error as I do initially.

Since I obviously can't put all my HDFS files under /user/root, how do I get 
it to use the file that the init-container is fetching?

Thanks,
Jenna
