[
https://issues.apache.org/jira/browse/SPARK-47228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Golovizin updated SPARK-47228:
-------------------------------------
Description:
Hi folks! And thank you for your brilliant project.
I've been using PySpark for a couple of years, and last week I encountered a
problem that I don't know how to solve.
{{I'm using a Spark session in client mode (with .master("yarn")), and I have a
file (a JKS certificate) that I have to copy to all executor nodes of my Spark
cluster (it's needed by another library that uses Spark).}}
{{I have read some docs and found out that I can use
SparkSession.builder.config("spark.files", "/path/to/my/precious.file") or
SparkContext.addFile("/path/to/my/precious.file") – and after that call
SparkFiles.get("precious.file") – and it will return a path to the copied file
that can be used on executors.}}
Alas, SparkFiles.get("precious.file") gives me a path like
{{'/tmp/spark-d64b1d73-3745-4362-adb1-586a6228e7a5/userFiles-2f5a6735-f297-489b-b6e1-6ff9b0c3a214/precious.file'}}
and a directory with this file appears on my driver node – but does not appear
on the executor nodes.
How can it be fixed? Isn't this a bug in Spark?
P.S. I've tried putting the file on HDFS, but it gives the same result: the
file appears on the driver node but not on the executors.
P.P.S. Here's the code with which I was trying to copy files to executors:
{{from pyspark.sql import SparkSession}}
{{from pyspark import SparkFiles}}
{{spark = SparkSession.builder.enableHiveSupport() \}}
{{    .config("spark.files", "hdfs:///tmp/dummy.txt") \}}
{{    .master("yarn").getOrCreate()}}
{{# I have uploaded a file "dummy.txt" to HDFS (to hdfs:///tmp/dummy.txt)}}
{{# I want this file from HDFS to be copied to all executors.}}
{{# spark.sparkContext.addFile("hdfs:///tmp/dummy.txt")}}
{{# Let's look where it was uploaded...}}
{{path_on_executors = SparkFiles.get("dummy.txt")}}
{{print(path_on_executors)}}
{{# > '/tmp/spark-6e61ed03-6866-4863-adb6-b7345836caf3/userFiles-28d28f93-e4df-44c3-a1e5-7411d02d4cf5/dummy.txt'}}
{{# This path was created on my driver node but DOES NOT EXIST on my executors.}}
{{# What can I do to fix it?}}
{{# (Quitting programming, moving to Somalia and becoming a pirate is not an option yet).}}
> spark.files are not copied to Spark Executors in client session mode
> --------------------------------------------------------------------
>
> Key: SPARK-47228
> URL: https://issues.apache.org/jira/browse/SPARK-47228
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 2.4.5, 3.3.4
> Environment: Ubuntu 22.04, Python 3.7
> Quickstart VM with CDH 6.3.2
> Reporter: Alexey Golovizin
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)