Alexey Golovizin created SPARK-47228:
----------------------------------------

             Summary: spark.files are not copied to Spark Executors in client session mode
                 Key: SPARK-47228
                 URL: https://issues.apache.org/jira/browse/SPARK-47228
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 3.3.4, 2.4.5
         Environment: Ubuntu 22.04, Python 3.7, Quickstart VM with CDH 6.3.2
            Reporter: Alexey Golovizin


Hi folks! And thank you for your brilliant project.

I've been using PySpark for a couple of years, and last week I ran into a 
problem that I don't know how to solve.

I'm using a Spark session in client mode (with .master("yarn")), and I have a 
file (a JKS certificate) that I have to copy to all executor nodes of my Spark 
cluster (it's needed by another library that uses Spark).

I read the docs and found out that I can use 
SparkSession.builder.config("spark.files", "/path/to/my/precious.file") or 
SparkContext.addFile("/path/to/my/precious.file"), and then call 
SparkFiles.get("precious.file"), which should return the path to the copied 
file that can be used on executors.

Alas, SparkFiles.get("precious.file") gives me a path like

'/tmp/spark-d64b1d73-3745-4362-adb1-586a6228e7a5/userFiles-2f5a6735-f297-489b-b6e1-6ff9b0c3a214/precious.file'

This directory, with the file in it, appears on my driver node but does not 
appear on the executor nodes.

How can this be fixed? Isn't this a bug in Spark?

P.S. I've tried putting the file on HDFS, but the result is the same: it 
appears on the driver node and not on the executors.

P.P.S. Here's the code I used to try to copy the file to the executors:

{{from pyspark.sql import SparkSession}}
{{from pyspark import SparkFiles}}

{{spark = SparkSession.builder.enableHiveSupport()\}}
{{.config("spark.files", "hdfs:///tmp/dummy.txt")\}}
{{.master("yarn").getOrCreate()}}
{{# file_system = FileSystem(spark)  # FileSystem is not defined or imported in this snippet}}

{{# I have uploaded a file "dummy.txt" to HDFS (to hdfs:///tmp/dummy.txt)}}
{{# I want this file from HDFS to be copied to all executors.}}
{{# spark.sparkContext.addFile("hdfs:///tmp/dummy.txt")}}

{{# Let's look where it was uploaded...}}
{{path_on_executors = SparkFiles.get("dummy.txt")}}
{{print(path_on_executors)}}

{{# > '/tmp/spark-6e61ed03-6866-4863-adb6-b7345836caf3/userFiles-28d28f93-e4df-44c3-a1e5-7411d02d4cf5/dummy.txt'}}
{{# This path was created on my driver node but DOES NOT EXIST on my executors.}}
{{# What can I do to fix it?}}

{{# (Quitting programming, moving to Somalia and becoming a pirate is not an option yet).}}
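
For comparison, here is a minimal sketch of how the same check could be run from inside a task instead of on the driver. It assumes that SparkFiles.get() resolves against the local files directory of whichever process calls it, so the path printed on the driver above would not be expected to exist on executor nodes. The helper names probe_file_on_executor and check_distribution are hypothetical, and this is an illustration of the documented API rather than a confirmed diagnosis of this issue.

```python
import os
import socket


def probe_file_on_executor(file_name):
    """Meant to run inside a task: look up file_name where this process keeps added files."""
    # Imported inside the function so the lookup happens in the Python worker
    # that executes the task, not on the driver.
    from pyspark import SparkFiles

    local_path = SparkFiles.get(file_name)
    return (socket.gethostname(), local_path, os.path.exists(local_path))


def check_distribution(spark, source_path, num_tasks=4):
    """Add source_path with addFile, then probe for it from num_tasks tasks.

    `spark` is an existing SparkSession; `source_path` may be a local path
    or a URI such as hdfs:///tmp/dummy.txt (both are assumptions here).
    """
    sc = spark.sparkContext
    sc.addFile(source_path)
    file_name = os.path.basename(source_path)
    # One element per partition, so each task reports where it ran and what it saw.
    return (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: probe_file_on_executor(file_name))
        .collect()
    )
```

Calling check_distribution(spark, "hdfs:///tmp/dummy.txt") with the session from the snippet above would return one (hostname, path, exists) tuple per task, which shows directly whether the file is visible from the processes that actually run the tasks.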


