Alexey Golovizin created SPARK-47228:
----------------------------------------
Summary: spark.files are not copied to Spark Executors in client
session mode
Key: SPARK-47228
URL: https://issues.apache.org/jira/browse/SPARK-47228
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core
Affects Versions: 3.3.4, 2.4.5
Environment: Ubuntu 22.04, Python 3.7
Quickstart VM with CDH 6.3.2
Reporter: Alexey Golovizin
Hi folks! And thank you for your brilliant project.
I have been using PySpark for a couple of years, and last week I ran into a
problem that I don't know how to solve.
I'm using a Spark session in client mode (with .master("yarn")) and I have a
file (a JKS certificate) that I have to copy to all executor nodes of my Spark
cluster (it's needed by another library that uses Spark).
I have read some docs and found out that I can use
SparkSession.builder.config("spark.files", "/path/to/my/precious.file") or
SparkContext.addFile("/path/to/my/precious.file"), and after that call
SparkFiles.get("precious.file"), which should return the path to the copied
file that can be used on executors.
Alas, SparkFiles.get("precious.file") gives me a path like
'/tmp/spark-d64b1d73-3745-4362-adb1-586a6228e7a5/userFiles-2f5a6735-f297-489b-b6e1-6ff9b0c3a214/precious.file'.
A directory with this file appears on my driver node but does not appear
on the executor nodes.
How can it be fixed? Isn't this a bug in Spark?
P.S. I've tried putting the file on HDFS, but it gives the same result: the
file appears on the driver node and not on the executors.
P.P.S. Here's the code I used to try to copy files to the executors:
{{from pyspark.sql import SparkSession}}
{{from pyspark import SparkFiles}}
{{spark = SparkSession.builder.enableHiveSupport()\}}
{{.config("spark.files", "hdfs:///tmp/dummy.txt")\}}
{{.master("yarn").getOrCreate()}}
{{# file_system = FileSystem(spark)  # FileSystem is not defined in this snippet and is unused below}}
{{# I have uploaded a file "dummy.txt" to HDFS (to hdfs:///tmp/dummy.txt)}}
{{# I want this file from HDFS to be copied to all executors.}}
{{# spark.sparkContext.addFile("hdfs:///tmp/dummy.txt")}}
{{# Let's look where it was uploaded...}}
{{path_on_executors = SparkFiles.get("dummy.txt")}}
{{print(path_on_executors)}}
{{# > '/tmp/spark-6e61ed03-6866-4863-adb6-b7345836caf3/userFiles-28d28f93-e4df-44c3-a1e5-7411d02d4cf5/dummy.txt'}}
{{# This path was created on my driver node but DOES NOT EXIST on my executors.}}
{{# What can I do to fix it?}}
{{# (Quitting programming, moving to Somalia and becoming a pirate is not an option yet).}}
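For anyone reproducing this, here is a minimal sketch of how the executor side could be inspected. It rests on two assumptions that are not confirmed in this report: that SparkFiles.get() resolves the name against the root directory of whichever process calls it (so a call on the driver yields a driver-local path), and that executors only fetch the file once a task actually runs on them. The helper names probe_spark_file, summarize and check_executors are hypothetical and not part of PySpark.

```python
import os
import socket


def probe_spark_file(name):
    """Meant to run inside a task: report where SparkFiles resolves `name` there."""
    # Imported inside the function so the lookup happens in the executor's
    # Python worker, not on the driver.
    from pyspark import SparkFiles

    path = SparkFiles.get(name)
    return (socket.gethostname(), path, os.path.exists(path))


def summarize(reports):
    """Collapse per-task reports into one entry per (host, path)."""
    seen = {}
    for host, path, exists in reports:
        # The file counts as present only if every task at that path saw it.
        seen[(host, path)] = seen.get((host, path), True) and exists
    return sorted((host, path, ok) for (host, path), ok in seen.items())


def check_executors(spark, name, num_tasks=8):
    """Launch `num_tasks` small tasks and collect what each one sees."""
    sc = spark.sparkContext
    reports = (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: probe_spark_file(name))
        .collect()
    )
    return summarize(reports)
```

With the session from the snippet above, check_executors(spark, "dummy.txt") returns a list of (host, path, exists) tuples, one per distinct host and path. If the assumptions hold, the paths reported by the tasks will differ from the driver-side path printed earlier, which would explain why that exact directory is never found on the executor nodes. Tasks are not guaranteed to land on every executor, so a larger num_tasks may be needed to cover the whole cluster.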