[
https://issues.apache.org/jira/browse/SPARK-47228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Golovizin updated SPARK-47228:
-------------------------------------
Description:
Hi folks! And thank you for your brilliant project.
I've been using PySpark for a couple of years, and last week I encountered a
problem that I don't know how to solve.
{{I'm using a Spark session in client mode (with .master("yarn")), and I have a
file (a JKS certificate) that I have to copy to all executor nodes of my Spark
cluster (it's needed by another library that uses Spark).}}
{{I have read some docs and found out that I can use
SparkSession.builder.config("spark.files", "/path/to/my/precious.file") or
SparkContext.addFile("/path/to/my/precious.file") – and after that call
SparkFiles.get("precious.file") – and it will return a path to the copied file
that can be used on executors.}}
Alas, SparkFiles.get("precious.file") gives me a path like
{{'/tmp/spark-d64b1d73-3745-4362-adb1-586a6228e7a5/userFiles-2f5a6735-f297-489b-b6e1-6ff9b0c3a214/precious.file'}}
and a directory with this file appears on my driver node – but does not appear
on the executor nodes.
How can it be fixed? Isn't this a bug in Spark?
P.S. I've tried putting the file on HDFS, but it gives the same result: the
file appears on the driver node but not on the executors.
P.P.S. Here's the code with which I was trying to copy files to executors:
{{from pyspark.sql import SparkSession}}
{{from pyspark import SparkFiles}}
{{spark = SparkSession.builder.enableHiveSupport() \}}
{{    .config("spark.files", "hdfs:///tmp/dummy.txt") \}}
{{    .master("yarn").getOrCreate()}}
{{# I have uploaded a file "dummy.txt" to HDFS (to hdfs:///tmp/dummy.txt)}}
{{# I want this file from HDFS to be copied to all executors.}}
{{# spark.sparkContext.addFile("hdfs:///tmp/dummy.txt")}}
{{# Let's look where it was uploaded...}}
{{path_on_executors = SparkFiles.get("dummy.txt")}}
{{print(path_on_executors)}}
{{# > '/tmp/spark-6e61ed03-6866-4863-adb6-b7345836caf3/userFiles-28d28f93-e4df-44c3-a1e5-7411d02d4cf5/dummy.txt'}}
{{# This path was created on my driver node but DOES NOT EXIST on my executors.}}
{{# What can I do to fix it?}}
{{# (Quitting programming, moving to Somalia and becoming a pirate is not an option yet).}}
> spark.files are not copied to Spark Executors in client session mode
> --------------------------------------------------------------------
>
> Key: SPARK-47228
> URL: https://issues.apache.org/jira/browse/SPARK-47228
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 2.4.5, 3.3.4
> Environment: Ubuntu 22.04, Python 3.7
> Quickstart VM with CDH 6.3.2
> Reporter: Alexey Golovizin
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)