[ 
https://issues.apache.org/jira/browse/SPARK-41510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad Raviv updated SPARK-41510:
-------------------------------
    Description: 
When working interactively with Spark through notebooks in various environments 
(Databricks/YARN), I often run into a very frustrating process when trying to 
add new Python modules, or change their code, without starting a new Spark 
session/cluster.

On the driver side it is easy to add things like `sys.path.append()`, but if, 
for example, a UDF imports a function from a local module, the pickle boundary 
assumes that the module exists on the workers, and the task fails with 
"python module does not exist..".
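To see why, here is a minimal, Spark-free sketch of the failure mode: functions are pickled by reference (module name + qualified name), so the receiving process must be able to import the module itself. The module name `mymod` and its contents are made up for illustration; the subprocess stands in for a Python worker whose PYTHONPATH lacks the driver-side addition:

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Create a throwaway local module, the kind a notebook user might import.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "mymod.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

sys.path.append(tmp)  # works fine on "the driver"
import mymod

# Pickled by reference as "mymod.double", not by value.
blob = pickle.dumps(mymod.double)

# Simulate a worker that does NOT have the module on its PYTHONPATH:
# sys.path changes in this process do not propagate to the subprocess.
worker = subprocess.run(
    [sys.executable, "-c",
     "import pickle, sys; pickle.loads(sys.stdin.buffer.read())"],
    input=blob, capture_output=True)

print(worker.returncode != 0 and b"mymod" in worker.stderr)  # -> True
```

The unpickling side raises `ModuleNotFoundError: No module named 'mymod'`, which is the worker-side analogue of the error quoted above.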

To update the code "online", I can add an NFS volume to the workers' PYTHONPATH.

However, setting the PYTHONPATH on the workers is not easy, as it gets 
overridden along the way (by Databricks/Spark). A few ugly workarounds have 
been suggested, such as running a "dummy" UDF on the workers to add the folder 
to `sys.path`.
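For reference, that workaround boils down to shipping a path-patching function to every worker and hoping that enough tasks land on each of them. A sketch of the helper (the PySpark call in the comment is illustrative only, and `/mnt/nfs/shared_modules` is a made-up path):

```python
import sys

def prepend_to_sys_path(extra_path):
    # Runs inside each Python worker; idempotent, so repeated "dummy"
    # runs do not pile up duplicate entries.
    if extra_path not in sys.path:
        sys.path.insert(0, extra_path)
    return extra_path in sys.path

# On a cluster, something like:
#   sc.parallelize(range(1000), 1000) \
#     .map(lambda _: prepend_to_sys_path("/mnt/nfs/shared_modules")) \
#     .collect()
# relying on task scheduling to hit every worker. Locally:
print(prepend_to_sys_path("/mnt/nfs/shared_modules"))  # -> True
```

Besides being ugly, this gives no guarantee that every (current or future) worker process actually runs the patch.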

I think all of that could easily be solved if we just added a dedicated Spark 
conf that gets merged into the workers' PYTHONPATH, right here:

[https://github.com/apache/spark/blob/0e2d604fd33c8236cfa8ae243eeaec42d3176a06/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L94]
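To make the proposal concrete, here is a sketch in Python of the merge semantics I have in mind; the conf name `spark.python.worker.extraPythonPath` is hypothetical (it is not an existing Spark setting), and PythonWorkerFactory would do the equivalent in Scala when it builds the worker's PYTHONPATH:

```python
import os

# Hypothetical conf name -- NOT an existing Spark setting.
EXTRA_PATH_CONF = "spark.python.worker.extraPythonPath"

def merged_worker_pythonpath(spark_conf, worker_env):
    """Prepend the user-supplied paths to whatever PYTHONPATH the worker
    would otherwise get, dropping empty entries."""
    extra = spark_conf.get(EXTRA_PATH_CONF, "")
    parts = [p for p in extra.split(os.pathsep) if p]
    parts += [p for p in worker_env.get("PYTHONPATH", "").split(os.pathsep) if p]
    return os.pathsep.join(parts)

conf = {EXTRA_PATH_CONF: "/mnt/nfs/shared_modules"}
env = {"PYTHONPATH": "/opt/spark/python"}
print(merged_worker_pythonpath(conf, env))
# /mnt/nfs/shared_modules:/opt/spark/python  (on POSIX)
```

Prepending (rather than appending) lets user modules shadow stale copies already on the path, which matters for the iterate-without-restart use case above.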

 

Please tell me what you think, and I will open a PR.

Thanks.

 

 

  was:
When working interactively with Spark through notebooks in various envs - 
Databricks/YARN I often encounter a very frustrating process of trying to add 
new python modules and even change their code without starting a new spark 
session/cluster.

in the driver side it is easy to add things like `sys.path.append()` but if for 
example UDF code is importing function from a local module, then the pickle 
boundaries will assume that the module exists in the workers. and then I fail 
on "python module does not exist..".

adding NFS volumes to the workers PYTHONPATH could solve it, but it requires 
restarting the session/cluster and worse doesn't work in all envs as the 
PYTHONPATH gets overridden by someone (databricks/spark) along the way. a few 
ugly work around are suggested like running a "dummy" udf on workers to add the 
folder to the sys.path.

I think all of that could easily be solved if we add a spark.conf to add to the 
worker PYTHONPATH. here:

[https://github.com/apache/spark/blob/0e2d604fd33c8236cfa8ae243eeaec42d3176a06/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L94]

 

please tell me what you think, and I will make the PR.

thanks.

 

 


> Support easy way for user defined PYTHONPATH in workers
> -------------------------------------------------------
>
>                 Key: SPARK-41510
>                 URL: https://issues.apache.org/jira/browse/SPARK-41510
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.1
>            Reporter: Ohad Raviv
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
