[ https://issues.apache.org/jira/browse/SPARK-41510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646734#comment-17646734 ]

Ohad Raviv commented on SPARK-41510:
------------------------------------

The conda solution is more suited to "static" packages.

The scenario is this:

We're developing a Python library (more than one .py file) and want to do it 
interactively in notebooks. So we keep all the modules in some folder and add 
that folder to the driver's sys.path.

Then, if we use a function from the module inside a UDF, for example, we get: 
"ModuleNotFoundError: No module named 'some_module'".

The reason is that some_module is not on the PYTHONPATH/sys.path of the 
workers, even though the code itself is accessible to the workers, for example 
in a shared NFS folder.
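A minimal sketch of the failing scenario (the module name `some_module`, the 
function it provides, and the NFS path are placeholders; it assumes an active 
SparkSession named `spark`):

```

import sys

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Driver side: make the package under development importable in the notebook.
sys.path.insert(0, "/shared_nfs/my_folder")
import some_module  # works on the driver

@udf(returnType=IntegerType())
def my_udf(x):
    # Runs on the workers, where /shared_nfs/my_folder is not on sys.path,
    # so unpickling the UDF fails to import some_module there.
    return some_module.some_function(x)

spark.range(10).select(my_udf("id")).show()
# => ModuleNotFoundError: No module named 'some_module'

```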

So all we need now is to add the path.

We can do it inside the UDF with something like:

```

# Inside the UDF: add the shared folder to the worker's sys.path on first use.
import sys

if "/shared_nfs/my_folder" not in sys.path:
    sys.path.insert(0, "/shared_nfs/my_folder")

```
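For completeness, this is roughly how that workaround ends up looking when 
embedded in the UDF itself (same placeholders as above):

```

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def my_udf(x):
    # Patch the worker's sys.path on first use, then import the module there.
    import sys
    if "/shared_nfs/my_folder" not in sys.path:
        sys.path.insert(0, "/shared_nfs/my_folder")
    import some_module
    return some_module.some_function(x)

```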

But that is both very ugly and only a partial solution, as it works only in the 
UDF case.

The suggestion is to have some kind of mechanism to easily add a folder to the 
workers' sys.path.
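From the user's side, such a mechanism could be as simple as a session conf. 
The config name below is invented purely for illustration of the proposal and 
does not exist in Spark today:

```

from pyspark.sql import SparkSession

# Hypothetical config, named here only to illustrate the idea: a value that
# would get merged into the Python workers' PYTHONPATH when they are launched.
spark = (
    SparkSession.builder
    .config("spark.python.worker.extraPythonPath", "/shared_nfs/my_folder")
    .getOrCreate()
)

```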

The option of wrapping the code in a zip/egg and adding it makes for a very 
long development cycle, and requires restarting the Spark session, which makes 
the notebook lose its state.
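For contrast, the zip-based cycle looks roughly like this (paths are 
placeholders, and an active SparkSession named `spark` is assumed); 
`SparkContext.addPyFile` is the standard API, but every change to the library 
means re-zipping and, in practice, a fresh session to pick it up cleanly:

```

import shutil

# Re-package the library after every edit and ship it to the executors.
shutil.make_archive("/tmp/my_package", "zip", "/shared_nfs/my_folder")
spark.sparkContext.addPyFile("/tmp/my_package.zip")

```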

With the suggestion above we could actually edit the Python package 
interactively and see the changes almost immediately.

Hope it is clearer now.

 

 

> Support easy way for user defined PYTHONPATH in workers
> -------------------------------------------------------
>
>                 Key: SPARK-41510
>                 URL: https://issues.apache.org/jira/browse/SPARK-41510
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.1
>            Reporter: Ohad Raviv
>            Priority: Minor
>
> When working interactively with Spark through notebooks in various 
> environments (Databricks/YARN), I often encounter a very frustrating process 
> when trying to add new Python modules, or even change their code, without 
> starting a new Spark session/cluster.
> On the driver side it is easy to do things like `sys.path.append()`, but if, 
> for example, UDF code imports a function from a local module, then the pickle 
> boundary will assume that the module exists on the workers, and fail with 
> "python module does not exist..".
> To update the code "online" I can add an NFS volume to the workers' PYTHONPATH.
> However, setting the PYTHONPATH on the workers is not easy, as it gets 
> overridden by someone (Databricks/Spark) along the way. A few ugly 
> workarounds are suggested, like running a "dummy" UDF on the workers to add 
> the folder to sys.path.
> I think all of that could easily be solved if we just add a dedicated 
> `spark.conf` that will get merged into the worker's PYTHONPATH, just here:
> [https://github.com/apache/spark/blob/0e2d604fd33c8236cfa8ae243eeaec42d3176a06/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L94]
>  
> Please tell me what you think, and I will make the PR.
> Thanks.
>  
>  


