[ https://issues.apache.org/jira/browse/SPARK-41510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646620#comment-17646620 ]
Hyukjin Kwon commented on SPARK-41510:
--------------------------------------

What about using Conda (https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html) or adding Python files via --py-files? It would be great to elaborate on the usage.

> Support easy way for user defined PYTHONPATH in workers
> -------------------------------------------------------
>
>                 Key: SPARK-41510
>                 URL: https://issues.apache.org/jira/browse/SPARK-41510
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.1
>            Reporter: Ohad Raviv
>            Priority: Minor
>
> When working interactively with Spark through notebooks in various envs
> (Databricks/YARN), I often encounter a very frustrating process when trying
> to add new Python modules, or even change their code, without starting a new
> Spark session/cluster.
> On the driver side it is easy to add things like `sys.path.append()`, but if,
> for example, UDF code imports a function from a local module, then pickling
> assumes that the module exists on the workers and fails with
> "python module does not exist..".
> To update the code "online" I can add an NFS volume to the workers' PYTHONPATH.
> However, setting the PYTHONPATH on the workers is not easy, as it gets
> overridden by something (Databricks/Spark) along the way. A few ugly
> workarounds are suggested, like running a "dummy" UDF on the workers to add
> the folder to sys.path.
> I think all of that could easily be solved if we just add a dedicated
> `spark.conf` that will get merged into the worker's PYTHONPATH, right here:
> [https://github.com/apache/spark/blob/0e2d604fd33c8236cfa8ae243eeaec42d3176a06/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L94]
>
> Please tell me what you think, and I will make the PR.
> Thanks.
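For illustration only, the "dummy" task workaround mentioned in the issue might look roughly like the sketch below. The helper names `make_path_task` and `register_on_workers`, and the default of 64 tasks, are hypothetical and not part of Spark. The sketch assumes the directory (for example an NFS mount) is visible at the same path on every worker. It is not the dedicated configuration proposed in the issue.

```python
import os


def make_path_task(directory):
    """Return a mapPartitions-style function that puts `directory` on sys.path.

    Hypothetical helper, not a Spark API.
    """
    directory = os.path.abspath(directory)

    def _task(_partition):
        # Imported inside the task so the closure carries no module-level state.
        import sys
        added = directory not in sys.path
        if added:
            sys.path.insert(0, directory)
        # One boolean per partition: True if this process newly added the path.
        yield added

    return _task


def register_on_workers(spark, directory, num_tasks=64):
    """Best-effort: run the task in `num_tasks` partitions of a throwaway RDD.

    Returns how many tasks newly added the directory. Python worker processes
    started later will not have it, so this may need to be re-run.
    """
    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return rdd.mapPartitions(make_path_task(directory)).filter(bool).count()
```

With an active session this would be called as `register_on_workers(spark, "/path/to/shared/modules")`, where the path is a placeholder. There is no guarantee that every worker process runs one of the tasks, which is part of why the issue calls this kind of workaround ugly. For code that does not need to change during the session, the suggested `--py-files` option, or `SparkContext.addPyFile` at runtime, ships the files to the executors instead.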