[ https://issues.apache.org/jira/browse/SPARK-43289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu reassigned SPARK-43289:
----------------------------------

    Assignee: Weichen Xu

> PySpark UDF supports python package dependencies
> ------------------------------------------------
>
>                 Key: SPARK-43289
>                 URL: https://issues.apache.org/jira/browse/SPARK-43289
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect, ML, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>
> h3. Requirements
>
> Make PySpark UDFs support annotating Python dependencies, so that when a
> UDF is executed, the UDF worker creates a new Python environment with the
> provided Python dependencies.
>
> h3. Motivation
>
> We have two major cases:
>
> * For the Spark Connect case, the client Python environment is very likely
> to differ from the server-side PySpark Python environment, which causes the
> user's UDF to fail when it executes on the PySpark server side.
> * Some third-party machine learning libraries (e.g. MLflow) require PySpark
> UDFs to support dependencies, because in ML cases we need to run model
> inference in a PySpark UDF in exactly the same Python environment that
> trained the model. Currently MLflow supports this by creating a child Python
> process in the PySpark UDF worker and redirecting all UDF input data to that
> child process to run model inference. This causes significant overhead. If
> PySpark UDFs supported built-in Python dependency management, this poorly
> performing approach would no longer be needed.
>
> h3. Proposed API
> ```
> @pandas_udf("string", pip_requirements=...)
> ```
> The `pip_requirements` argument is either an iterable of pip requirement
> strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c
> /path/to/constraints.txt"]``) or the string path to a pip requirements file
> on the local filesystem (e.g. ``"/path/to/requirements.txt"``). It
> represents the pip requirements for the Python UDF.
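
As a rough illustration of the two accepted forms of `pip_requirements` described above, the sketch below normalizes either form into a flat list of requirement lines. It is not taken from the issue or from PySpark: the helper name `normalize_pip_requirements` is hypothetical, and treating any plain string as a file path (and skipping blank and `#` comment lines) is an assumption made here for illustration only.

```python
import os
import tempfile
from typing import Iterable, List, Union


def normalize_pip_requirements(
    pip_requirements: Union[str, Iterable[str]]
) -> List[str]:
    """Hypothetical helper: return the requirements as a list of lines.

    Assumption: a plain string is a path to a requirements file on the
    local filesystem; anything else is an iterable of requirement strings.
    """
    if isinstance(pip_requirements, str):
        # String form: read the requirements file line by line.
        with open(pip_requirements, encoding="utf-8") as f:
            lines = [line.strip() for line in f]
        # Drop blank lines and full-line comments.
        return [line for line in lines if line and not line.startswith("#")]
    # Iterable form: keep each entry as-is, trimmed of outer whitespace.
    return [str(req).strip() for req in pip_requirements]


if __name__ == "__main__":
    # Iterable form, using the example strings from the proposal.
    print(normalize_pip_requirements(
        ["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]
    ))

    # File form, using a temporary file in place of a real requirements file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("# pinned for the demo\nscikit-learn\n\nmlflow\n")
    print(normalize_pip_requirements(tmp.name))
    os.remove(tmp.name)
```

Entries such as `-r /path/to/req2.txt` are passed through unchanged here; resolving nested requirement or constraint files would be left to pip itself.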