[ 
https://issues.apache.org/jira/browse/SPARK-43289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43289:
----------------------------------

    Assignee: Weichen Xu

> PySpark UDF supports python package dependencies
> ------------------------------------------------
>
>                 Key: SPARK-43289
>                 URL: https://issues.apache.org/jira/browse/SPARK-43289
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect, ML, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>
> h3. Requirements
>  
> Let a PySpark UDF declare its Python dependencies. When the UDF is 
> executed, the UDF worker creates a new Python environment with the declared 
> dependencies.
> h3. Motivation
>  
> We have two major cases:
>  
>  * With Spark Connect, the client Python environment is very likely to 
> differ from the server-side PySpark Python environment, which causes the 
> user's UDF to fail when it executes on the server side.
>  * Some third-party machine learning libraries (e.g. MLflow) require PySpark 
> UDFs to support dependencies, because in ML use cases model inference must 
> run in a PySpark UDF in exactly the same Python environment that trained 
> the model. MLflow currently handles this by creating a child Python process 
> in the PySpark UDF worker and redirecting all UDF input data to that child 
> process to run model inference, which causes significant overhead. If 
> PySpark UDFs supported built-in Python dependency management, this poorly 
> performing approach would no longer be needed.
>  
> h3. Proposed API
> ```
> @pandas_udf("string", pip_requirements=...)
> ```
> The `pip_requirements` argument specifies the pip requirements for the 
> Python UDF. It is either an iterable of pip requirement strings (e.g. 
> ``["scikit-learn", "-r /path/to/req2.txt", "-c 
> /path/to/constraints.txt"]``) or the string path to a pip requirements file 
> on the local filesystem (e.g. ``"/path/to/requirements.txt"``).
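
To illustrate the two accepted forms of `pip_requirements`, here is a minimal sketch of how such an argument might be normalized into command-line arguments for `pip install`. The helper name `normalize_pip_requirements` is hypothetical and is not part of PySpark or of this proposal; the issue does not say how the argument would be processed, so the handling of option entries such as `-r` and `-c` is an assumption.

```python
import os
from typing import Iterable, List, Union


def normalize_pip_requirements(
    pip_requirements: Union[str, Iterable[str]],
) -> List[str]:
    """Turn either accepted form into a flat list of `pip install` arguments.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    if isinstance(pip_requirements, str):
        # A single string is interpreted as the path to a requirements file.
        return ["-r", pip_requirements]

    args: List[str] = []
    for entry in pip_requirements:
        entry = entry.strip()
        if entry.startswith("-"):
            # Option entries such as "-r /path/to/req2.txt" carry a flag and
            # a value; split once so each becomes its own argument.
            args.extend(entry.split(None, 1))
        else:
            # Plain requirement strings are passed through unchanged, so
            # specifiers containing spaces (e.g. environment markers) survive.
            args.append(entry)
    return args


if __name__ == "__main__":
    print(normalize_pip_requirements(
        ["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]
    ))
    print(normalize_pip_requirements("/path/to/requirements.txt"))
```

Under these assumptions, both forms reduce to the same kind of argument list, which a worker could append to a `pip install` invocation when building the environment for the UDF.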



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

