[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582071#comment-17582071 ]
Xiangrui Meng edited comment on SPARK-38648 at 8/19/22 10:55 PM:
-----------------------------------------------------------------
I had an offline discussion with [~leewyang]. Summary: We might not need to introduce a new package in Spark with dependencies on DL frameworks. Instead, we can provide abstractions in pyspark.ml that implement the common data operations needed by DL inference, e.g., batching, tensor conversion, pipelining, etc.

For example, we could define the following API (just to illustrate the idea, not a proposal for the final API):
{code:python}
def dl_model_udf(
    predict_fn: Callable[[pd.DataFrame], pd.DataFrame],  # need to discuss the data format
    batch_size: int,
    input_tensor_shapes: Dict[str, List[int]],
    output_data_type,
    preprocess_fn,
    ...
) -> PandasUDF
{code}
Users only need to supply predict_fn, which could return a (wrapped) TensorFlow model, a PyTorch model, or an MLflow model. Users are responsible for package dependency management and model loading logic. This doesn't cover everything proposed in the original SPIP, but it does save users the boilerplate of creating batches over Iterator[DataFrame], converting 1-d arrays to tensors, and running preprocessing (CPU) and prediction (GPU) asynchronously. If we go in this direction, I don't feel the change needs an SPIP, because it introduces neither a new Spark package nor new dependencies. It is just a wrapper over pandas_udf for DL inference.
> SPIP: Simplified API for DL Inferencing
> ---------------------------------------
>
> Key: SPARK-38648
> URL: https://issues.apache.org/jira/browse/SPARK-38648
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Lee Yang
> Priority: Minor
>
> h1. Background and Motivation
> The deployment of deep learning (DL) models to Spark clusters can be a point of friction today. DL practitioners often aren't well-versed with Spark, and Spark experts often aren't well-versed with the fast-changing DL frameworks. Currently, the deployment of trained DL models is done in a fairly ad-hoc manner, with each model integration usually requiring significant effort.
> To simplify this process, we propose adding an integration layer for each major DL framework that can introspect their respective saved models to more easily integrate these models into Spark applications. You can find a detailed proposal here:
> [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0]
>
> h1.
Goals
> - Simplify the deployment of pre-trained single-node DL models to Spark inference applications.
> - Follow pandas_udf for simple inference use-cases.
> - Follow Spark ML Pipelines APIs for transfer-learning use-cases.
> - Enable integrations with popular third-party DL frameworks like TensorFlow, PyTorch, and Huggingface.
> - Focus on PySpark, since most of the DL frameworks use Python.
> - Take advantage of built-in Spark features like GPU scheduling and Arrow integration.
> - Enable inference on both CPU and GPU.
> h1. Non-goals
> - DL model training.
> - Inference w/ distributed models, i.e. "model parallel" inference.
> h1. Target Personas
> - Data scientists who need to deploy DL models on Spark.
> - Developers who need to deploy DL models on Spark.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
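To make the wrapper idea in the comment above concrete, here is a minimal, framework-agnostic sketch of the two data operations such a helper would hide: re-chunking an Iterator[pd.DataFrame] into fixed-size batches, and reshaping flattened 1-d array columns back into tensors. All names here (dl_model_udf, batch_iterator, to_tensors) and the plain-Python driver are illustrative assumptions, not the proposed Spark API; in Spark this logic would sit inside a pandas_udf over Iterator[pd.DataFrame], and async CPU preprocessing / GPU prediction pipelining is omitted.

```python
from typing import Callable, Dict, Iterator, List, Optional

import numpy as np
import pandas as pd


def batch_iterator(dfs: Iterator[pd.DataFrame], batch_size: int) -> Iterator[pd.DataFrame]:
    """Re-chunk an iterator of arbitrarily-sized DataFrames into fixed-size batches."""
    buffer: List[pd.DataFrame] = []
    n = 0
    for df in dfs:
        buffer.append(df)
        n += len(df)
        while n >= batch_size:
            combined = pd.concat(buffer, ignore_index=True)
            yield combined.iloc[:batch_size]
            rest = combined.iloc[batch_size:]
            buffer, n = [rest], len(rest)
    if n > 0:
        # final partial batch
        yield pd.concat(buffer, ignore_index=True)


def to_tensors(batch: pd.DataFrame, input_tensor_shapes: Dict[str, List[int]]) -> Dict[str, np.ndarray]:
    """Convert columns of flattened 1-d arrays back into shaped tensors (one per column)."""
    tensors = {}
    for col in batch.columns:
        values = batch[col].to_numpy()
        # object dtype means each cell holds a list/array; stack them into one ndarray
        arr = np.stack(values) if batch[col].dtype == object else values
        if col in input_tensor_shapes:
            arr = arr.reshape([len(batch)] + input_tensor_shapes[col])
        tensors[col] = arr
    return tensors


def dl_model_udf(
    predict_fn: Callable[[Dict[str, np.ndarray]], pd.DataFrame],
    batch_size: int,
    input_tensor_shapes: Optional[Dict[str, List[int]]] = None,
) -> Callable[[Iterator[pd.DataFrame]], Iterator[pd.DataFrame]]:
    """Wrap predict_fn so it sees fixed-size, tensor-shaped batches.

    The returned function has the Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]
    shape that Spark's iterator-style pandas_udf expects.
    """
    def infer(dfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for batch in batch_iterator(dfs, batch_size):
            yield predict_fn(to_tensors(batch, input_tensor_shapes or {}))
    return infer
```

On a cluster, `infer` would be decorated with pandas_udf and applied to a DataFrame column of flattened arrays; predict_fn is where the user loads and calls their TensorFlow/PyTorch/MLflow model, keeping dependency management on the user side as the comment proposes.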