[jira] [Comment Edited] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582071#comment-17582071 ] Xiangrui Meng edited comment on SPARK-38648 at 8/19/22 10:55 PM: - I had an offline discussion with [~leewyang]. Summary: We might not need to introduce a new package in Spark with dependencies on DL frameworks. Instead, we can provide abstractions in pyspark.ml that implement the common data operations needed by DL inference, e.g., batching, tensor conversion, and pipelining. For example, we could define the following API (just to illustrate the idea, not to propose the final API):
{code:python}
def dl_model_udf(
    predict_fn: Callable[[pd.DataFrame], pd.DataFrame],  # need to discuss the data format
    batch_size: int,
    input_tensor_shapes: Dict[str, List[int]],
    output_data_type,
    preprocess_fn,
    ...
) -> PandasUDF
{code}
Users only need to supply predict_fn, which could return a (wrapped) TensorFlow model, a PyTorch model, or an MLflow model. Users are responsible for package dependency management and model loading logic. This doesn't cover everything proposed in the original SPIP, but it does save users the boilerplate of creating batches over Iterator[DataFrame], converting 1-d arrays to tensors, and asynchronously overlapping preprocessing (CPU) with prediction (GPU). If we go in this direction, I don't feel the change needs an SPIP because it introduces neither a new Spark package nor new dependencies. It is just a wrapper over pandas_udf for DL inference.
> SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications. You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1.
Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010)
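To make the boilerplate savings mentioned in the comment concrete, here is a minimal sketch (hypothetical helper names, not the proposed API) of the two data operations such a wrapper would hide from users: re-batching an iterator of arbitrarily sized pandas DataFrames to a fixed batch size, and stacking a column of flattened 1-d arrays back into a tensor of a declared shape:

```python
from typing import Iterator, List

import numpy as np
import pandas as pd

def rebatch(it: Iterator[pd.DataFrame], batch_size: int) -> Iterator[pd.DataFrame]:
    """Yield fixed-size batches from an iterator of arbitrarily sized DataFrames."""
    buf: List[pd.DataFrame] = []
    n = 0
    for df in it:
        buf.append(df)
        n += len(df)
        while n >= batch_size:
            merged = pd.concat(buf, ignore_index=True)
            yield merged.iloc[:batch_size]
            rest = merged.iloc[batch_size:]
            buf, n = [rest], len(rest)
    if n > 0:
        yield pd.concat(buf, ignore_index=True)

def to_tensor(col: pd.Series, shape: List[int]) -> np.ndarray:
    """Stack a column of flattened 1-d arrays into an (n, *shape) tensor."""
    return np.stack([np.asarray(v).reshape(shape) for v in col])

# Two uneven input batches (5 rows + 2 rows) re-batched to size 3.
batches = list(rebatch(iter([pd.DataFrame({"x": range(5)}),
                             pd.DataFrame({"x": range(5, 7)})]), batch_size=3))
# Two flattened 4-element arrays restored to 2x2 tensors.
tensor = to_tensor(pd.Series([[1, 2, 3, 4], [5, 6, 7, 8]]), [2, 2])
```

A predict_fn would then only see uniformly sized batches and properly shaped tensors, regardless of how Spark partitioned the input.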
[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528862#comment-17528862 ] Xiangrui Meng commented on SPARK-38648: --- I think it is beneficial to both Spark and DL frameworks if Spark has state-of-the-art DL capabilities. We did some work in the past to make Spark work better with DL frameworks, e.g., the iterator Scalar Pandas UDF, barrier mode, and GPU scheduling. But most of them are low-level APIs for developers, not end users. Our Spark user guide contains little about DL and AI. The dependency on DL frameworks might create issues. One idea is to develop in the Spark repo and Spark namespace but publish to PyPI independently. For example, in order to use DL features, users would need to explicitly install `pyspark-dl` and then use the features under the `pyspark.dl` namespace. Putting development inside Spark and publishing under the Spark namespace would help drive both development and adoption. > SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications.
You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1. Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-37004: -- Description: Spark 3.2.0 turned on py4j pinned thread mode by default (SPARK-35303). However, in a Jupyter notebook, after I cancel (interrupt) a long-running Spark job, the next Spark command fails with py4j errors. See the attached notebook for a repro. I cannot reproduce the issue after turning off pinned thread mode. > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default (SPARK-35303). > However, in a Jupyter notebook, after I cancel (interrupt) a long-running > Spark job, the next Spark command fails with py4j errors. See the attached > notebook for a repro. I cannot reproduce the issue after turning off pinned > thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
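For anyone reproducing this, the workaround referenced above (turning pinned thread mode off) is controlled by the PYSPARK_PIN_THREAD environment variable; a minimal sketch, assuming it is set before the py4j gateway is created (e.g., at the top of the notebook, before any Spark import):

```python
import os

# Disable py4j pinned thread mode (on by default since Spark 3.2.0,
# SPARK-35303). Must be set before the JVM gateway is launched.
os.environ["PYSPARK_PIN_THREAD"] = "false"
```

Setting the variable in an already-running Spark session has no effect; the driver must be restarted.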
[jira] [Updated] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-37004: -- Issue Type: Bug (was: Improvement) > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default. However, in a Jupyter > notebook, after I cancel (interrupt) a long-running Spark job, the next Spark > command fails with py4j errors. See the attached notebook for a repro. > I cannot reproduce the issue after turning off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-37004: -- Attachment: pinned.ipynb > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default. However, in a Jupyter > notebook, after I cancel (interrupt) a long-running Spark job, the next Spark > command fails with py4j errors. See the attached notebook for a repro. > I cannot reproduce the issue after turning off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
Xiangrui Meng created SPARK-37004: - Summary: Job cancellation causes py4j errors on Jupyter due to pinned thread mode Key: SPARK-37004 URL: https://issues.apache.org/jira/browse/SPARK-37004 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Xiangrui Meng Spark 3.2.0 turned on py4j pinned thread mode by default. However, in a Jupyter notebook, after I cancel (interrupt) a long-running Spark job, the next Spark command fails with py4j errors. See the attached notebook for a repro. I cannot reproduce the issue after turning off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-36578: - Assignee: Huaxin Gao > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: Xiangrui Meng >Assignee: Huaxin Gao >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404095#comment-17404095 ] Xiangrui Meng commented on SPARK-36578: --- [~huaxingao] I saw some possible improvements to the API doc. It would be great if you could address them when you get some time. Not blocking the 3.2 release. > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
Xiangrui Meng created SPARK-36578: - Summary: Minor UnivariateFeatureSelector API doc improvement Key: SPARK-36578 URL: https://issues.apache.org/jira/browse/SPARK-36578 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.2.0 Reporter: Xiangrui Meng * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with "The user can set ...". The first sentence in the ScalaDoc should always describe what this class is. * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-36578: -- Affects Version/s: 3.1.1 > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-36578: -- Target Version/s: (was: 3.2.0) > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400644#comment-17400644 ] Xiangrui Meng commented on SPARK-34415: --- [~phenry] [~srowen] The implementation doesn't do uniform sampling of the hyper-parameter search space. Instead, it samples each param independently and then constructs the cartesian product of all combinations. I think this would significantly reduce the effectiveness of the random search. Was it already discussed? > Use randomization as a possibly better technique than grid search in > optimizing hyperparameters > --- > > Key: SPARK-34415 > URL: https://issues.apache.org/jira/browse/SPARK-34415 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 3.0.1 >Reporter: Phillip Henry >Assignee: Phillip Henry >Priority: Minor > Labels: pull-request-available > Fix For: 3.2.0 > > > Randomization can be a more effective technique than a grid search in > finding optimal hyperparameters since min/max points can fall between the > grid lines and never be found. Randomization is not so restricted, although > the probability of finding minima/maxima depends on the number of > attempts. > Alice Zheng has an accessible description of how this technique works at > [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html] > (Note that I have a PR for this work outstanding at > [https://github.com/apache/spark/pull/31535] ) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
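The difference the comment raises can be sketched in a few lines (illustrative only; the function names and the two hyperparameters are hypothetical, not Spark's API). Sampling per param and then taking the cartesian product produces a randomized grid built from only a handful of distinct values per axis, whereas true random search draws each candidate point independently from the full space:

```python
import itertools
import random

random.seed(0)

def per_param_then_cartesian(n_per_param):
    # What the comment says the implementation does: sample each
    # hyperparameter independently, then cross all the samples.
    lr = [random.uniform(1e-4, 1e-1) for _ in range(n_per_param)]
    reg = [random.uniform(0.0, 1.0) for _ in range(n_per_param)]
    return list(itertools.product(lr, reg))

def joint_uniform(n_points):
    # What random search normally assumes: each candidate point is drawn
    # uniformly from the full search space, independently of the others.
    return [(random.uniform(1e-4, 1e-1), random.uniform(0.0, 1.0))
            for _ in range(n_points)]

grid = per_param_then_cartesian(3)  # 9 candidates built from only 3 lr values
points = joint_uniform(9)           # 9 candidates with 9 distinct lr values
```

For the same budget of 9 evaluations, the cartesian variant explores only 3 distinct values along each axis, which is exactly the grid-search weakness randomization is meant to avoid.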
[jira] [Resolved] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24374. --- Fix Version/s: 2.4.0 Target Version/s: 2.4.0 Resolution: Fixed I'm marking this epic jira as done given the major feature was implemented in 2.4. > SPIP: Support Barrier Execution Mode in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Fix For: 2.4.0 > > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
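The scheduling model quoted above (all tasks launch together, then pass messages) can be illustrated with a conceptual sketch; this uses Python's threading.Barrier as a stand-in for Spark's barrier stage, so it shows the semantics only, not Spark's actual API:

```python
import threading

# Conceptual sketch: n_tasks "tasks" all launch, rendezvous at a barrier,
# and only then begin their compute/message-passing phase -- the property
# MPI-style workloads need and ordinary MapReduce scheduling doesn't give.
n_tasks = 4
barrier = threading.Barrier(n_tasks)
events = []
lock = threading.Lock()

def task(i):
    with lock:
        events.append(("launched", i))
    barrier.wait()  # no task proceeds until all n_tasks have arrived
    with lock:
        events.append(("compute", i))

threads = [threading.Thread(target=task, args=(i,)) for i in range(n_tasks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because of the barrier, every "launched" event precedes every "compute" event, mirroring the all-or-nothing gang scheduling the SPIP describes (and the stage-level restart on failure follows from the same property).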
[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263006#comment-17263006 ] Xiangrui Meng commented on SPARK-34080: --- Not sure if we have time for the 3.1.1 release. But if there are other release blockers, it would be great if we could get the changes in without having to deprecate the APIs later. > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide the score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical), and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions compatible > with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
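The default-lookup behavior the description proposes can be sketched as follows; this is illustrative only (the mapping and names here are assumptions for the example, not the finalized Spark defaults): the (featureType, labelType) combo picks a default score function so users need not know the statistical test names:

```python
# Hypothetical default table keyed on (featureType, labelType); combos
# without an entry (e.g. categorical feature with continuous label here)
# would require the user to supply a score function explicitly.
DEFAULT_SCORE_FN = {
    ("categorical", "categorical"): "chi-squared",
    ("continuous", "categorical"): "ANOVA F-test",
    ("continuous", "continuous"): "F-value (regression)",
}

def default_score_function(feature_type: str, label_type: str) -> str:
    key = (feature_type, label_type)
    if key not in DEFAULT_SCORE_FN:
        raise ValueError(f"no default score function for combo {key}")
    return DEFAULT_SCORE_FN[key]
```

Advanced users overriding the default, or type detection from feature metadata, would simply bypass or pre-populate this lookup.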
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Affects Version/s: 3.1.1 > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0, 3.1.1 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide the score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical), and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions compatible > with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Description: In SPARK-26111, we introduced a few univariate feature selectors, which share a common set of params. They are named after the underlying test, which requires users to understand the test to find the matching scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to provide the score function to use, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override it (if there are multiple score functions compatible with the feature type and label type combo). Example (param names are not finalized): {code} selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", select="bestK", k=100) {code} cc: [~huaxingao] [~ruifengz] [~weichenxu123] > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide the score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical), and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions compatible > with the feature type and label type combo). > Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Description: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
cc: [~huaxingao] [~ruifengz] [~weichenxu123]

was: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
cc: [~huaxingao] @huaxin

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 3.2.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors.
> For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo).
> Example (param names are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Description: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
cc: [~huaxingao] @huaxin

was: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 3.2.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors.
> For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo).
> Example (param names are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
> {code}
> cc: [~huaxingao] @huaxin
[jira] [Created] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
Xiangrui Meng created SPARK-34080: - Summary: Add UnivariateFeatureSelector to deprecate existing selectors Key: SPARK-34080 URL: https://issues.apache.org/jira/browse/SPARK-34080 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.2.0 Reporter: Xiangrui Meng

In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
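To make the type-combo idea above concrete, here is a minimal, hypothetical sketch of the default score-function lookup the proposal describes. The combo-to-test mapping and the `resolve_score_function` helper are illustrative assumptions, not the final Spark API.

```python
# Hypothetical sketch (not the final Spark API): the default score function is
# looked up from the (featureType, labelType) combo, and advanced users may
# override it when several score functions fit the same combo.
DEFAULT_SCORE_FUNCTIONS = {
    ("categorical", "categorical"): "chi-squared test",
    ("continuous", "categorical"): "ANOVA F-test",
    ("continuous", "continuous"): "F-value (regression)",
}

def resolve_score_function(feature_type, label_type, override=None):
    """Return the score function name for a feature/label type combo."""
    if override is not None:
        return override
    combo = (feature_type, label_type)
    if combo not in DEFAULT_SCORE_FUNCTIONS:
        raise ValueError(f"no default score function for combo {combo}")
    return DEFAULT_SCORE_FUNCTIONS[combo]
```

This keeps the common path declarative (users state their data types) while still leaving an escape hatch for users who know exactly which test they want.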
[jira] [Commented] (SPARK-32933) Use keyword-only syntax for keyword_only methods
[ https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200580#comment-17200580 ] Xiangrui Meng commented on SPARK-32933: --- Why do we keep using the @keyword_only annotation after switching to the `*` syntax? I understand that we do not want to remove the keyword_only definition at https://github.com/apache/spark/blob/master/python/pyspark/__init__.py#L101 because it is widely used outside Spark. But within the Spark code base, I don't think we still need it to annotate methods. We should also deprecate keyword_only in Spark 3.1 so developers know to switch.

> Use keyword-only syntax for keyword_only methods
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Minor
> Fix For: 3.1.0
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}
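The difference between the two snippets above can be demonstrated without Spark at all. This toy function (names are illustrative) shows how the `*` marker moves the keyword-only guarantee from the @keyword_only runtime decorator into the language itself:

```python
def init_with_star(*, degree=2, inputCol=None, outputCol=None):
    """PEP 3102 keyword-only parameters: everything after `*` must be named."""
    return {"degree": degree, "inputCol": inputCol, "outputCol": outputCol}

# Keyword calls work as before...
params = init_with_star(degree=3, inputCol="features")

# ...but positional calls now fail at call time with a TypeError, which is
# exactly the guarantee @keyword_only used to enforce by hand at runtime.
try:
    init_with_star(3)
except TypeError:
    pass
```

The `*` form is also visible to IDEs and static type checkers, which the decorator-based check never was.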
[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166532#comment-17166532 ] Xiangrui Meng edited comment on SPARK-32429 at 7/28/20, 4:30 PM: - [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application requests only one GPU type. Then we will need some configuration to tell which resource name should be used to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value. Not sure if the same setting is used in YARN/k8s.

was (Author: mengxr): [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application requests only one GPU type. Then we will need some configuration to tell which resource name should be used to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value.
> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. > * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166532#comment-17166532 ] Xiangrui Meng commented on SPARK-32429: --- [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application requests only one GPU type. Then we will need some configuration to tell which resource name should be used to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value.

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> -
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> It would be nice if standalone mode could allow users to set CUDA_VISIBLE_DEVICES before launching an executor. This has multiple benefits.
> * kind of an isolation in that the executor can only see the GPUs set there.
> * If your GPU application doesn't support explicitly setting the GPU device id, setting this will make any GPU look like the default (id 0) and things generally just work without any explicit setting
> * New features are being added on newer GPUs that require explicit setting of CUDA_VISIBLE_DEVICES like MIG ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small, once we set them we would also possibly need to change the gpu addresses as it changes them to start from device id 0 again.
> The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165903#comment-17165903 ] Xiangrui Meng commented on SPARK-32429: --- A couple of questions: 1. Which GPU resource name do we use? "spark.task.resource.gpu" does not have special meaning in the current implementation. 2. I think we can do this for PySpark workers if 1) gets resolved. However, for executors running inside the same JVM, is there a way to set CUDA_VISIBLE_DEVICES differently per executor thread?

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> -
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> It would be nice if standalone mode could allow users to set CUDA_VISIBLE_DEVICES before launching an executor. This has multiple benefits.
> * kind of an isolation in that the executor can only see the GPUs set there.
> * If your GPU application doesn't support explicitly setting the GPU device id, setting this will make any GPU look like the default (id 0) and things generally just work without any explicit setting
> * New features are being added on newer GPUs that require explicit setting of CUDA_VISIBLE_DEVICES like MIG ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small, once we set them we would also possibly need to change the gpu addresses as it changes them to start from device id 0 again.
> The easiest implementation would just specifically support this and have it behind a config and set when the config is on and GPU resources are allocated.
> Note we probably want to have this same thing set when we launch a python process as well so that it gets same env.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
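The two environment variables discussed in this thread can be combined into a small launch helper. A minimal sketch, assuming a comma-separated device-address format; the helper name is hypothetical, not Spark code:

```python
import os

def gpu_isolated_env(gpu_addresses, base_env=None):
    """Build a child-process environment that only exposes the assigned GPUs.

    CUDA_DEVICE_ORDER=PCI_BUS_ID keeps device ordering consistent across
    processes, and CUDA_VISIBLE_DEVICES hides every other device, so the
    visible GPUs are renumbered starting from id 0 inside the child.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(a) for a in gpu_addresses)
    return env
```

The resulting dict could then be passed as `env=` to `subprocess.Popen` when launching an executor or a Python worker, so both see the same restricted device set.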
[jira] [Created] (SPARK-31777) CrossValidator supports user-supplied folds
Xiangrui Meng created SPARK-31777: - Summary: CrossValidator supports user-supplied folds Key: SPARK-31777 URL: https://issues.apache.org/jira/browse/SPARK-31777 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.1.0 Reporter: Xiangrui Meng

As a user, I can specify how CrossValidator should create folds by supplying a foldCol, which should be of integer type with values in [0, numFolds). If foldCol is specified, Spark won't do a random k-fold split. This is useful when there is custom logic for creating folds, e.g., splitting randomly by user instead of by event. This is similar to SPARK-16206, which is for the RDD-based APIs.
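As a sketch of the "split by user instead of by event" use case above, the snippet below derives a deterministic fold value per user, so every event belonging to one user lands in the same fold. The helper is illustrative; in Spark the foldCol would be built with DataFrame expressions:

```python
import zlib

NUM_FOLDS = 3

def fold_for_user(user_id, num_folds=NUM_FOLDS):
    """Map a user id to a stable fold in [0, num_folds).

    CrossValidator would then hold out fold i while training on the rest,
    instead of performing its own random k-fold split over individual events.
    """
    # crc32 is deterministic across processes, unlike Python's builtin hash().
    return zlib.crc32(str(user_id).encode("utf-8")) % num_folds
```

Attaching this value as the foldCol guarantees no user's events leak between the training and validation sides of any fold.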
[jira] [Updated] (SPARK-31776) Literal lit() supports lists and numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31776: -- Issue Type: Improvement (was: New Feature)

> Literal lit() supports lists and numpy arrays
> -
>
> Key: SPARK-31776
> URL: https://issues.apache.org/jira/browse/SPARK-31776
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> In ML workloads, it is common to replace null feature vectors with some default value. However, lit() does not support Python lists or numpy arrays as input. Users cannot simply use fillna() to get the job done.
[jira] [Created] (SPARK-31776) Literal lit() supports lists and numpy arrays
Xiangrui Meng created SPARK-31776: - Summary: Literal lit() supports lists and numpy arrays Key: SPARK-31776 URL: https://issues.apache.org/jira/browse/SPARK-31776 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Xiangrui Meng

In ML workloads, it is common to replace null feature vectors with some default value. However, lit() does not support Python lists or numpy arrays as input. Users cannot simply use fillna() to get the job done.
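To illustrate the gap, here is a plain-Python stand-in for the fillna()-style operation users actually want. The helper is hypothetical; with native lit() support for lists, the same replacement would be a one-liner on a DataFrame column:

```python
def fill_null_feature_vectors(rows, default_vector):
    """Replace missing (None) feature vectors with a default vector.

    This is the semantic of fillna() for array-valued columns, which today
    cannot be expressed because lit() rejects list/array defaults.
    """
    return [list(default_vector) if row is None else row for row in rows]
```

For example, `fill_null_feature_vectors([None, [1.0, 2.0]], [0.0, 0.0])` replaces only the missing row while leaving real vectors untouched.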
[jira] [Created] (SPARK-31775) Support tensor type (TensorType) in Spark SQL/DataFrame
Xiangrui Meng created SPARK-31775: - Summary: Support tensor type (TensorType) in Spark SQL/DataFrame Key: SPARK-31775 URL: https://issues.apache.org/jira/browse/SPARK-31775 Project: Spark Issue Type: New Feature Components: ML, SQL Affects Versions: 3.1.0 Reporter: Xiangrui Meng More and more DS/ML workloads are dealing with tensors. For example, a decoded color image can be represented by a 3D tensor. It would be nice to natively support tensor type. A local tensor is essentially an array with shape info, stored together. Native support is better than user-defined type (UDT) because PyArrow does not support UDTs but it already supports tensors. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
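The "array with shape info, stored together" framing above can be sketched in a few lines. This toy class is purely illustrative (not a proposed implementation): it pairs a flat value buffer with a shape and resolves multi-dimensional indices in row-major order:

```python
class LocalTensor:
    """A flat value buffer paired with a shape — the minimal tensor layout."""

    def __init__(self, values, shape):
        values = list(values)
        size = 1
        for dim in shape:
            size *= dim
        if len(values) != size:
            raise ValueError(f"expected {size} values for shape {shape}")
        self.values = values
        self.shape = tuple(shape)

    def __getitem__(self, index):
        # Row-major (C-order) flattening of a multi-dimensional index,
        # the same layout PyArrow uses for its fixed-shape tensors.
        flat = 0
        for i, dim in zip(index, self.shape):
            flat = flat * dim + i
        return self.values[flat]
```

For a decoded color image, `shape` would be something like `(height, width, 3)` with the pixel data stored contiguously in `values`.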
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Issue Type: Improvement (was: Bug)

> Expose hashFuncVersion property in HashingTF
>
> Key: SPARK-31610
> URL: https://issues.apache.org/jira/browse/SPARK-31610
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 3.0.0
>
> Expose the hashFuncVersion property in HashingTF.
> Some third-party libraries, such as MLeap, need to access it.
> See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Priority: Major (was: Critical) > Expose hashFuncVersion property in HashingTF > > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Expose hashFuncVersion property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Description: Expose hashFuncVersion property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 was: Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 > Expose hashFuncVersion property in HashingTF > > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 3.0.0 > > > Expose hashFuncVersion property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-31610: - Assignee: Weichen Xu > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31668) Saving and loading HashingTF leads to hash function changed
[ https://issues.apache.org/jira/browse/SPARK-31668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-31668. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28413 [https://github.com/apache/spark/pull/28413]

> Saving and loading HashingTF leads to hash function changed
> ---
>
> Key: SPARK-31668
> URL: https://issues.apache.org/jira/browse/SPARK-31668
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.0.0, 3.0.1, 3.1.0
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Blocker
> Fix For: 3.0.0
>
> If we save a HashingTF with Spark 2.x, load it with Spark 3.0, save it again with Spark 3.0, and then load it once more with Spark 3.0, the hash function will have changed.
> This bug is hard to debug; we need to fix it.
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Summary: Expose hashFuncVersion property in HashingTF (was: Expose hashFunc property in HashingTF) > Expose hashFuncVersion property in HashingTF > > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 3.0.0 > > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-31610. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28413 [https://github.com/apache/spark/pull/28413] > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 3.0.0 > > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Issue Type: Bug (was: Improvement) > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Priority: Critical (was: Major) > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
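The SPARK-31668/SPARK-31610 thread above boils down to one idea: persist which hash implementation a HashingTF model was fitted with, and dispatch on that version after loading, instead of silently switching implementations between Spark versions. A hypothetical sketch — the version numbers and hash stand-ins below are illustrative, not Spark's actual hash functions:

```python
import zlib

HASH_FUNCS = {
    1: lambda term: sum(term.encode("utf-8")),         # stand-in for the legacy (2.x) hash
    2: lambda term: zlib.crc32(term.encode("utf-8")),  # stand-in for the newer (3.x) hash
}

class HashingTFSketch:
    def __init__(self, num_features=1 << 4, hash_func_version=2):
        self.num_features = num_features
        self.hash_func_version = hash_func_version

    def index_of(self, term):
        return HASH_FUNCS[self.hash_func_version](term) % self.num_features

    def save(self):
        # The hash version travels with the model metadata...
        return {"numFeatures": self.num_features,
                "hashFuncVersion": self.hash_func_version}

    @classmethod
    def load(cls, metadata):
        # ...so a reloaded model keeps hashing terms exactly the same way,
        # no matter how many save/load round trips it goes through.
        return cls(metadata["numFeatures"], metadata["hashFuncVersion"])
```

Exposing the version as a readable property (the SPARK-31610 ask) then lets external serializers like MLeap carry the same guarantee.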
[jira] [Updated] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31549: -- Target Version/s: 3.0.0 > Pyspark SparkContext.cancelJobGroup do not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time. It happens because the PySpark thread is not pinned > to a JVM thread when invoking Java-side methods, which causes all PySpark > APIs that rely on Java thread-local variables to misbehave. (Including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` > and so on.) > This is a serious issue. There is an experimental pyspark 'PIN_THREAD' mode > added in spark-3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default. We need to set an additional environment variable > to enable it. > * There is a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely > on the `sc.cancelJobGroup` API (they use it to stop running jobs in their code). So it > is critical to address this issue, and we hope it works under the default pyspark > mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
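The thread-pinning problem described in the issue above is easy to reproduce outside Spark. The following toy sketch (plain Python, not Spark code; all function names are hypothetical) mimics how a job group stored in a thread-local on the JVM side becomes invisible when the actual Java call runs on a different thread than the one that set the property:

```python
import threading

# Toy stand-in for the JVM side: the job group is stored per *JVM* thread.
jvm_local = threading.local()

def set_job_group(group):
    jvm_local.group = group

def get_job_group():
    return getattr(jvm_local, "group", None)

def call_via_other_thread(fn, *args):
    # Mimics unpinned PySpark: the Java-side call may execute on a
    # different thread than the Python caller's original thread.
    result = []
    t = threading.Thread(target=lambda: result.append(fn(*args)))
    t.start()
    t.join()
    return result[0]

# Pinned behavior (same thread): the property set earlier is visible.
set_job_group("g1")
assert get_job_group() == "g1"

# Unpinned behavior (different thread): the thread-local is empty there,
# so a cancelJobGroup-style call would target the wrong (missing) group.
assert call_via_other_thread(get_job_group) is None
```

This is why `PIN_THREAD` mode, which maps each Python thread to a dedicated JVM thread, fixes the family of APIs built on `sc.setLocalProperty`.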
[jira] [Resolved] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-31497. --- Resolution: Fixed Issue resolved by pull request 28279 [https://github.com/apache/spark/pull/28279] > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model > -- > > Key: SPARK-31497 > URL: https://issues.apache.org/jira/browse/SPARK-31497 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model. > Reproduction code, run in the pyspark shell: > 1) Train and save the model in pyspark: > {code:python} > from pyspark.ml import Pipeline > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.feature import HashingTF, Tokenizer > from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, > ParamGridBuilder > training = spark.createDataFrame([ > (0, "a b c d e spark", 1.0), > (1, "b d", 0.0), > (2, "spark f g h", 1.0), > (3, "hadoop mapreduce", 0.0), > (4, "b spark who", 1.0), > (5, "g d a y", 0.0), > (6, "spark fly", 1.0), > (7, "was mapreduce", 0.0), > (8, "e spark program", 1.0), > (9, "a e c l", 0.0), > (10, "spark compile", 1.0), > (11, "hadoop software", 0.0) > ], ["id", "text", "label"]) > # Configure an ML pipeline, which consists of three stages: tokenizer, > hashingTF, and lr. 
> tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression(maxIter=10) > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ > .addGrid(lr.regParam, [0.1, 0.01]) \ > .build() > crossval = CrossValidator(estimator=pipeline, > estimatorParamMaps=paramGrid, > evaluator=BinaryClassificationEvaluator(), > numFolds=2) # use 3+ folds in practice > # Run cross-validation, and choose the best set of parameters. > cvModel = crossval.fit(training) > cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error > {code} > 2) Train a cross-validation model in Scala with similar code as above, save it to > '/tmp/model_cv_scala001', and run the following code in pyspark: > {code:python} > from pyspark.ml.tuning import CrossValidatorModel > CrossValidatorModel.load('/tmp/model_cv_scala001') # raises an error > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31497: -- Fix Version/s: 3.0.0 > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model > -- > > Key: SPARK-31497 > URL: https://issues.apache.org/jira/browse/SPARK-31497 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model. > Reproduction code, run in the pyspark shell: > 1) Train and save the model in pyspark: > {code:python} > from pyspark.ml import Pipeline > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.feature import HashingTF, Tokenizer > from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, > ParamGridBuilder > training = spark.createDataFrame([ > (0, "a b c d e spark", 1.0), > (1, "b d", 0.0), > (2, "spark f g h", 1.0), > (3, "hadoop mapreduce", 0.0), > (4, "b spark who", 1.0), > (5, "g d a y", 0.0), > (6, "spark fly", 1.0), > (7, "was mapreduce", 0.0), > (8, "e spark program", 1.0), > (9, "a e c l", 0.0), > (10, "spark compile", 1.0), > (11, "hadoop software", 0.0) > ], ["id", "text", "label"]) > # Configure an ML pipeline, which consists of three stages: tokenizer, > hashingTF, and lr. 
> tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression(maxIter=10) > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ > .addGrid(lr.regParam, [0.1, 0.01]) \ > .build() > crossval = CrossValidator(estimator=pipeline, > estimatorParamMaps=paramGrid, > evaluator=BinaryClassificationEvaluator(), > numFolds=2) # use 3+ folds in practice > # Run cross-validation, and choose the best set of parameters. > cvModel = crossval.fit(training) > cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error > {code} > 2) Train a cross-validation model in Scala with similar code as above, save it to > '/tmp/model_cv_scala001', and run the following code in pyspark: > {code:python} > from pyspark.ml.tuning import CrossValidatorModel > CrossValidatorModel.load('/tmp/model_cv_scala001') # raises an error > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31497: -- Target Version/s: 3.0.0 > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model > -- > > Key: SPARK-31497 > URL: https://issues.apache.org/jira/browse/SPARK-31497 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model. > Reproduction code, run in the pyspark shell: > 1) Train and save the model in pyspark: > {code:python} > from pyspark.ml import Pipeline > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.feature import HashingTF, Tokenizer > from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, > ParamGridBuilder > training = spark.createDataFrame([ > (0, "a b c d e spark", 1.0), > (1, "b d", 0.0), > (2, "spark f g h", 1.0), > (3, "hadoop mapreduce", 0.0), > (4, "b spark who", 1.0), > (5, "g d a y", 0.0), > (6, "spark fly", 1.0), > (7, "was mapreduce", 0.0), > (8, "e spark program", 1.0), > (9, "a e c l", 0.0), > (10, "spark compile", 1.0), > (11, "hadoop software", 0.0) > ], ["id", "text", "label"]) > # Configure an ML pipeline, which consists of three stages: tokenizer, > hashingTF, and lr. 
> tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression(maxIter=10) > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ > .addGrid(lr.regParam, [0.1, 0.01]) \ > .build() > crossval = CrossValidator(estimator=pipeline, > estimatorParamMaps=paramGrid, > evaluator=BinaryClassificationEvaluator(), > numFolds=2) # use 3+ folds in practice > # Run cross-validation, and choose the best set of parameters. > cvModel = crossval.fit(training) > cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error > {code} > 2) Train a cross-validation model in Scala with similar code as above, save it to > '/tmp/model_cv_scala001', and run the following code in pyspark: > {code:python} > from pyspark.ml.tuning import CrossValidatorModel > CrossValidatorModel.load('/tmp/model_cv_scala001') # raises an error > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046795#comment-17046795 ] Xiangrui Meng commented on SPARK-30969: --- [~Ngone51] [~jiangxb1987] Is there a JIRA to deprecate multiple workers running on the same host? Could you create one and link it here? I think we should deprecate it in 3.0. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Environment: (was: Resource coordination is used for the case where multiple workers run on the same host. However, it should be a rare or even impossible use case in current Standalone (which already allows multiple executors in a single worker). We should remove support for it to simplify the implementation and reduce the potential maintenance cost in the future.) > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Priority: Critical (was: Major) > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Priority: Critical > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30969: - Assignee: wuyi > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Assignee: wuyi >Priority: Critical > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Description: Resource coordination is used for the case where multiple workers run on the same host. However, it should be a rare or even impossible use case in current Standalone (which already allows multiple executors in a single worker). We should remove support for it to simplify the implementation and reduce the potential maintenance cost in the future. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Fix Version/s: (was: 3.0.0) > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-30667: --- > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-30667. --- Resolution: Fixed > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30667: - Assignee: Sarth Frey > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Target Version/s: 3.0.0 (was: 3.1.0) > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Fix Version/s: 3.0.0 > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30762: -- Component/s: PySpark > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30762: - Assignee: Liang Zhang > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Description: Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks can see all IP addresses from BarrierTaskContext. It would be simpler to integrate with distributed frameworks like TensorFlow DistributionStrategy if we provide an all-gather that lets tasks share additional information with others, e.g., an available port. Note that with all-gather, tasks share their IP addresses as well. {code} port = ... # get an available port ports = context.all_gather(port) # get all available ports, ordered by task ID ... # set up distributed training service {code} was: Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks can see all IP addresses from BarrierTaskContext. It would be simpler to integrate with distributed frameworks like TensorFlow DistributionStrategy if we provide an all-gather that lets tasks share additional information with others, e.g., an available port. Note that with all-gather, tasks share their IP addresses as well. > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... 
# set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30667) Support simple all gather in barrier task context
Xiangrui Meng created SPARK-30667: - Summary: Support simple all gather in barrier task context Key: SPARK-30667 URL: https://issues.apache.org/jira/browse/SPARK-30667 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks can see all IP addresses from BarrierTaskContext. It would be simpler to integrate with distributed frameworks like TensorFlow DistributionStrategy if we provide an all-gather that lets tasks share additional information with others, e.g., an available port. Note that with all-gather, tasks share their IP addresses as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
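The all-gather semantics proposed in this issue (each task contributes one value and every task receives the full list, ordered by task ID) can be sketched as a toy thread-based simulation. This is an illustration of the intended semantics only, not Spark's actual BarrierTaskContext implementation; the function names are hypothetical:

```python
import threading

def run_all_gather(num_tasks, contribute):
    """Toy model of barrier all-gather: every 'task' (thread) contributes
    one message and receives all messages ordered by task ID."""
    slots = [None] * num_tasks
    barrier = threading.Barrier(num_tasks)   # all tasks must rendezvous
    results = [None] * num_tasks

    def task(tid):
        slots[tid] = contribute(tid)   # e.g., an available port
        barrier.wait()                 # no task proceeds until all have contributed
        results[tid] = list(slots)     # every task sees the same ordered list

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Four "tasks" each publish a port; all of them see the full ordered list.
gathered = run_all_gather(4, lambda tid: 5000 + tid)
assert all(r == [5000, 5001, 5002, 5003] for r in gathered)
```

The ordering guarantee is what makes the snippet in the issue description work: task 0 can deterministically pick, say, `ports[0]` as the coordinator address.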
[jira] [Resolved] (SPARK-30154) PySpark UDF to convert MLlib vectors to dense arrays
[ https://issues.apache.org/jira/browse/SPARK-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-30154. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26910 [https://github.com/apache/spark/pull/26910] > PySpark UDF to convert MLlib vectors to dense arrays > > > Key: SPARK-30154 > URL: https://issues.apache.org/jira/browse/SPARK-30154 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame > into dense arrays, an efficient approach is to do that in the JVM. However, it > requires the PySpark user to write Scala code and register it as a UDF. Often > this is infeasible for a pure Python project. > What we can do is to predefine those converters in Scala and expose them in > PySpark, e.g.: > {code} > from pyspark.ml.functions import vector_to_dense_array > df.select(vector_to_dense_array(col("features"))) > {code} > cc: [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30154) PySpark UDF to convert MLlib vectors to dense arrays
[ https://issues.apache.org/jira/browse/SPARK-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30154: -- Description: If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. What we can do is to predefine those converters in Scala and expose them in PySpark, e.g.: {code} from pyspark.ml.functions import vector_to_dense_array df.select(vector_to_dense_array(col("features"))) {code} cc: [~weichenxu123] was: If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient method is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. What we can do is to predefine those converters in Scala and expose them in PySpark, e.g.: {code} from pyspark.ml.functions import vector_to_dense_array df.select(vector_to_dense_array(col("features"))) {code} cc: [~weichenxu123] > PySpark UDF to convert MLlib vectors to dense arrays > > > Key: SPARK-30154 > URL: https://issues.apache.org/jira/browse/SPARK-30154 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame > into dense arrays, an efficient approach is to do that in the JVM. However, it > requires the PySpark user to write Scala code and register it as a UDF. Often > this is infeasible for a pure Python project. 
> What we can do is to predefine those converters in Scala and expose them in > PySpark, e.g.: > {code} > from pyspark.ml.functions import vector_to_dense_array > df.select(vector_to_dense_array(col("features"))) > {code} > cc: [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30154) PySpark UDF to convert MLlib vectors to dense arrays
[ https://issues.apache.org/jira/browse/SPARK-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30154: -- Summary: PySpark UDF to convert MLlib vectors to dense arrays (was: Allow PySpark code efficiently convert MLlib vectors to dense arrays) > PySpark UDF to convert MLlib vectors to dense arrays > > > Key: SPARK-30154 > URL: https://issues.apache.org/jira/browse/SPARK-30154 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame > into dense arrays, an efficient method is to do that in the JVM. However, it > requires a PySpark user to write Scala code and register it as a UDF. Often > this is infeasible for a pure Python project. > What we can do is to predefine those converters in Scala and expose them in > PySpark, e.g.: > {code} > from pyspark.ml.functions import vector_to_dense_array > df.select(vector_to_dense_array(col("features"))) > {code} > cc: [~weichenxu123]
[jira] [Created] (SPARK-30154) Allow PySpark code efficiently convert MLlib vectors to dense arrays
Xiangrui Meng created SPARK-30154: - Summary: Allow PySpark code efficiently convert MLlib vectors to dense arrays Key: SPARK-30154 URL: https://issues.apache.org/jira/browse/SPARK-30154 Project: Spark Issue Type: New Feature Components: ML, MLlib, PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient method is to do that in the JVM. However, it requires a PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. What we can do is to predefine those converters in Scala and expose them in PySpark, e.g.: {code} from pyspark.ml.functions import vector_to_dense_array df.select(vector_to_dense_array(col("features"))) {code} cc: [~weichenxu123]
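For context, the per-row conversion such a UDF would perform can be sketched in plain Python, with no JVM involved. `sparse_to_dense` below is a hypothetical stand-in for the proposed Scala-side converter, not part of any Spark API:

```python
# An MLlib SparseVector stores (size, indices, values); the dense form
# expands it to a full list of floats. Illustrative stand-in only.
def sparse_to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

print(sparse_to_dense(5, [1, 3], [2.0, 4.5]))  # [0.0, 2.0, 0.0, 4.5, 0.0]
```

Doing this in the JVM avoids a Python round-trip per row, which is the efficiency argument in the description above.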
[jira] [Resolved] (SPARK-28978) PySpark: Can't pass more than 256 arguments to a UDF
[ https://issues.apache.org/jira/browse/SPARK-28978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28978. --- Fix Version/s: 3.0.0 Assignee: Bago Amirbekian Resolution: Fixed > PySpark: Can't pass more than 256 arguments to a UDF > > > Key: SPARK-28978 > URL: https://issues.apache.org/jira/browse/SPARK-28978 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0, 2.4.4 >Reporter: Jim Fulton >Assignee: Bago Amirbekian >Priority: Major > Labels: koalas, mlflow, pyspark > Fix For: 3.0.0 > > > This code: > [https://github.com/apache/spark/blob/712874fa0937f0784f47740b127c3bab20da8569/python/pyspark/worker.py#L367-L379] > Creates Python lambdas that call UDF functions passing arguments singly, > rather than using varargs. For example: `lambda a: f(a[0], a[1], ...)`. > This fails when there are more than 256 arguments. > mlflow, when generating model predictions, uses an argument for each feature > column. I have a model with > 500 features. > I was able to easily hack around this by changing the generated lambdas to > use varargs, as in `lambda a: f(*a)`. > IDK why these lambdas were created the way they were. Using varargs is much > simpler and works fine in my testing. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
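The varargs workaround described in the report can be illustrated without Spark. Here `f` is a placeholder for the wrapped UDF function, and the row width is arbitrary:

```python
# Sketch of the fix: generate `lambda a: f(*a)` instead of
# `lambda a: f(a[0], a[1], ...)`, which fails past 256 explicit
# arguments on affected CPython versions. `f` stands in for the UDF.
def f(*cols):
    return sum(cols)

wide_row = tuple(range(300))   # a row with 300 "feature columns"
call = lambda a: f(*a)         # varargs unpacking has no width limit
print(call(wide_row))          # 44850
```

The limit applies only to call sites that spell out each argument; star-unpacking sidesteps it entirely, which is why the one-line change works.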
[jira] [Resolved] (SPARK-29417) Resource Scheduling - add TaskContext.resource java api
[ https://issues.apache.org/jira/browse/SPARK-29417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-29417. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26083 [https://github.com/apache/spark/pull/26083] > Resource Scheduling - add TaskContext.resource java api > --- > > Key: SPARK-29417 > URL: https://issues.apache.org/jira/browse/SPARK-29417 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > I noticed the TaskContext.resource() api we added returns a scala Map. This > isn't very nice for the java api usage, so we should add an api that returns > a java Map. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-28206: - Assignee: Hyukjin Kwon > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Resolved] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28206. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25060 [https://github.com/apache/spark/pull/25060] > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Attachment: Screen Shot 2019-06-28 at 9.55.13 AM.png > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Issue Type: Bug (was: Documentation) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc (was: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html (was: "@" is rendered as ":" in doctest) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Created] (SPARK-28206) "@" is rendered as ":" in doctest
Xiangrui Meng created SPARK-28206: - Summary: "@" is rendered as ":" in doctest Key: SPARK-28206 URL: https://issues.apache.org/jira/browse/SPARK-28206 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 2.4.1 Reporter: Xiangrui Meng Just noticed that in [pandas_udf API doc |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], "@pandas_udf" is rendered as ":pandas_udf". cc: [~hyukjin.kwon] [~smilegator]
[jira] [Resolved] (SPARK-28115) Fix flaky test: SparkContextSuite.test resource scheduling under local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-28115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28115. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24917 [https://github.com/apache/spark/pull/24917] > Fix flaky test: SparkContextSuite.test resource scheduling under > local-cluster mode > --- > > Key: SPARK-28115 > URL: https://issues.apache.org/jira/browse/SPARK-28115 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > This test suite has two kind of failures. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6486/testReport/org.apache.spark/SparkContextSuite/test_resource_scheduling_under_local_cluster_mode/history/ > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3921 times over > 1.88681567 minutes. Last failure message: > Array(org.apache.spark.SparkExecutorInfoImpl@533793be, > org.apache.spark.SparkExecutorInfoImpl@5018e57d, > org.apache.spark.SparkExecutorInfoImpl@dea2485, > org.apache.spark.SparkExecutorInfoImpl@9e63ecd) had size 4 instead of > expected size 3. 
> at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) > at > org.apache.spark.SparkContextSuite.eventually(SparkContextSuite.scala:50) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336) > at > org.apache.spark.SparkContextSuite.eventually(SparkContextSuite.scala:50) > at > org.apache.spark.SparkContextSuite.$anonfun$new$93(SparkContextSuite.scala:885){code} > {code} > org.scalatest.exceptions.TestFailedException: Array("0", "0", "1", "1", "1", > "2", "2", "2", "2") did not equal List("0", "0", "0", "1", "1", "1", "2", > "2", "2") > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.SparkContextSuite.$anonfun$new$93(SparkContextSuite.scala:894) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28056) Document SCALAR_ITER Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28056. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24897 [https://github.com/apache/spark/pull/24897] > Document SCALAR_ITER Pandas UDF > --- > > Key: SPARK-28056 > URL: https://issues.apache.org/jira/browse/SPARK-28056 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users > can discover the feature and learn how to use it.
[jira] [Assigned] (SPARK-28056) Document SCALAR_ITER Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-28056: - Assignee: Xiangrui Meng > Document SCALAR_ITER Pandas UDF > --- > > Key: SPARK-28056 > URL: https://issues.apache.org/jira/browse/SPARK-28056 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users > can discover the feature and learn how to use it.
[jira] [Resolved] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-26412. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24643 [https://github.com/apache/spark/pull/24643] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workloads. However, a user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, the user needs to repeatedly load > the same model for every batch in the same Python worker process, which is > inefficient. > We can provide users an iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if the UDF is called with a single non-struct-type column > * a tuple of pd.Series if the UDF is called with more than one Spark DF column > * a pd.DataFrame if the UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ...
# load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, the user should see an error. We don't require that every yield > match the input batch size. > Another benefit: with the iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
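Stripped of Spark and pandas, the load-once pattern the iterator UDF enables looks like this in pure Python (the "model" here is a trivial stand-in for an expensive load):

```python
# A generator mimicking a SCALAR_ITER UDF: the expensive setup runs once
# per worker process, then every batch in the iterator reuses it.
def predict(batch_iter):
    model_offset = 10                     # stands in for loading a ~100MB model
    for batch in batch_iter:              # each batch ~ a pd.Series of values
        yield [x + model_offset for x in batch]

batches = [[1, 2], [3, 4]]
print(list(predict(iter(batches))))       # [[11, 12], [13, 14]]
```

Because `predict` is a generator over the whole partition rather than a per-batch function, the setup cost is amortized across all batches, which is the core motivation stated above.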
[jira] [Created] (SPARK-28056) Document SCALAR_ITER Pandas UDF
Xiangrui Meng created SPARK-28056: - Summary: Document SCALAR_ITER Pandas UDF Key: SPARK-28056 URL: https://issues.apache.org/jira/browse/SPARK-28056 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users can discover the feature and learn how to use it.
[jira] [Resolved] (SPARK-28030) Binary file data source doesn't support space in file names
[ https://issues.apache.org/jira/browse/SPARK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28030. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24855 [https://github.com/apache/spark/pull/24855] > Binary file data source doesn't support space in file names > --- > > Key: SPARK-28030 > URL: https://issues.apache.org/jira/browse/SPARK-28030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > {code} > echo 123 > "/tmp/test space.txt" > spark.read.format("binaryFile").load("/tmp/test space.txt").count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs
[ https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27823: - Assignee: Xiangrui Meng (was: Thomas Graves) > Add an abstraction layer for accelerator resource handling to avoid > manipulating raw confs > -- > > Key: SPARK-27823 > URL: https://issues.apache.org/jira/browse/SPARK-27823 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > In SPARK-27488, we extract resource requests and allocation by parsing raw > Spark confs. This hurts readability because we didn't have the abstraction at > resource level. After we merge the core changes, we should do a refactoring > and make the code more readable. > See https://github.com/apache/spark/pull/24615#issuecomment-494580663. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28030) Binary file data source doesn't support space in file names
[ https://issues.apache.org/jira/browse/SPARK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28030: -- Description: {code} echo 123 > "/tmp/test space.txt" spark.read.format("binaryFile").load("/tmp/test space.txt").count() {code} > Binary file data source doesn't support space in file names > --- > > Key: SPARK-28030 > URL: https://issues.apache.org/jira/browse/SPARK-28030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > {code} > echo 123 > "/tmp/test space.txt" > spark.read.format("binaryFile").load("/tmp/test space.txt").count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28030) Binary file data source doesn't support space in file names
Xiangrui Meng created SPARK-28030: - Summary: Binary file data source doesn't support space in file names Key: SPARK-28030 URL: https://issues.apache.org/jira/browse/SPARK-28030 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861353#comment-16861353 ] Xiangrui Meng commented on SPARK-27360: --- Done. Thanks for taking the task! > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 2. auto discovery of GPUs > 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-27999) setup resources when Standalone Worker starts up
[ https://issues.apache.org/jira/browse/SPARK-27999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng deleted SPARK-27999: -- > setup resources when Standalone Worker starts up > > > Key: SPARK-27999 > URL: https://issues.apache.org/jira/browse/SPARK-27999 > Project: Spark > Issue Type: Sub-task >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27372) Standalone executor process-level isolation to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27372: -- Issue Type: Story (was: Sub-task) Parent: (was: SPARK-27360) > Standalone executor process-level isolation to support GPU scheduling > - > > Key: SPARK-27372 > URL: https://issues.apache.org/jira/browse/SPARK-27372 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > As an admin, I can configure standalone to have multiple executor processes > on the same worker node and processes are configured via cgroups so they only > have access to assigned GPUs. So I don't need to worry about resource > contention between processes on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27371) Standalone master receives resource info from worker and allocate driver/executor properly
[ https://issues.apache.org/jira/browse/SPARK-27371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27371: -- Summary: Standalone master receives resource info from worker and allocate driver/executor properly (was: Master receives resource info from worker and allocate driver/executor properly) > Standalone master receives resource info from worker and allocate > driver/executor properly > -- > > Key: SPARK-27371 > URL: https://issues.apache.org/jira/browse/SPARK-27371 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > As an admin, I can let Spark Standalone worker automatically discover GPUs > installed on worker nodes. So I don't need to manually configure them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27371) Master receives resource info from worker and allocate driver/executor properly
[ https://issues.apache.org/jira/browse/SPARK-27371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27371: -- Summary: Master receives resource info from worker and allocate driver/executor properly (was: Standalone worker can auto discover GPUs) > Master receives resource info from worker and allocate driver/executor > properly > --- > > Key: SPARK-27371 > URL: https://issues.apache.org/jira/browse/SPARK-27371 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > As an admin, I can let Spark Standalone worker automatically discover GPUs > installed on worker nodes. So I don't need to manually configure them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-27370) spark-submit requests GPUs in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng deleted SPARK-27370: -- > spark-submit requests GPUs in standalone mode > - > > Key: SPARK-27370 > URL: https://issues.apache.org/jira/browse/SPARK-27370 > Project: Spark > Issue Type: Sub-task >Reporter: Xiangrui Meng >Priority: Major > > As a user, I can use spark-submit to request GPUs per task in standalone mode > when I submit an Spark application. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27369: -- Summary: Standalone worker can load resource conf and discover resources (was: Standalone support static conf to describe GPU resources) > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster mode, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources in creating new executors or drivers. Example conf: {code} # static worker conf spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh # application conf spark.driver.resource.gpu.amount=4 spark.executor.resource.gpu.amount=2 spark.task.resource.gpu.amount=1 {code} In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Supporting resource isolation for multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads `work.source.*` conf and runs discovery scripts to discover resources. 3. Worker reports cores, memory, and resources (new) to the master and registers. 4. An application starts.
5. Master finds workers with sufficient available resources and lets them start executor or driver processes. 6. Worker assigns executor/driver resources by passing the resource info on the command line. 7. Application ends. 8. Master requests the worker to kill the driver/executor process. 9. Master updates available resources. was: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role that allocates resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to use the same resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf:
{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.executor.resource.gpu.count=1
{code}
In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, as YARN does for GPU support, which is less ideal. Supporting isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads the `spark.worker.resource.*` conf and runs discovery scripts to discover resources. 3.
Worker reports to master cores, memory, and resources (new) and registers. 4. An application starts. 5. Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create on
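The discovery script referenced by spark.worker.resource.gpu.discoveryScript is user-supplied. A minimal sketch of what /path/to/list-gpus.sh could look like, assuming the script is expected to print a single JSON object naming the resource and listing its addresses (the nvidia-smi invocation and the exact output shape here are illustrative assumptions, not the final contract):

```python
#!/usr/bin/env python3
# Hypothetical GPU discovery script for the standalone worker.
# Prints one JSON object such as {"name": "gpu", "addresses": ["0", "1"]}.
import json
import subprocess


def list_gpu_addresses():
    """Return GPU indices reported by nvidia-smi, or [] if it is unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            text=True)
        return [line.strip() for line in out.splitlines() if line.strip()]
    except (OSError, subprocess.CalledProcessError):
        # No driver / no GPUs on this worker: report an empty address list.
        return []


if __name__ == "__main__":
    print(json.dumps({"name": "gpu", "addresses": list_gpu_addresses()}))
```

The worker would run this once at startup (step 2 of the timeline) and report the discovered addresses to the master along with cores and memory.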
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role that allocates resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to use the same resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf:
{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.executor.resource.gpu.count=1
{code}
In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, as YARN does for GPU support, which is less ideal. Supporting isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads the `spark.worker.resource.*` conf and runs discovery scripts to discover resources. 3. Worker reports cores, memory, and resources (new) to the master and registers. 4. An application starts. 5.
Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. was: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests. Besides CPU cores and memory, worker now also considers resources in creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, driver process is not launched by worker. So user can specify driver resource discovery script. In cluster mode, if user still specify driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Support resource isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads `work.source.*` conf and runs discovery scripts to discover resources. 3. 
Worker reports to master cores, memory, and resources (new) and registers. 4. An application starts. 5. Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluste
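Step 5 of the timeline above, where the master matches a resource request against workers with sufficient free capacity, can be illustrated with a small sketch. The dictionary shapes and field names below are assumptions for illustration, not Spark's actual internal data structures:

```python
def pick_workers(workers, cores, memory_mb, resources):
    """Return the workers that can satisfy an executor/driver request.

    `workers` is a list of dicts like
        {"id": "w1", "free_cores": 8, "free_memory_mb": 16384,
         "free_resources": {"gpu": ["0", "1"]}}
    and `resources` maps resource name -> requested amount, e.g. {"gpu": 2}.
    """
    def fits(w):
        return (w["free_cores"] >= cores
                and w["free_memory_mb"] >= memory_mb
                # Every requested resource must have enough free addresses.
                and all(len(w["free_resources"].get(name, [])) >= amount
                        for name, amount in resources.items()))

    return [w for w in workers if fits(w)]
```

After the master picks a worker, the worker passes the assigned resource addresses to the new executor or driver process on the command line (step 6), and returns them to the free pool when the process is killed (steps 8 and 9).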
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests. Besides CPU cores and memory, worker now also considers resources in creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, driver process is not launched by worker. So user can specify driver resource discovery script. In cluster mode, if user still specify driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Support resource isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads `work.source.*` conf and runs discovery scripts to discover resources. 3. Worker reports to master cores, memory, and resources (new) and registers. 4. An application starts. 5. 
Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. was: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests. Besides CPU cores and memory, worker now also considers resources in creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, driver process is not launched by worker. So user can specify driver resource discovery script. In cluster mode, if user still specify driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Support resource isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. 
> Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > * Resource isolation is not considered here. > Because executor and driver processes on the same node will share the > accelerator resources, worker must take the role that allocates resources. So > we will add spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. User need to match
[jira] [Resolved] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row
[ https://issues.apache.org/jira/browse/SPARK-27968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27968. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24816 [https://github.com/apache/spark/pull/24816] > ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row > - > > Key: SPARK-27968 > URL: https://issues.apache.org/jira/browse/SPARK-27968 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > An issue mentioned here: > https://github.com/apache/spark/pull/24734/files#r288377915, could be > decoupled from that PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-26412: -- Description: Pandas UDF is the ideal connection between PySpark and DL model inference workloads. However, the user needs to load the model file first to make predictions, and it is common to see models of ~100MB or bigger. If the Pandas UDF execution is limited to each batch, the user needs to repeatedly load the same model for every batch in the same Python worker process, which is inefficient. We can provide users an iterator of batches as pd.DataFrame and let user code handle it:
{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
    model = ...  # load model
    for batch in batch_iter:
        yield model.predict(batch)
{code}
The type of each batch is: * a pd.Series if the UDF is called with a single non-struct-type column * a tuple of pd.Series if the UDF is called with more than one Spark DataFrame column * a pd.DataFrame if the UDF is called with a single StructType column Examples:
{code}
@pandas_udf(...)
def evaluate(batch_iter):
    model = ...  # load model
    for features, label in batch_iter:
        pred = model.predict(features)
        yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}
{code}
@pandas_udf(...)
def evaluate(pdf_iter):
    model = ...  # load model
    for pdf in pdf_iter:
        pred = model.predict(pdf['x'])
        yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}
If the UDF doesn't return the same total number of records for the entire partition, the user should see an error. We don't require that every yield match the input batch size. Another benefit: with the iterator interface and Python's asyncio, users have the flexibility to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] was: Pandas UDF is the ideal connection between PySpark and DL model inference workload.
However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * a pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. 
If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Exa
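The data pipelining mentioned above (overlapping CPU-side batch preparation with model prediction) can also be sketched without asyncio, by wrapping the batch iterator in a background prefetch thread. The `prefetch` helper and the UDF usage below are illustrative, not an API being proposed:

```python
import queue
import threading


def prefetch(batch_iter, buffer_size=2):
    """Consume batch_iter on a background thread so the next batch is being
    fetched/converted while the caller is still predicting on the current one."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the iterator

    def producer():
        for batch in batch_iter:
            q.put(batch)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


# Inside a SCALAR_ITER Pandas UDF this could be used as (sketch):
#
# @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
# def predict(batch_iter):
#     model = load_model()  # loaded once per Python worker
#     for batch in prefetch(batch_iter):
#         yield model.predict(batch)
```

A production version would also propagate exceptions from the producer thread to the consumer; the sketch omits that for brevity.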
[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-26412: -- Description: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * a pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] was: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. 
If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. 
If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if predict is called with more than one Spark DF > columns > * a pd.DataFrame if predict is called with a single StructType column > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pre
[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-26412: -- Description: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] was: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. 
If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} We might add a contract that each yield must match the corresponding batch size. Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if predict is called with more than one Spark DF > columns > * a pd.DataFrame if predict is called with a single StructType column > {code} > @pandas_udf(...) 
> def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
[jira] [Created] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row
Xiangrui Meng created SPARK-27968: - Summary: ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row Key: SPARK-27968 URL: https://issues.apache.org/jira/browse/SPARK-27968 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng An issue mentioned here: https://github.com/apache/spark/pull/24734/files#r288377915, could be decoupled from that PR.
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role that allocates resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to use the same resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf:
{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.worker.resource.gpu.count=1
{code}
In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, as YARN does for GPU support, which is less ideal. Supporting isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. was: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node.
This is an undocumented use of standalone mode, which is mainly for tests. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests and they don't need to specify discovery scripts separately. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > * Resource isolation is not considered here. > Because executor and driver processes on the same node will share the > accelerator resources, worker must take the role that allocates resources. So > we will add spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. User need to match the resourceName in driver > and executor requests. Besides CPU cores and memory, worker now also > considers resources in creating new executors or drivers. > Example conf: > {code} > spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh > spark.driver.resource.gpu.count=4 > spark.worker.resource.gpu.count=1 > {code} > In client mode, driver process is not launched by worker. So user can specify > driver resource discovery script. In cluster mode, if user still specify > driver resource discovery script, it is ignored with a warning. 
> Supporting resource isolation is tricky because Spark worker doesn't know how > to isolate resources unless we hardcode some resource names like GPU support > in YARN, which is less ideal. Support resource isolation of multiple resource > types is even harder. In the first version, we will implement accelerator > support without resource isolation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
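To make the discovery-script contract above concrete, here is a minimal sketch of what `/path/to/list-gpus.sh` could look like, written as a Python script. It assumes the script must print a single JSON object naming the resource and listing its addresses (the format Spark's resource discovery expects); the two hard-coded GPU indices are hypothetical placeholders for real device enumeration, e.g. via `nvidia-smi`.

```python
#!/usr/bin/env python3
# Sketch of a worker-side resource discovery script. The worker runs it and
# parses its stdout to learn which GPU addresses it can hand to executors.
import json

def discover_gpus():
    # Hypothetical: a real script would enumerate devices, for example by
    # shelling out to `nvidia-smi --query-gpu=index --format=csv,noheader`.
    addresses = ["0", "1"]
    return {"name": "gpu", "addresses": addresses}

if __name__ == "__main__":
    # Prints: {"name": "gpu", "addresses": ["0", "1"]}
    print(json.dumps(discover_gpus()))
```

With `spark.worker.resource.gpu.count=1`, the worker would advertise only one of the discovered addresses; mismatches between the configured count and the script's output are the worker's responsibility to reconcile.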
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368:
--
Description:
Design draft:

Scenarios:
* client-mode, the worker might create one or more executor processes, from different Spark applications.
* cluster-mode, the worker might create the driver process as well.
* local-cluster mode, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests.

Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and don't need to specify discovery scripts separately.

> Design: Standalone supports GPU scheduling
> --
>
> Key: SPARK-27368
> URL: https://issues.apache.org/jira/browse/SPARK-27368
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
>
> Design draft:
>
> Scenarios:
> * client-mode, the worker might create one or more executor processes, from different Spark applications.
> * cluster-mode, the worker might create the driver process as well.
> * local-cluster mode, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests.
>
> Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and don't need to specify discovery scripts separately.
[jira] [Resolved] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27366.
---
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24374
[https://github.com/apache/spark/pull/24374]

> Spark scheduler internal changes to support GPU scheduling
> --
>
> Key: SPARK-27366
> URL: https://issues.apache.org/jira/browse/SPARK-27366
> Project: Spark
> Issue Type: Story
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xingbo Jiang
> Priority: Major
> Fix For: 3.0.0
>
> Update the Spark job scheduler to support accelerator resource requests submitted at the application level.
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855838#comment-16855838 ] Xiangrui Meng commented on SPARK-27888:
---
It would be nice if we could find some PySpark users who have already migrated from Python 2 to 3, so they can tell us what issues users should expect and what benefits the migration brings.

> Python 2->3 migration guide for PySpark users
> --
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It doesn't need to be comprehensive given the many Python 2->3 migration guides around. We just need some pointers and list items that are specific to PySpark.
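As an illustration of the kind of PySpark-specific item such a guide might list (a hypothetical entry, not taken from the issue): the change in integer division can silently alter numeric results in lambdas ported unchanged from Python 2 into `rdd.map`-style transformations.

```python
# Hypothetical guide entry: Python 2's `/` floors when both operands are
# integers, while Python 3's `/` always performs true division. A lambda such
# as rdd.map(lambda x: x / 2), ported unchanged, yields floats under Python 3;
# use `//` to preserve the Python 2 result.

values = [7, 8, 9]
py3_division = [x / 2 for x in values]   # true division: [3.5, 4.0, 4.5]
py2_behavior = [x // 2 for x in values]  # floor division, as Python 2: [3, 4, 4]
```

The same logic applies to any user-defined function shipped to executors, so the change can affect job output without raising any error.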
[jira] [Assigned] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27884:
--
Assignee: Xiangrui Meng

> Deprecate Python 2 support in Spark 3.0
> --
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
> Issue Type: Story
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Labels: release-notes
>
> Officially deprecate Python 2 support in Spark 3.0.
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855832#comment-16855832 ] Xiangrui Meng commented on SPARK-27888:
---
This JIRA is not to inform users that Python 2 is deprecated. It provides info to help users migrate to Python 3, though I don't know if there are things specific to PySpark. I think it should go in the user guide rather than the release notes.

> Python 2->3 migration guide for PySpark users
> --
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It doesn't need to be comprehensive given the many Python 2->3 migration guides around. We just need some pointers and list items that are specific to PySpark.
[jira] [Resolved] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27886.
---
Resolution: Done

> Add Apache Spark project to https://python3statement.org/
> --
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I reviewed the statement at https://python3statement.org/. Most projects listed there will *drop* Python 2 support before 2020/01/01 instead of merely deprecating it, with only [one exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. We certainly cannot drop Python 2 support in 2019 given we haven't deprecated it yet.
>
> Maybe we can add the following timeline:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
>
> The switch would come in the next release after Spark 3.0. If we want to hold another release before then, it would be in September or October.