[jira] [Comment Edited] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582071#comment-17582071 ] Xiangrui Meng edited comment on SPARK-38648 at 8/19/22 10:55 PM: - I had an offline discussion with [~leewyang]. Summary: We might not need to introduce a new package in Spark with dependencies on DL frameworks. Instead, we can provide abstractions in pyspark.ml that implement the common data operations needed by DL inference, e.g., batching, tensor conversion, and pipelining. For example, we could define the following API (just to illustrate the idea, not to propose the final API):
{code:python}
def dl_model_udf(
    predict_fn: Callable[[pd.DataFrame], pd.DataFrame],  # need to discuss the data format
    batch_size: int,
    input_tensor_shapes: Dict[str, List[int]],
    output_data_type,
    preprocess_fn,
    ...
) -> PandasUDF
{code}
Users only need to supply predict_fn, which could return a (wrapped) TensorFlow model, a PyTorch model, or an MLflow model. Users are responsible for package dependency management and model loading logic. This doesn't cover everything proposed in the original SPIP, but it does save users the boilerplate of creating batches over Iterator[DataFrame], converting 1-d arrays to tensors, and asynchronously overlapping preprocessing (CPU) with prediction (GPU). If we go in this direction, I don't feel the change needs an SPIP because it introduces neither a new Spark package nor new dependencies. It is just a wrapper over pandas_udf for DL inference.
> SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications. You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1.
Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010)
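To make the boilerplate savings mentioned in the comment concrete, here is a minimal sketch (hypothetical helper names, not the proposed API) of the two data operations such a wrapper would hide from users: re-batching an iterator of arbitrarily sized pandas DataFrames to a fixed batch size, and stacking a column of flattened 1-d arrays back into a tensor of a declared shape:

```python
from typing import Iterator, List

import numpy as np
import pandas as pd

def rebatch(it: Iterator[pd.DataFrame], batch_size: int) -> Iterator[pd.DataFrame]:
    """Yield fixed-size batches from an iterator of arbitrarily sized DataFrames."""
    buf: List[pd.DataFrame] = []
    n = 0
    for df in it:
        buf.append(df)
        n += len(df)
        while n >= batch_size:
            merged = pd.concat(buf, ignore_index=True)
            yield merged.iloc[:batch_size]
            rest = merged.iloc[batch_size:]
            buf, n = [rest], len(rest)
    if n > 0:
        yield pd.concat(buf, ignore_index=True)

def to_tensor(col: pd.Series, shape: List[int]) -> np.ndarray:
    """Stack a column of flattened 1-d arrays into an (n, *shape) tensor."""
    return np.stack([np.asarray(v).reshape(shape) for v in col])

# Two uneven input batches (5 rows + 2 rows) re-batched to size 3.
batches = list(rebatch(iter([pd.DataFrame({"x": range(5)}),
                             pd.DataFrame({"x": range(5, 7)})]), batch_size=3))
# Two flattened 4-element arrays restored to 2x2 tensors.
tensor = to_tensor(pd.Series([[1, 2, 3, 4], [5, 6, 7, 8]]), [2, 2])
```

A predict_fn would then only see uniformly sized batches and properly shaped tensors, regardless of how Spark partitioned the input.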
[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528862#comment-17528862 ] Xiangrui Meng commented on SPARK-38648: --- I think it is beneficial to both Spark and DL frameworks if Spark has state-of-the-art DL capabilities. We did some work in the past to make Spark work better with DL frameworks, e.g., the iterator Scalar Pandas UDF, barrier mode, and GPU scheduling. But most of them are low-level APIs for developers, not end users. Our Spark user guide contains little about DL and AI. The dependency on DL frameworks might create issues. One idea is to develop in the Spark repo and Spark namespace but publish to PyPI independently. For example, in order to use DL features, users would need to explicitly install `pyspark-dl` and then use the features under the `pyspark.dl` namespace. Putting development inside Spark and publishing under the Spark namespace would help drive both development and adoption. > SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications.
You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1. Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-37004: -- Description: Spark 3.2.0 turned on py4j pinned thread mode by default (SPARK-35303). However, in a Jupyter notebook, after I cancel (interrupt) a long-running Spark job, the next Spark command fails with py4j errors. See the attached notebook for a repro. I cannot reproduce the issue after turning off pinned thread mode. > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default (SPARK-35303). > However, in a Jupyter notebook, after I cancel (interrupt) a long-running > Spark job, the next Spark command fails with py4j errors. See the attached > notebook for a repro. I cannot reproduce the issue after turning off pinned > thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
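For anyone reproducing this, the workaround referenced above (turning pinned thread mode off) is controlled by the PYSPARK_PIN_THREAD environment variable; a minimal sketch, assuming it is set before the py4j gateway is created (e.g., at the top of the notebook, before any Spark import):

```python
import os

# Disable py4j pinned thread mode (on by default since Spark 3.2.0,
# SPARK-35303). Must be set before the JVM gateway is launched.
os.environ["PYSPARK_PIN_THREAD"] = "false"
```

Setting the variable in an already-running Spark session has no effect; the driver must be restarted.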
[jira] [Updated] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-37004: -- Issue Type: Bug (was: Improvement) > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default. However, in a Jupyter > notebook, after I cancel (interrupt) a long-running Spark job, the next Spark > command fails with py4j errors. See the attached notebook for a repro. > I cannot reproduce the issue after turning off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-37004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-37004: -- Attachment: pinned.ipynb > Job cancellation causes py4j errors on Jupyter due to pinned thread mode > > > Key: SPARK-37004 > URL: https://issues.apache.org/jira/browse/SPARK-37004 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > Attachments: pinned.ipynb > > > Spark 3.2.0 turned on py4j pinned thread mode by default. However, in a Jupyter > notebook, after I cancel (interrupt) a long-running Spark job, the next Spark > command fails with py4j errors. See the attached notebook for a repro. > I cannot reproduce the issue after turning off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37004) Job cancellation causes py4j errors on Jupyter due to pinned thread mode
Xiangrui Meng created SPARK-37004: - Summary: Job cancellation causes py4j errors on Jupyter due to pinned thread mode Key: SPARK-37004 URL: https://issues.apache.org/jira/browse/SPARK-37004 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Xiangrui Meng Spark 3.2.0 turned on py4j pinned thread mode by default. However, in a Jupyter notebook, after I cancel (interrupt) a long-running Spark job, the next Spark command fails with py4j errors. See the attached notebook for a repro. I cannot reproduce the issue after turning off pinned thread mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-36578: - Assignee: Huaxin Gao > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: Xiangrui Meng >Assignee: Huaxin Gao >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404095#comment-17404095 ] Xiangrui Meng commented on SPARK-36578: --- [~huaxingao] I saw some possible improvements to the API doc. It would be great if you could address them when you get some time. Not blocking the 3.2 release. > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
Xiangrui Meng created SPARK-36578: - Summary: Minor UnivariateFeatureSelector API doc improvement Key: SPARK-36578 URL: https://issues.apache.org/jira/browse/SPARK-36578 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.2.0 Reporter: Xiangrui Meng * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with "The user can set ...". The first sentence in the ScalaDoc should always describe what this class is. * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-36578: -- Affects Version/s: 3.1.1 > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36578) Minor UnivariateFeatureSelector API doc improvement
[ https://issues.apache.org/jira/browse/SPARK-36578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-36578: -- Target Version/s: (was: 3.2.0) > Minor UnivariateFeatureSelector API doc improvement > --- > > Key: SPARK-36578 > URL: https://issues.apache.org/jira/browse/SPARK-36578 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > * The ScalaDoc and Python doc of the UnivariateFeatureSelector start with > "The user can set ...". The first sentence in the ScalaDoc should always > describe what this class is. > * `ANOVATest` is not public. We should just say ANOVA F-test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34415) Use randomization as a possibly better technique than grid search in optimizing hyperparameters
[ https://issues.apache.org/jira/browse/SPARK-34415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400644#comment-17400644 ] Xiangrui Meng commented on SPARK-34415: --- [~phenry] [~srowen] The implementation doesn't do uniform sampling of the hyper-parameter search space. Instead, it samples each param independently and then constructs the cartesian product of all combinations. I think this would significantly reduce the effectiveness of the random search. Was it already discussed? > Use randomization as a possibly better technique than grid search in > optimizing hyperparameters > --- > > Key: SPARK-34415 > URL: https://issues.apache.org/jira/browse/SPARK-34415 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 3.0.1 >Reporter: Phillip Henry >Assignee: Phillip Henry >Priority: Minor > Labels: pull-request-available > Fix For: 3.2.0 > > > Randomization can be a more effective technique than a grid search in > finding optimal hyperparameters since min/max points can fall between the > grid lines and never be found. Randomization is not so restricted, although > the probability of finding minima/maxima depends on the number of > attempts. > Alice Zheng has an accessible description of how this technique works at > [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html] > (Note that I have a PR for this work outstanding at > [https://github.com/apache/spark/pull/31535] ) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
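The difference the comment raises can be sketched in a few lines (illustrative only; the function names and the two hyperparameters are hypothetical, not Spark's API). Sampling per param and then taking the cartesian product produces a randomized grid built from only a handful of distinct values per axis, whereas true random search draws each candidate point independently from the full space:

```python
import itertools
import random

random.seed(0)

def per_param_then_cartesian(n_per_param):
    # What the comment says the implementation does: sample each
    # hyperparameter independently, then cross all the samples.
    lr = [random.uniform(1e-4, 1e-1) for _ in range(n_per_param)]
    reg = [random.uniform(0.0, 1.0) for _ in range(n_per_param)]
    return list(itertools.product(lr, reg))

def joint_uniform(n_points):
    # What random search normally assumes: each candidate point is drawn
    # uniformly from the full search space, independently of the others.
    return [(random.uniform(1e-4, 1e-1), random.uniform(0.0, 1.0))
            for _ in range(n_points)]

grid = per_param_then_cartesian(3)  # 9 candidates built from only 3 lr values
points = joint_uniform(9)           # 9 candidates with 9 distinct lr values
```

For the same budget of 9 evaluations, the cartesian variant explores only 3 distinct values along each axis, which is exactly the grid-search weakness randomization is meant to avoid.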
[jira] [Resolved] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24374. --- Fix Version/s: 2.4.0 Target Version/s: 2.4.0 Resolution: Fixed I'm marking this epic jira as done given the major feature was implemented in 2.4. > SPIP: Support Barrier Execution Mode in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Fix For: 2.4.0 > > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
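The scheduling model quoted above (all tasks launch together, then pass messages) can be illustrated with a conceptual sketch; this uses Python's threading.Barrier as a stand-in for Spark's barrier stage, so it shows the semantics only, not Spark's actual API:

```python
import threading

# Conceptual sketch: n_tasks "tasks" all launch, rendezvous at a barrier,
# and only then begin their compute/message-passing phase -- the property
# MPI-style workloads need and ordinary MapReduce scheduling doesn't give.
n_tasks = 4
barrier = threading.Barrier(n_tasks)
events = []
lock = threading.Lock()

def task(i):
    with lock:
        events.append(("launched", i))
    barrier.wait()  # no task proceeds until all n_tasks have arrived
    with lock:
        events.append(("compute", i))

threads = [threading.Thread(target=task, args=(i,)) for i in range(n_tasks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because of the barrier, every "launched" event precedes every "compute" event, mirroring the all-or-nothing gang scheduling the SPIP describes (and the stage-level restart on failure follows from the same property).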
[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263006#comment-17263006 ] Xiangrui Meng commented on SPARK-34080: --- Not sure if we have time for the 3.1.1 release. But if there are other release blockers, it would be great if we could get the changes in without having to deprecate the APIs later. > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide the score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical), and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions compatible > with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
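The default-lookup behavior the description proposes can be sketched as follows; this is illustrative only (the mapping and names here are assumptions for the example, not the finalized Spark defaults): the (featureType, labelType) combo picks a default score function so users need not know the statistical test names:

```python
# Hypothetical default table keyed on (featureType, labelType); combos
# without an entry (e.g. categorical feature with continuous label here)
# would require the user to supply a score function explicitly.
DEFAULT_SCORE_FN = {
    ("categorical", "categorical"): "chi-squared",
    ("continuous", "categorical"): "ANOVA F-test",
    ("continuous", "continuous"): "F-value (regression)",
}

def default_score_function(feature_type: str, label_type: str) -> str:
    key = (feature_type, label_type)
    if key not in DEFAULT_SCORE_FN:
        raise ValueError(f"no default score function for combo {key}")
    return DEFAULT_SCORE_FN[key]
```

Advanced users overriding the default, or type detection from feature metadata, would simply bypass or pre-populate this lookup.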
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Affects Version/s: 3.1.1 > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0, 3.1.1 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide the score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical), and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions compatible > with the feature type and label type combo). Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Description: In SPARK-26111, we introduced a few univariate feature selectors, which share a common set of params. They are named after the underlying test, which requires users to understand the test to find the matching scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to provide the score function to use, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override it (if there are multiple score functions compatible with the feature type and label type combo). Example (param names are not finalized): {code} selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", select="bestK", k=100) {code} cc: [~huaxingao] [~ruifengz] [~weichenxu123] > Add UnivariateFeatureSelector to deprecate existing selectors > - > > Key: SPARK-34080 > URL: https://issues.apache.org/jira/browse/SPARK-34080 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-26111, we introduced a few univariate feature selectors, which share > a common set of params. They are named after the underlying test, which > requires users to understand the test to find the matching scenario. It would > be nice to introduce a single class called UnivariateFeatureSelector that > accepts a selection criterion and a score method (string names). Then we can > deprecate all other univariate selectors. > For the params, instead of asking users to provide the score function to use, > it is friendlier to ask them to specify the feature and label types > (continuous or categorical), and we set a default score function for each > combo. We can also detect the types from feature metadata if given. Advanced > users can override it (if there are multiple score functions compatible > with the feature type and label type combo). > Example (param names > are not finalized): > {code} > selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], > labelCol=["target"], featureType="categorical", labelType="continuous", > select="bestK", k=100) > {code} > cc: [~huaxingao] [~ruifengz] [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Description: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
cc: [~huaxingao] [~ruifengz] [~weichenxu123]

was: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
cc: [~huaxingao] @huaxin

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 3.2.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors.
> For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo).
> Example (param names are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]
[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
[ https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-34080: -- Description: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
cc: [~huaxingao] @huaxin

was: In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 3.2.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors.
> For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo).
> Example (param names are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
> {code}
> cc: [~huaxingao] @huaxin
[jira] [Created] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors
Xiangrui Meng created SPARK-34080: - Summary: Add UnivariateFeatureSelector to deprecate existing selectors Key: SPARK-34080 URL: https://issues.apache.org/jira/browse/SPARK-34080 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.2.0 Reporter: Xiangrui Meng

In SPARK-26111, we introduced a few univariate feature selectors that share a common set of params. They are named after the underlying statistical tests, which forces users to understand each test to find the one matching their scenario. It would be nice to introduce a single class called UnivariateFeatureSelector that accepts a selection criterion and a score method (as string names). Then we can deprecate all other univariate selectors. For the params, instead of asking users to choose a score function, it is friendlier to ask them to specify the feature and label types (continuous or categorical), and we set a default score function for each combo. We can also detect the types from feature metadata if given. Advanced users can override the default (if multiple score functions are compatible with the feature type and label type combo). Example (param names are not finalized):
{code}
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous")
{code}
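To make the type-combo idea above concrete, here is a minimal, hypothetical sketch of the default score-function lookup the proposal describes. The combo-to-test mapping and the `resolve_score_function` helper are illustrative assumptions, not the final Spark API.

```python
# Hypothetical sketch (not the final Spark API): the default score function is
# looked up from the (featureType, labelType) combo, and advanced users may
# override it when several score functions fit the same combo.
DEFAULT_SCORE_FUNCTIONS = {
    ("categorical", "categorical"): "chi-squared test",
    ("continuous", "categorical"): "ANOVA F-test",
    ("continuous", "continuous"): "F-value (regression)",
}

def resolve_score_function(feature_type, label_type, override=None):
    """Return the score function name for a feature/label type combo."""
    if override is not None:
        return override
    combo = (feature_type, label_type)
    if combo not in DEFAULT_SCORE_FUNCTIONS:
        raise ValueError(f"no default score function for combo {combo}")
    return DEFAULT_SCORE_FUNCTIONS[combo]
```

This keeps the common path declarative (users state their data types) while still leaving an escape hatch for users who know exactly which test they want.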
[jira] [Commented] (SPARK-32933) Use keyword-only syntax for keyword_only methods
[ https://issues.apache.org/jira/browse/SPARK-32933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200580#comment-17200580 ] Xiangrui Meng commented on SPARK-32933: --- Why do we keep using the @keyword_only annotation after switching to the `*` syntax? I understand that we do not want to remove the keyword_only definition at https://github.com/apache/spark/blob/master/python/pyspark/__init__.py#L101 because it is widely used outside Spark. But within the Spark code base, I don't think we still need it to annotate methods. We should also deprecate keyword_only in Spark 3.1 so developers know to switch.

> Use keyword-only syntax for keyword_only methods
>
> Key: SPARK-32933
> URL: https://issues.apache.org/jira/browse/SPARK-32933
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Minor
> Fix For: 3.1.0
>
> Since 3.0, Python provides syntax for indicating keyword-only arguments ([PEP 3102|https://www.python.org/dev/peps/pep-3102/]).
> It is not a full replacement for our current usage of {{keyword_only}}, but it would allow us to make our expectations explicit:
> {code:python}
> @keyword_only
> def __init__(self, degree=2, inputCol=None, outputCol=None):
> {code}
> {code:python}
> @keyword_only
> def __init__(self, *, degree=2, inputCol=None, outputCol=None):
> {code}
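The difference between the two snippets above can be demonstrated without Spark at all. This toy function (names are illustrative) shows how the `*` marker moves the keyword-only guarantee from the @keyword_only runtime decorator into the language itself:

```python
def init_with_star(*, degree=2, inputCol=None, outputCol=None):
    """PEP 3102 keyword-only parameters: everything after `*` must be named."""
    return {"degree": degree, "inputCol": inputCol, "outputCol": outputCol}

# Keyword calls work as before...
params = init_with_star(degree=3, inputCol="features")

# ...but positional calls now fail at call time with a TypeError, which is
# exactly the guarantee @keyword_only used to enforce by hand at runtime.
try:
    init_with_star(3)
except TypeError:
    pass
```

The `*` form is also visible to IDEs and static type checkers, which the decorator-based check never was.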
[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166532#comment-17166532 ] Xiangrui Meng edited comment on SPARK-32429 at 7/28/20, 4:30 PM: - [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application requests only one GPU type. Then we will need some configuration to tell which resource name should be used to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value. Not sure if the same setting is used in YARN/k8s.

was (Author: mengxr): [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application requests only one GPU type. Then we will need some configuration to tell which resource name should be used to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value.
> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. > * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166532#comment-17166532 ] Xiangrui Meng commented on SPARK-32429: --- [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application requests only one GPU type. Then we will need some configuration to tell which resource name should be used to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to get consistent device ordering between different processes, even when CUDA_VISIBLE_DEVICES is set to the same value.

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> -
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> It would be nice if standalone mode could allow users to set CUDA_VISIBLE_DEVICES before launching an executor. This has multiple benefits.
> * kind of an isolation in that the executor can only see the GPUs set there.
> * If your GPU application doesn't support explicitly setting the GPU device id, setting this will make any GPU look like the default (id 0) and things generally just work without any explicit setting
> * New features are being added on newer GPUs that require explicit setting of CUDA_VISIBLE_DEVICES like MIG ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small, once we set them we would also possibly need to change the gpu addresses as it changes them to start from device id 0 again.
> The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165903#comment-17165903 ] Xiangrui Meng commented on SPARK-32429: --- A couple of questions: 1. Which GPU resource name do we use? "spark.task.resource.gpu" does not have special meaning in the current implementation. 2. I think we can do this for PySpark workers if 1) gets resolved. However, for executors running inside the same JVM, is there a way to set CUDA_VISIBLE_DEVICES differently per executor thread?

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> -
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> It would be nice if standalone mode could allow users to set CUDA_VISIBLE_DEVICES before launching an executor. This has multiple benefits.
> * kind of an isolation in that the executor can only see the GPUs set there.
> * If your GPU application doesn't support explicitly setting the GPU device id, setting this will make any GPU look like the default (id 0) and things generally just work without any explicit setting
> * New features are being added on newer GPUs that require explicit setting of CUDA_VISIBLE_DEVICES like MIG ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small, once we set them we would also possibly need to change the gpu addresses as it changes them to start from device id 0 again.
> The easiest implementation would just specifically support this and have it behind a config and set when the config is on and GPU resources are allocated.
> Note we probably want to have this same thing set when we launch a python process as well so that it gets same env.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
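The two environment variables discussed in this thread can be combined into a small launch helper. A minimal sketch, assuming a comma-separated device-address format; the helper name is hypothetical, not Spark code:

```python
import os

def gpu_isolated_env(gpu_addresses, base_env=None):
    """Build a child-process environment that only exposes the assigned GPUs.

    CUDA_DEVICE_ORDER=PCI_BUS_ID keeps device ordering consistent across
    processes, and CUDA_VISIBLE_DEVICES hides every other device, so the
    visible GPUs are renumbered starting from id 0 inside the child.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(a) for a in gpu_addresses)
    return env
```

The resulting dict could then be passed as `env=` to `subprocess.Popen` when launching an executor or a Python worker, so both see the same restricted device set.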
[jira] [Created] (SPARK-31777) CrossValidator supports user-supplied folds
Xiangrui Meng created SPARK-31777: - Summary: CrossValidator supports user-supplied folds Key: SPARK-31777 URL: https://issues.apache.org/jira/browse/SPARK-31777 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.1.0 Reporter: Xiangrui Meng

As a user, I can specify how CrossValidator should create folds by supplying a foldCol, which should be of integer type with values in [0, numFolds). If foldCol is specified, Spark won't do a random k-fold split. This is useful when there is custom logic for creating folds, e.g., splitting randomly by user instead of by event. This is similar to SPARK-16206, which is for the RDD-based APIs.
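As a sketch of the "split by user instead of by event" use case above, the snippet below derives a deterministic fold value per user, so every event belonging to one user lands in the same fold. The helper is illustrative; in Spark the foldCol would be built with DataFrame expressions:

```python
import zlib

NUM_FOLDS = 3

def fold_for_user(user_id, num_folds=NUM_FOLDS):
    """Map a user id to a stable fold in [0, num_folds).

    CrossValidator would then hold out fold i while training on the rest,
    instead of performing its own random k-fold split over individual events.
    """
    # crc32 is deterministic across processes, unlike Python's builtin hash().
    return zlib.crc32(str(user_id).encode("utf-8")) % num_folds
```

Attaching this value as the foldCol guarantees no user's events leak between the training and validation sides of any fold.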
[jira] [Updated] (SPARK-31776) Literal lit() supports lists and numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31776: -- Issue Type: Improvement (was: New Feature)

> Literal lit() supports lists and numpy arrays
> -
>
> Key: SPARK-31776
> URL: https://issues.apache.org/jira/browse/SPARK-31776
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> In ML workloads, it is common to replace null feature vectors with some default value. However, lit() does not support Python lists or numpy arrays as input. Users cannot simply use fillna() to get the job done.
[jira] [Created] (SPARK-31776) Literal lit() supports lists and numpy arrays
Xiangrui Meng created SPARK-31776: - Summary: Literal lit() supports lists and numpy arrays Key: SPARK-31776 URL: https://issues.apache.org/jira/browse/SPARK-31776 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Xiangrui Meng

In ML workloads, it is common to replace null feature vectors with some default value. However, lit() does not support Python lists or numpy arrays as input. Users cannot simply use fillna() to get the job done.
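To illustrate the gap, here is a plain-Python stand-in for the fillna()-style operation users actually want. The helper is hypothetical; with native lit() support for lists, the same replacement would be a one-liner on a DataFrame column:

```python
def fill_null_feature_vectors(rows, default_vector):
    """Replace missing (None) feature vectors with a default vector.

    This is the semantic of fillna() for array-valued columns, which today
    cannot be expressed because lit() rejects list/array defaults.
    """
    return [list(default_vector) if row is None else row for row in rows]
```

For example, `fill_null_feature_vectors([None, [1.0, 2.0]], [0.0, 0.0])` replaces only the missing row while leaving real vectors untouched.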
[jira] [Created] (SPARK-31775) Support tensor type (TensorType) in Spark SQL/DataFrame
Xiangrui Meng created SPARK-31775: - Summary: Support tensor type (TensorType) in Spark SQL/DataFrame Key: SPARK-31775 URL: https://issues.apache.org/jira/browse/SPARK-31775 Project: Spark Issue Type: New Feature Components: ML, SQL Affects Versions: 3.1.0 Reporter: Xiangrui Meng More and more DS/ML workloads are dealing with tensors. For example, a decoded color image can be represented by a 3D tensor. It would be nice to natively support tensor type. A local tensor is essentially an array with shape info, stored together. Native support is better than user-defined type (UDT) because PyArrow does not support UDTs but it already supports tensors. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
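The "array with shape info, stored together" framing above can be sketched in a few lines. This toy class is purely illustrative (not a proposed implementation): it pairs a flat value buffer with a shape and resolves multi-dimensional indices in row-major order:

```python
class LocalTensor:
    """A flat value buffer paired with a shape — the minimal tensor layout."""

    def __init__(self, values, shape):
        values = list(values)
        size = 1
        for dim in shape:
            size *= dim
        if len(values) != size:
            raise ValueError(f"expected {size} values for shape {shape}")
        self.values = values
        self.shape = tuple(shape)

    def __getitem__(self, index):
        # Row-major (C-order) flattening of a multi-dimensional index,
        # the same layout PyArrow uses for its fixed-shape tensors.
        flat = 0
        for i, dim in zip(index, self.shape):
            flat = flat * dim + i
        return self.values[flat]
```

For a decoded color image, `shape` would be something like `(height, width, 3)` with the pixel data stored contiguously in `values`.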
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Issue Type: Improvement (was: Bug)

> Expose hashFuncVersion property in HashingTF
>
> Key: SPARK-31610
> URL: https://issues.apache.org/jira/browse/SPARK-31610
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 3.0.0
>
> Expose the hashFuncVersion property in HashingTF.
> Some third-party libraries, such as MLeap, need to access it.
> See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Priority: Major (was: Critical) > Expose hashFuncVersion property in HashingTF > > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Expose hashFuncVersion property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Description: Expose hashFuncVersion property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 was: Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 > Expose hashFuncVersion property in HashingTF > > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 3.0.0 > > > Expose hashFuncVersion property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-31610: - Assignee: Weichen Xu > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31668) Saving and loading HashingTF leads to hash function changed
[ https://issues.apache.org/jira/browse/SPARK-31668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-31668. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28413 [https://github.com/apache/spark/pull/28413]

> Saving and loading HashingTF leads to hash function changed
> ---
>
> Key: SPARK-31668
> URL: https://issues.apache.org/jira/browse/SPARK-31668
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.0.0, 3.0.1, 3.1.0
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Blocker
> Fix For: 3.0.0
>
> If we save a HashingTF with Spark 2.x, load it with Spark 3.0, save it again with Spark 3.0, and then load it once more with Spark 3.0, the hash function will have changed.
> This bug is hard to debug; we need to fix it.
[jira] [Updated] (SPARK-31610) Expose hashFuncVersion property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Summary: Expose hashFuncVersion property in HashingTF (was: Expose hashFunc property in HashingTF) > Expose hashFuncVersion property in HashingTF > > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 3.0.0 > > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-31610. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28413 [https://github.com/apache/spark/pull/28413] > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Critical > Fix For: 3.0.0 > > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Issue Type: Bug (was: Improvement) > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31610) Expose hashFunc property in HashingTF
[ https://issues.apache.org/jira/browse/SPARK-31610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31610: -- Priority: Critical (was: Major) > Expose hashFunc property in HashingTF > - > > Key: SPARK-31610 > URL: https://issues.apache.org/jira/browse/SPARK-31610 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Expose hashFunc property in HashingTF > Some third-party library such as mleap need to access it. > See background description here: > https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
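The SPARK-31668/SPARK-31610 thread above boils down to one idea: persist which hash implementation a HashingTF model was fitted with, and dispatch on that version after loading, instead of silently switching implementations between Spark versions. A hypothetical sketch — the version numbers and hash stand-ins below are illustrative, not Spark's actual hash functions:

```python
import zlib

HASH_FUNCS = {
    1: lambda term: sum(term.encode("utf-8")),         # stand-in for the legacy (2.x) hash
    2: lambda term: zlib.crc32(term.encode("utf-8")),  # stand-in for the newer (3.x) hash
}

class HashingTFSketch:
    def __init__(self, num_features=1 << 4, hash_func_version=2):
        self.num_features = num_features
        self.hash_func_version = hash_func_version

    def index_of(self, term):
        return HASH_FUNCS[self.hash_func_version](term) % self.num_features

    def save(self):
        # The hash version travels with the model metadata...
        return {"numFeatures": self.num_features,
                "hashFuncVersion": self.hash_func_version}

    @classmethod
    def load(cls, metadata):
        # ...so a reloaded model keeps hashing terms exactly the same way,
        # no matter how many save/load round trips it goes through.
        return cls(metadata["numFeatures"], metadata["hashFuncVersion"])
```

Exposing the version as a readable property (the SPARK-31610 ask) then lets external serializers like MLeap carry the same guarantee.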
[jira] [Updated] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31549: -- Target Version/s: 3.0.0 > Pyspark SparkContext.cancelJobGroup do not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time. It happens because the PySpark thread is not pinned > to a JVM thread when invoking Java-side methods, which causes all PySpark > APIs that rely on Java thread-local variables to misbehave. (Including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` > and so on.) > This is a serious issue. There is an experimental pyspark 'PIN_THREAD' mode > added in spark-3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default. We need to set an additional environment variable > to enable it. > * There is a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely > on the `sc.cancelJobGroup` API (they use it to stop running jobs in their code). So it > is critical to address this issue, and we hope it works under the default pyspark > mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
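The thread-pinning problem described in the issue above is easy to reproduce outside Spark. The following toy sketch (plain Python, not Spark code; all function names are hypothetical) mimics how a job group stored in a thread-local on the JVM side becomes invisible when the actual Java call runs on a different thread than the one that set the property:

```python
import threading

# Toy stand-in for the JVM side: the job group is stored per *JVM* thread.
jvm_local = threading.local()

def set_job_group(group):
    jvm_local.group = group

def get_job_group():
    return getattr(jvm_local, "group", None)

def call_via_other_thread(fn, *args):
    # Mimics unpinned PySpark: the Java-side call may execute on a
    # different thread than the Python caller's original thread.
    result = []
    t = threading.Thread(target=lambda: result.append(fn(*args)))
    t.start()
    t.join()
    return result[0]

# Pinned behavior (same thread): the property set earlier is visible.
set_job_group("g1")
assert get_job_group() == "g1"

# Unpinned behavior (different thread): the thread-local is empty there,
# so a cancelJobGroup-style call would target the wrong (missing) group.
assert call_via_other_thread(get_job_group) is None
```

This is why `PIN_THREAD` mode, which maps each Python thread to a dedicated JVM thread, fixes the family of APIs built on `sc.setLocalProperty`.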
[jira] [Resolved] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-31497. --- Resolution: Fixed Issue resolved by pull request 28279 [https://github.com/apache/spark/pull/28279] > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model > -- > > Key: SPARK-31497 > URL: https://issues.apache.org/jira/browse/SPARK-31497 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model. > Reproduction code, run in the pyspark shell: > 1) Train and save the model in pyspark: > {code:python} > from pyspark.ml import Pipeline > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.feature import HashingTF, Tokenizer > from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, > ParamGridBuilder > training = spark.createDataFrame([ > (0, "a b c d e spark", 1.0), > (1, "b d", 0.0), > (2, "spark f g h", 1.0), > (3, "hadoop mapreduce", 0.0), > (4, "b spark who", 1.0), > (5, "g d a y", 0.0), > (6, "spark fly", 1.0), > (7, "was mapreduce", 0.0), > (8, "e spark program", 1.0), > (9, "a e c l", 0.0), > (10, "spark compile", 1.0), > (11, "hadoop software", 0.0) > ], ["id", "text", "label"]) > # Configure an ML pipeline, which consists of three stages: tokenizer, > hashingTF, and lr. 
> tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression(maxIter=10) > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ > .addGrid(lr.regParam, [0.1, 0.01]) \ > .build() > crossval = CrossValidator(estimator=pipeline, > estimatorParamMaps=paramGrid, > evaluator=BinaryClassificationEvaluator(), > numFolds=2) # use 3+ folds in practice > # Run cross-validation, and choose the best set of parameters. > cvModel = crossval.fit(training) > cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error > {code} > 2) Train a cross-validation model in Scala with similar code as above, save it to > '/tmp/model_cv_scala001', and run the following code in pyspark: > {code:python} > from pyspark.ml.tuning import CrossValidatorModel > CrossValidatorModel.load('/tmp/model_cv_scala001') # raises an error > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31497: -- Fix Version/s: 3.0.0 > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model > -- > > Key: SPARK-31497 > URL: https://issues.apache.org/jira/browse/SPARK-31497 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model. > Reproduction code, run in the pyspark shell: > 1) Train and save the model in pyspark: > {code:python} > from pyspark.ml import Pipeline > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.feature import HashingTF, Tokenizer > from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, > ParamGridBuilder > training = spark.createDataFrame([ > (0, "a b c d e spark", 1.0), > (1, "b d", 0.0), > (2, "spark f g h", 1.0), > (3, "hadoop mapreduce", 0.0), > (4, "b spark who", 1.0), > (5, "g d a y", 0.0), > (6, "spark fly", 1.0), > (7, "was mapreduce", 0.0), > (8, "e spark program", 1.0), > (9, "a e c l", 0.0), > (10, "spark compile", 1.0), > (11, "hadoop software", 0.0) > ], ["id", "text", "label"]) > # Configure an ML pipeline, which consists of three stages: tokenizer, > hashingTF, and lr. 
> tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression(maxIter=10) > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ > .addGrid(lr.regParam, [0.1, 0.01]) \ > .build() > crossval = CrossValidator(estimator=pipeline, > estimatorParamMaps=paramGrid, > evaluator=BinaryClassificationEvaluator(), > numFolds=2) # use 3+ folds in practice > # Run cross-validation, and choose the best set of parameters. > cvModel = crossval.fit(training) > cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error > {code} > 2) Train a cross-validation model in Scala with similar code as above, save it to > '/tmp/model_cv_scala001', and run the following code in pyspark: > {code:python} > from pyspark.ml.tuning import CrossValidatorModel > CrossValidatorModel.load('/tmp/model_cv_scala001') # raises an error > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31497: -- Target Version/s: 3.0.0 > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model > -- > > Key: SPARK-31497 > URL: https://issues.apache.org/jira/browse/SPARK-31497 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot > save and load model. > Reproduction code, run in the pyspark shell: > 1) Train and save the model in pyspark: > {code:python} > from pyspark.ml import Pipeline > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.feature import HashingTF, Tokenizer > from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, > ParamGridBuilder > training = spark.createDataFrame([ > (0, "a b c d e spark", 1.0), > (1, "b d", 0.0), > (2, "spark f g h", 1.0), > (3, "hadoop mapreduce", 0.0), > (4, "b spark who", 1.0), > (5, "g d a y", 0.0), > (6, "spark fly", 1.0), > (7, "was mapreduce", 0.0), > (8, "e spark program", 1.0), > (9, "a e c l", 0.0), > (10, "spark compile", 1.0), > (11, "hadoop software", 0.0) > ], ["id", "text", "label"]) > # Configure an ML pipeline, which consists of three stages: tokenizer, > hashingTF, and lr. 
> tokenizer = Tokenizer(inputCol="text", outputCol="words") > hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") > lr = LogisticRegression(maxIter=10) > pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) > paramGrid = ParamGridBuilder() \ > .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \ > .addGrid(lr.regParam, [0.1, 0.01]) \ > .build() > crossval = CrossValidator(estimator=pipeline, > estimatorParamMaps=paramGrid, > evaluator=BinaryClassificationEvaluator(), > numFolds=2) # use 3+ folds in practice > # Run cross-validation, and choose the best set of parameters. > cvModel = crossval.fit(training) > cvModel.save('/tmp/cv_model001') # saving the model fails and raises an error > {code} > 2) Train a cross-validation model in Scala with similar code as above, save it to > '/tmp/model_cv_scala001', and run the following code in pyspark: > {code:python} > from pyspark.ml.tuning import CrossValidatorModel > CrossValidatorModel.load('/tmp/model_cv_scala001') # raises an error > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046795#comment-17046795 ] Xiangrui Meng commented on SPARK-30969: --- [~Ngone51] [~jiangxb1987] Is there a JIRA to deprecate multiple workers running on the same host? Could you create one and link it here? I think we should deprecate it in 3.0. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Environment: (was: Resource coordination is used for the case where multiple workers run on the same host. However, it should be a rare or even impossible use case in current Standalone (which already allows multiple executors in a single worker). We should remove support for it to simplify the implementation and reduce the potential maintenance cost in the future.) > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Priority: Critical (was: Major) > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Priority: Critical > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30969: - Assignee: wuyi > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Assignee: wuyi >Priority: Critical > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30969) Remove resource coordination support from Standalone
[ https://issues.apache.org/jira/browse/SPARK-30969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30969: -- Description: Resource coordination is used for the case where multiple workers run on the same host. However, it should be a rare or even impossible use case in current Standalone (which already allows multiple executors in a single worker). We should remove support for it to simplify the implementation and reduce the potential maintenance cost in the future. > Remove resource coordination support from Standalone > > > Key: SPARK-30969 > URL: https://issues.apache.org/jira/browse/SPARK-30969 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Resource coordination is used for the case where > multiple workers run on the same host. However, it should be a rare or > even impossible use case in current Standalone (which already allows multiple > executors in a single worker). We should remove support for it to simplify the > implementation and reduce the potential maintenance cost in the future. >Reporter: wuyi >Assignee: wuyi >Priority: Critical > > Resource coordination is used for the case where multiple workers run on > the same host. However, it should be a rare or even impossible use case in > current Standalone (which already allows multiple executors in a single worker). > We should remove support for it to simplify the implementation and reduce the > potential maintenance cost in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Fix Version/s: (was: 3.0.0) > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-30667: --- > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-30667. --- Resolution: Fixed > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30667: - Assignee: Sarth Frey > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Target Version/s: 3.0.0 (was: 3.1.0) > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Fix Version/s: 3.0.0 > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30762: -- Component/s: PySpark > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30762: - Assignee: Liang Zhang > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30667: -- Description: Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks can see all IP addresses from BarrierTaskContext. It would be simpler to integrate with distributed frameworks like TensorFlow DistributionStrategy if we provide an all-gather that lets tasks share additional information with others, e.g., an available port. Note that with all-gather, tasks share their IP addresses as well. {code} port = ... # get an available port ports = context.all_gather(port) # get all available ports, ordered by task ID ... # set up distributed training service {code} was: Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks can see all IP addresses from BarrierTaskContext. It would be simpler to integrate with distributed frameworks like TensorFlow DistributionStrategy if we provide an all-gather that lets tasks share additional information with others, e.g., an available port. Note that with all-gather, tasks share their IP addresses as well. > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... 
# set up distributed training service > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30667) Support simple all gather in barrier task context
Xiangrui Meng created SPARK-30667: - Summary: Support simple all gather in barrier task context Key: SPARK-30667 URL: https://issues.apache.org/jira/browse/SPARK-30667 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks can see all IP addresses from BarrierTaskContext. It would be simpler to integrate with distributed frameworks like TensorFlow DistributionStrategy if we provide an all-gather that lets tasks share additional information with others, e.g., an available port. Note that with all-gather, tasks share their IP addresses as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
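The all-gather semantics proposed in this issue (each task contributes one value and every task receives the full list, ordered by task ID) can be sketched as a toy thread-based simulation. This is an illustration of the intended semantics only, not Spark's actual BarrierTaskContext implementation; the function names are hypothetical:

```python
import threading

def run_all_gather(num_tasks, contribute):
    """Toy model of barrier all-gather: every 'task' (thread) contributes
    one message and receives all messages ordered by task ID."""
    slots = [None] * num_tasks
    barrier = threading.Barrier(num_tasks)   # all tasks must rendezvous
    results = [None] * num_tasks

    def task(tid):
        slots[tid] = contribute(tid)   # e.g., an available port
        barrier.wait()                 # no task proceeds until all have contributed
        results[tid] = list(slots)     # every task sees the same ordered list

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Four "tasks" each publish a port; all of them see the full ordered list.
gathered = run_all_gather(4, lambda tid: 5000 + tid)
assert all(r == [5000, 5001, 5002, 5003] for r in gathered)
```

The ordering guarantee is what makes the snippet in the issue description work: task 0 can deterministically pick, say, `ports[0]` as the coordinator address.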
[jira] [Resolved] (SPARK-30154) PySpark UDF to convert MLlib vectors to dense arrays
[ https://issues.apache.org/jira/browse/SPARK-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-30154. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26910 [https://github.com/apache/spark/pull/26910] > PySpark UDF to convert MLlib vectors to dense arrays > > > Key: SPARK-30154 > URL: https://issues.apache.org/jira/browse/SPARK-30154 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame > into dense arrays, an efficient approach is to do that in the JVM. However, it > requires the PySpark user to write Scala code and register it as a UDF. Often > this is infeasible for a pure Python project. > What we can do is to predefine those converters in Scala and expose them in > PySpark, e.g.: > {code} > from pyspark.ml.functions import vector_to_dense_array > df.select(vector_to_dense_array(col("features"))) > {code} > cc: [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30154) PySpark UDF to convert MLlib vectors to dense arrays
[ https://issues.apache.org/jira/browse/SPARK-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30154: -- Description: If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. What we can do is to predefine those converters in Scala and expose them in PySpark, e.g.: {code} from pyspark.ml.functions import vector_to_dense_array df.select(vector_to_dense_array(col("features"))) {code} cc: [~weichenxu123] was: If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient method is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. What we can do is to predefine those converters in Scala and expose them in PySpark, e.g.: {code} from pyspark.ml.functions import vector_to_dense_array df.select(vector_to_dense_array(col("features"))) {code} cc: [~weichenxu123] > PySpark UDF to convert MLlib vectors to dense arrays > > > Key: SPARK-30154 > URL: https://issues.apache.org/jira/browse/SPARK-30154 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame > into dense arrays, an efficient approach is to do that in the JVM. However, it > requires the PySpark user to write Scala code and register it as a UDF. Often > this is infeasible for a pure Python project. 
> What we can do is to predefine those converters in Scala and expose them in > PySpark, e.g.: > {code} > from pyspark.ml.functions import vector_to_dense_array > df.select(vector_to_dense_array(col("features"))) > {code} > cc: [~weichenxu123] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30154) PySpark UDF to convert MLlib vectors to dense arrays
[ https://issues.apache.org/jira/browse/SPARK-30154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30154: -- Summary: PySpark UDF to convert MLlib vectors to dense arrays (was: Allow PySpark code efficiently convert MLlib vectors to dense arrays) > PySpark UDF to convert MLlib vectors to dense arrays > > > Key: SPARK-30154 > URL: https://issues.apache.org/jira/browse/SPARK-30154 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame > into dense arrays, an efficient method is to do that in the JVM. However, it > requires a PySpark user to write Scala code and register it as a UDF. Often > this is infeasible for a pure Python project. > What we can do is to predefine those converters in Scala and expose them in > PySpark, e.g.: > {code} > from pyspark.ml.functions import vector_to_dense_array > df.select(vector_to_dense_array(col("features"))) > {code} > cc: [~weichenxu123]
[jira] [Created] (SPARK-30154) Allow PySpark code efficiently convert MLlib vectors to dense arrays
Xiangrui Meng created SPARK-30154: - Summary: Allow PySpark code efficiently convert MLlib vectors to dense arrays Key: SPARK-30154 URL: https://issues.apache.org/jira/browse/SPARK-30154 Project: Spark Issue Type: New Feature Components: ML, MLlib, PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient method is to do that in the JVM. However, it requires a PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. What we can do is to predefine those converters in Scala and expose them in PySpark, e.g.: {code} from pyspark.ml.functions import vector_to_dense_array df.select(vector_to_dense_array(col("features"))) {code} cc: [~weichenxu123]
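For context, the per-row conversion such a UDF would perform can be sketched in plain Python, with no JVM involved. `sparse_to_dense` below is a hypothetical stand-in for the proposed Scala-side converter, not part of any Spark API:

```python
# An MLlib SparseVector stores (size, indices, values); the dense form
# expands it to a full list of floats. Illustrative stand-in only.
def sparse_to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

print(sparse_to_dense(5, [1, 3], [2.0, 4.5]))  # [0.0, 2.0, 0.0, 4.5, 0.0]
```

Doing this in the JVM avoids a Python round-trip per row, which is the efficiency argument in the description above.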
[jira] [Resolved] (SPARK-28978) PySpark: Can't pass more than 256 arguments to a UDF
[ https://issues.apache.org/jira/browse/SPARK-28978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28978. --- Fix Version/s: 3.0.0 Assignee: Bago Amirbekian Resolution: Fixed > PySpark: Can't pass more than 256 arguments to a UDF > > > Key: SPARK-28978 > URL: https://issues.apache.org/jira/browse/SPARK-28978 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0, 2.4.4 >Reporter: Jim Fulton >Assignee: Bago Amirbekian >Priority: Major > Labels: koalas, mlflow, pyspark > Fix For: 3.0.0 > > > This code: > [https://github.com/apache/spark/blob/712874fa0937f0784f47740b127c3bab20da8569/python/pyspark/worker.py#L367-L379] > Creates Python lambdas that call UDF functions passing arguments singly, > rather than using varargs. For example: `lambda a: f(a[0], a[1], ...)`. > This fails when there are more than 256 arguments. > mlflow, when generating model predictions, uses an argument for each feature > column. I have a model with > 500 features. > I was able to easily hack around this by changing the generated lambdas to > use varargs, as in `lambda a: f(*a)`. > IDK why these lambdas were created the way they were. Using varargs is much > simpler and works fine in my testing. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
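The varargs workaround described in the report can be illustrated without Spark. Here `f` is a placeholder for the wrapped UDF function, and the row width is arbitrary:

```python
# Sketch of the fix: generate `lambda a: f(*a)` instead of
# `lambda a: f(a[0], a[1], ...)`, which fails past 256 explicit
# arguments on affected CPython versions. `f` stands in for the UDF.
def f(*cols):
    return sum(cols)

wide_row = tuple(range(300))   # a row with 300 "feature columns"
call = lambda a: f(*a)         # varargs unpacking has no width limit
print(call(wide_row))          # 44850
```

The limit applies only to call sites that spell out each argument; star-unpacking sidesteps it entirely, which is why the one-line change works.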
[jira] [Resolved] (SPARK-29417) Resource Scheduling - add TaskContext.resource java api
[ https://issues.apache.org/jira/browse/SPARK-29417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-29417. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26083 [https://github.com/apache/spark/pull/26083] > Resource Scheduling - add TaskContext.resource java api > --- > > Key: SPARK-29417 > URL: https://issues.apache.org/jira/browse/SPARK-29417 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > I noticed the TaskContext.resource() api we added returns a scala Map. This > isn't very nice for the java api usage, so we should add an api that returns > a java Map. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-28206: - Assignee: Hyukjin Kwon > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Resolved] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28206. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25060 [https://github.com/apache/spark/pull/25060] > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Attachment: Screen Shot 2019-06-28 at 9.55.13 AM.png > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Issue Type: Bug (was: Documentation) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc (was: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html (was: "@" is rendered as ":" in doctest) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator]
[jira] [Created] (SPARK-28206) "@" is rendered as ":" in doctest
Xiangrui Meng created SPARK-28206: - Summary: "@" is rendered as ":" in doctest Key: SPARK-28206 URL: https://issues.apache.org/jira/browse/SPARK-28206 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 2.4.1 Reporter: Xiangrui Meng Just noticed that in [pandas_udf API doc |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], "@pandas_udf" is rendered as ":pandas_udf". cc: [~hyukjin.kwon] [~smilegator]
[jira] [Resolved] (SPARK-28115) Fix flaky test: SparkContextSuite.test resource scheduling under local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-28115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28115. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24917 [https://github.com/apache/spark/pull/24917] > Fix flaky test: SparkContextSuite.test resource scheduling under > local-cluster mode > --- > > Key: SPARK-28115 > URL: https://issues.apache.org/jira/browse/SPARK-28115 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > This test suite has two kind of failures. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6486/testReport/org.apache.spark/SparkContextSuite/test_resource_scheduling_under_local_cluster_mode/history/ > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3921 times over > 1.88681567 minutes. Last failure message: > Array(org.apache.spark.SparkExecutorInfoImpl@533793be, > org.apache.spark.SparkExecutorInfoImpl@5018e57d, > org.apache.spark.SparkExecutorInfoImpl@dea2485, > org.apache.spark.SparkExecutorInfoImpl@9e63ecd) had size 4 instead of > expected size 3. 
> at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) > at > org.apache.spark.SparkContextSuite.eventually(SparkContextSuite.scala:50) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336) > at > org.apache.spark.SparkContextSuite.eventually(SparkContextSuite.scala:50) > at > org.apache.spark.SparkContextSuite.$anonfun$new$93(SparkContextSuite.scala:885){code} > {code} > org.scalatest.exceptions.TestFailedException: Array("0", "0", "1", "1", "1", > "2", "2", "2", "2") did not equal List("0", "0", "0", "1", "1", "1", "2", > "2", "2") > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.SparkContextSuite.$anonfun$new$93(SparkContextSuite.scala:894) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28056) Document SCALAR_ITER Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28056. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24897 [https://github.com/apache/spark/pull/24897] > Document SCALAR_ITER Pandas UDF > --- > > Key: SPARK-28056 > URL: https://issues.apache.org/jira/browse/SPARK-28056 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users > can discover the feature and learn how to use it.
[jira] [Assigned] (SPARK-28056) Document SCALAR_ITER Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-28056: - Assignee: Xiangrui Meng > Document SCALAR_ITER Pandas UDF > --- > > Key: SPARK-28056 > URL: https://issues.apache.org/jira/browse/SPARK-28056 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users > can discover the feature and learn how to use it.
[jira] [Resolved] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-26412. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24643 [https://github.com/apache/spark/pull/24643] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Pandas UDF is the ideal connection between PySpark and DL model inference > workloads. However, a user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, the user needs to repeatedly load > the same model for every batch in the same Python worker process, which is > inefficient. > We can provide users an iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if the UDF is called with a single non-struct-type column > * a tuple of pd.Series if the UDF is called with more than one Spark DF column > * a pd.DataFrame if the UDF is called with a single StructType column > Examples: > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ...
# load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > If the UDF doesn't return the same number of records for the entire > partition, the user should see an error. We don't require that every yield > match the input batch size. > Another benefit: with the iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
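Stripped of Spark and pandas, the load-once pattern the iterator UDF enables looks like this in pure Python (the "model" here is a trivial stand-in for an expensive load):

```python
# A generator mimicking a SCALAR_ITER UDF: the expensive setup runs once
# per worker process, then every batch in the iterator reuses it.
def predict(batch_iter):
    model_offset = 10                     # stands in for loading a ~100MB model
    for batch in batch_iter:              # each batch ~ a pd.Series of values
        yield [x + model_offset for x in batch]

batches = [[1, 2], [3, 4]]
print(list(predict(iter(batches))))       # [[11, 12], [13, 14]]
```

Because `predict` is a generator over the whole partition rather than a per-batch function, the setup cost is amortized across all batches, which is the core motivation stated above.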
[jira] [Created] (SPARK-28056) Document SCALAR_ITER Pandas UDF
Xiangrui Meng created SPARK-28056: - Summary: Document SCALAR_ITER Pandas UDF Key: SPARK-28056 URL: https://issues.apache.org/jira/browse/SPARK-28056 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users can discover the feature and learn how to use it.
[jira] [Resolved] (SPARK-28030) Binary file data source doesn't support space in file names
[ https://issues.apache.org/jira/browse/SPARK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28030. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24855 [https://github.com/apache/spark/pull/24855] > Binary file data source doesn't support space in file names > --- > > Key: SPARK-28030 > URL: https://issues.apache.org/jira/browse/SPARK-28030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > {code} > echo 123 > "/tmp/test space.txt" > spark.read.format("binaryFile").load("/tmp/test space.txt").count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs
[ https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27823: - Assignee: Xiangrui Meng (was: Thomas Graves) > Add an abstraction layer for accelerator resource handling to avoid > manipulating raw confs > -- > > Key: SPARK-27823 > URL: https://issues.apache.org/jira/browse/SPARK-27823 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > In SPARK-27488, we extract resource requests and allocation by parsing raw > Spark confs. This hurts readability because we didn't have the abstraction at > resource level. After we merge the core changes, we should do a refactoring > and make the code more readable. > See https://github.com/apache/spark/pull/24615#issuecomment-494580663. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28030) Binary file data source doesn't support space in file names
[ https://issues.apache.org/jira/browse/SPARK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28030: -- Description: {code} echo 123 > "/tmp/test space.txt" spark.read.format("binaryFile").load("/tmp/test space.txt").count() {code} > Binary file data source doesn't support space in file names > --- > > Key: SPARK-28030 > URL: https://issues.apache.org/jira/browse/SPARK-28030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > {code} > echo 123 > "/tmp/test space.txt" > spark.read.format("binaryFile").load("/tmp/test space.txt").count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28030) Binary file data source doesn't support space in file names
Xiangrui Meng created SPARK-28030: - Summary: Binary file data source doesn't support space in file names Key: SPARK-28030 URL: https://issues.apache.org/jira/browse/SPARK-28030 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861353#comment-16861353 ] Xiangrui Meng commented on SPARK-27360: --- Done. Thanks for taking the task! > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 2. auto discovery of GPUs > 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-27999) setup resources when Standalone Worker starts up
[ https://issues.apache.org/jira/browse/SPARK-27999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng deleted SPARK-27999: -- > setup resources when Standalone Worker starts up > > > Key: SPARK-27999 > URL: https://issues.apache.org/jira/browse/SPARK-27999 > Project: Spark > Issue Type: Sub-task >Reporter: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27372) Standalone executor process-level isolation to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27372: -- Issue Type: Story (was: Sub-task) Parent: (was: SPARK-27360) > Standalone executor process-level isolation to support GPU scheduling > - > > Key: SPARK-27372 > URL: https://issues.apache.org/jira/browse/SPARK-27372 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > As an admin, I can configure standalone to have multiple executor processes > on the same worker node and processes are configured via cgroups so they only > have access to assigned GPUs. So I don't need to worry about resource > contention between processes on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27371) Standalone master receives resource info from worker and allocate driver/executor properly
[ https://issues.apache.org/jira/browse/SPARK-27371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27371: -- Summary: Standalone master receives resource info from worker and allocate driver/executor properly (was: Master receives resource info from worker and allocate driver/executor properly) > Standalone master receives resource info from worker and allocate > driver/executor properly > -- > > Key: SPARK-27371 > URL: https://issues.apache.org/jira/browse/SPARK-27371 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > As an admin, I can let Spark Standalone worker automatically discover GPUs > installed on worker nodes. So I don't need to manually configure them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27371) Master receives resource info from worker and allocate driver/executor properly
[ https://issues.apache.org/jira/browse/SPARK-27371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27371: -- Summary: Master receives resource info from worker and allocate driver/executor properly (was: Standalone worker can auto discover GPUs) > Master receives resource info from worker and allocate driver/executor > properly > --- > > Key: SPARK-27371 > URL: https://issues.apache.org/jira/browse/SPARK-27371 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > As an admin, I can let Spark Standalone worker automatically discover GPUs > installed on worker nodes. So I don't need to manually configure them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-27370) spark-submit requests GPUs in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng deleted SPARK-27370: -- > spark-submit requests GPUs in standalone mode > - > > Key: SPARK-27370 > URL: https://issues.apache.org/jira/browse/SPARK-27370 > Project: Spark > Issue Type: Sub-task >Reporter: Xiangrui Meng >Priority: Major > > As a user, I can use spark-submit to request GPUs per task in standalone mode > when I submit an Spark application. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27369: -- Summary: Standalone worker can load resource conf and discover resources (was: Standalone support static conf to describe GPU resources) > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster mode, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources in creating new executors or drivers. Example conf: {code} # static worker conf spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh # application conf spark.driver.resource.gpu.amount=4 spark.executor.resource.gpu.amount=2 spark.task.resource.gpu.amount=1 {code} In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Supporting resource isolation for multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads `work.source.*` conf and runs discovery scripts to discover resources. 3. Worker reports cores, memory, and resources (new) to the master and registers. 4. An application starts.
5. Master finds workers with sufficient available resources and lets them start executor or driver processes. 6. Worker assigns executor/driver resources by passing the resource info on the command line. 7. Application ends. 8. Master requests the worker to kill the driver/executor process. 9. Master updates available resources. was: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role that allocates resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to use the same resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf:
{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.executor.resource.gpu.count=1
{code}
In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, as YARN does for GPU support, which is less ideal. Supporting isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads the `spark.worker.resource.*` conf and runs discovery scripts to discover resources. 3.
Worker reports to master cores, memory, and resources (new) and registers. 4. An application starts. 5. Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create on
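The discovery script referenced by spark.worker.resource.gpu.discoveryScript is user-supplied. A minimal sketch of what /path/to/list-gpus.sh could look like, assuming the script is expected to print a single JSON object naming the resource and listing its addresses (the nvidia-smi invocation and the exact output shape here are illustrative assumptions, not the final contract):

```python
#!/usr/bin/env python3
# Hypothetical GPU discovery script for the standalone worker.
# Prints one JSON object such as {"name": "gpu", "addresses": ["0", "1"]}.
import json
import subprocess


def list_gpu_addresses():
    """Return GPU indices reported by nvidia-smi, or [] if it is unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            text=True)
        return [line.strip() for line in out.splitlines() if line.strip()]
    except (OSError, subprocess.CalledProcessError):
        # No driver / no GPUs on this worker: report an empty address list.
        return []


if __name__ == "__main__":
    print(json.dumps({"name": "gpu", "addresses": list_gpu_addresses()}))
```

The worker would run this once at startup (step 2 of the timeline) and report the discovered addresses to the master along with cores and memory.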
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role that allocates resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to use the same resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf:
{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.executor.resource.gpu.count=1
{code}
In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, as YARN does for GPU support, which is less ideal. Supporting isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads the `spark.worker.resource.*` conf and runs discovery scripts to discover resources. 3. Worker reports cores, memory, and resources (new) to the master and registers. 4. An application starts. 5.
Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. was: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests. Besides CPU cores and memory, worker now also considers resources in creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, driver process is not launched by worker. So user can specify driver resource discovery script. In cluster mode, if user still specify driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Support resource isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads `work.source.*` conf and runs discovery scripts to discover resources. 3. 
Worker reports to master cores, memory, and resources (new) and registers. 4. An application starts. 5. Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluste
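Step 5 of the timeline above, where the master matches a resource request against workers with sufficient free capacity, can be illustrated with a small sketch. The dictionary shapes and field names below are assumptions for illustration, not Spark's actual internal data structures:

```python
def pick_workers(workers, cores, memory_mb, resources):
    """Return the workers that can satisfy an executor/driver request.

    `workers` is a list of dicts like
        {"id": "w1", "free_cores": 8, "free_memory_mb": 16384,
         "free_resources": {"gpu": ["0", "1"]}}
    and `resources` maps resource name -> requested amount, e.g. {"gpu": 2}.
    """
    def fits(w):
        return (w["free_cores"] >= cores
                and w["free_memory_mb"] >= memory_mb
                # Every requested resource must have enough free addresses.
                and all(len(w["free_resources"].get(name, [])) >= amount
                        for name, amount in resources.items()))

    return [w for w in workers if fits(w)]
```

After the master picks a worker, the worker passes the assigned resource addresses to the new executor or driver process on the command line (step 6), and returns them to the free pool when the process is killed (steps 8 and 9).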
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests. Besides CPU cores and memory, worker now also considers resources in creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, driver process is not launched by worker. So user can specify driver resource discovery script. In cluster mode, if user still specify driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Support resource isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. Timeline: 1. Worker starts. 2. Worker loads `work.source.*` conf and runs discovery scripts to discover resources. 3. Worker reports to master cores, memory, and resources (new) and registers. 4. An application starts. 5. 
Master finds workers with sufficient available resources and let worker start executor or driver process. 6. Worker assigns executor / driver resources by passing the resource info from command-line. 7. Application ends. 8. Master requests worker to kill driver/executor process. 9. Master updates available resources. was: Design draft: Scenarios: * client-mode, worker might create one or more executor processes, from different Spark applications. * cluster-mode, worker might create driver process as well. * local-cluster model, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, which is mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests. Besides CPU cores and memory, worker now also considers resources in creating new executors or drivers. Example conf: {code} spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh spark.driver.resource.gpu.count=4 spark.worker.resource.gpu.count=1 {code} In client mode, driver process is not launched by worker. So user can specify driver resource discovery script. In cluster mode, if user still specify driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because Spark worker doesn't know how to isolate resources unless we hardcode some resource names like GPU support in YARN, which is less ideal. Support resource isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. 
> Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > * Resource isolation is not considered here. > Because executor and driver processes on the same node will share the > accelerator resources, worker must take the role that allocates resources. So > we will add spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. User need to match
[jira] [Resolved] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row
[ https://issues.apache.org/jira/browse/SPARK-27968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27968. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24816 [https://github.com/apache/spark/pull/24816] > ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row > - > > Key: SPARK-27968 > URL: https://issues.apache.org/jira/browse/SPARK-27968 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > An issue mentioned here: > https://github.com/apache/spark/pull/24734/files#r288377915, could be > decoupled from that PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-26412: -- Description: Pandas UDF is the ideal connection between PySpark and DL model inference workloads. However, the user needs to load the model file first to make predictions, and it is common to see models of ~100MB or bigger. If the Pandas UDF execution is limited to each batch, the user needs to repeatedly load the same model for every batch in the same Python worker process, which is inefficient. We can provide users an iterator of batches as pd.DataFrame and let user code handle it:
{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
    model = ...  # load model
    for batch in batch_iter:
        yield model.predict(batch)
{code}
The type of each batch is: * a pd.Series if the UDF is called with a single non-struct-type column * a tuple of pd.Series if the UDF is called with more than one Spark DataFrame column * a pd.DataFrame if the UDF is called with a single StructType column Examples:
{code}
@pandas_udf(...)
def evaluate(batch_iter):
    model = ...  # load model
    for features, label in batch_iter:
        pred = model.predict(features)
        yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}
{code}
@pandas_udf(...)
def evaluate(pdf_iter):
    model = ...  # load model
    for pdf in pdf_iter:
        pred = model.predict(pdf['x'])
        yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}
If the UDF doesn't return the same total number of records for the entire partition, the user should see an error. We don't require that every yield match the input batch size. Another benefit: with the iterator interface and Python's asyncio, users have the flexibility to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] was: Pandas UDF is the ideal connection between PySpark and DL model inference workload.
However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * a pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. 
If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if UDF is called with more than one Spark DF columns > * a pd.DataFrame if UDF is called with a single StructType column > Exa
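The data pipelining mentioned above (overlapping CPU-side batch preparation with model prediction) can also be sketched without asyncio, by wrapping the batch iterator in a background prefetch thread. The `prefetch` helper and the UDF usage below are illustrative, not an API being proposed:

```python
import queue
import threading


def prefetch(batch_iter, buffer_size=2):
    """Consume batch_iter on a background thread so the next batch is being
    fetched/converted while the caller is still predicting on the current one."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the iterator

    def producer():
        for batch in batch_iter:
            q.put(batch)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


# Inside a SCALAR_ITER Pandas UDF this could be used as (sketch):
#
# @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
# def predict(batch_iter):
#     model = load_model()  # loaded once per Python worker
#     for batch in prefetch(batch_iter):
#         yield model.predict(batch)
```

A production version would also propagate exceptions from the producer thread to the consumer; the sketch omits that for brevity.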
[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-26412: -- Description: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * a pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] was: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. 
If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. 
If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * a pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if predict is called with more than one Spark DF > columns > * a pd.DataFrame if predict is called with a single StructType column > {code} > @pandas_udf(...) > def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pre
[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-26412: -- Description: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} The type of each batch is: * pd.Series if UDF is called with a single non-struct-type column * a tuple of pd.Series if predict is called with more than one Spark DF columns * a pd.DataFrame if predict is called with a single StructType column {code} @pandas_udf(...) def evaluate(batch_iter): model = ... # load model for features, label in batch_iter: pred = model.predict(features) yield (pred - label).abs() df.select(evaluate(col("features"), col("label")).alias("err")) {code} {code} @pandas_udf(...) def evaluate(pdf_iter): model = ... # load model for pdf in pdf_iter: pred = model.predict(pdf['x']) yield (pred - pdf['y']).abs() df.select(evaluate(struct(col("features"), col("label"))).alias("err")) {code} Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] was: Pandas UDF is the ideal connection between PySpark and DL model inference workload. However, user needs to load the model file first to make predictions. It is common to see models of size ~100MB or bigger. 
If the Pandas UDF execution is limited to each batch, user needs to repeatedly load the same model for every batch in the same python worker process, which is inefficient. We can provide users the iterator of batches in pd.DataFrame and let user code handle it: {code} @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR) def predict(batch_iter): model = ... # load model for batch in batch_iter: yield model.predict(batch) {code} We might add a contract that each yield must match the corresponding batch size. Another benefit is with iterator interface and asyncio from Python, it is flexible for users to implement data pipelining. cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator] > Allow Pandas UDF to take an iterator of pd.DataFrames > - > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workload. However, user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to each batch, user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. > We can provide users the iterator of batches in pd.DataFrame and let user > code handle it: > {code} > @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER) > def predict(batch_iter): > model = ... # load model > for batch in batch_iter: > yield model.predict(batch) > {code} > The type of each batch is: > * pd.Series if UDF is called with a single non-struct-type column > * a tuple of pd.Series if predict is called with more than one Spark DF > columns > * a pd.DataFrame if predict is called with a single StructType column > {code} > @pandas_udf(...) 
> def evaluate(batch_iter): > model = ... # load model > for features, label in batch_iter: > pred = model.predict(features) > yield (pred - label).abs() > df.select(evaluate(col("features"), col("label")).alias("err")) > {code} > {code} > @pandas_udf(...) > def evaluate(pdf_iter): > model = ... # load model > for pdf in pdf_iter: > pred = model.predict(pdf['x']) > yield (pred - pdf['y']).abs() > df.select(evaluate(struct(col("features"), col("label"))).alias("err")) > {code} > Another benefit is with iterator interface and asyncio from Python, it is > flexible for users to implement data pipelining. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
[jira] [Created] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row
Xiangrui Meng created SPARK-27968: - Summary: ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row Key: SPARK-27968 URL: https://issues.apache.org/jira/browse/SPARK-27968 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng An issue mentioned here: https://github.com/apache/spark/pull/24734/files#r288377915, could be decoupled from that PR.
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368: -- Description: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests. * Resource isolation is not considered here. Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role that allocates resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to use the same resourceName in driver and executor requests. Besides CPU cores and memory, the worker now also considers resources when creating new executors or drivers. Example conf:
{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.worker.resource.gpu.count=1
{code}
In client mode, the driver process is not launched by the worker, so the user can specify a driver resource discovery script. In cluster mode, if the user still specifies a driver resource discovery script, it is ignored with a warning. Supporting resource isolation is tricky because the Spark worker doesn't know how to isolate resources unless we hardcode some resource names, as YARN does for GPU support, which is less ideal. Supporting isolation of multiple resource types is even harder. In the first version, we will implement accelerator support without resource isolation. was: Design draft: Scenarios: * client mode: the worker might create one or more executor processes, possibly from different Spark applications. * cluster mode: the worker might create the driver process as well. * local-cluster mode: there could be multiple worker processes on the same node.
This is an undocumented use of standalone mode, which is mainly for tests. Because executor and driver processes on the same node will share the accelerator resources, worker must take the role that allocates resources. So we will add spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. User need to match the resourceName in driver and executor requests and they don't need to specify discovery scripts separately. > Design: Standalone supports GPU scheduling > -- > > Key: SPARK-27368 > URL: https://issues.apache.org/jira/browse/SPARK-27368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design draft: > Scenarios: > * client-mode, worker might create one or more executor processes, from > different Spark applications. > * cluster-mode, worker might create driver process as well. > * local-cluster model, there could be multiple worker processes on the same > node. This is an undocumented use of standalone mode, which is mainly for > tests. > * Resource isolation is not considered here. > Because executor and driver processes on the same node will share the > accelerator resources, worker must take the role that allocates resources. So > we will add spark.worker.resource.[resourceName].discoveryScript conf for > workers to discover resources. User need to match the resourceName in driver > and executor requests. Besides CPU cores and memory, worker now also > considers resources in creating new executors or drivers. > Example conf: > {code} > spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh > spark.driver.resource.gpu.count=4 > spark.worker.resource.gpu.count=1 > {code} > In client mode, driver process is not launched by worker. So user can specify > driver resource discovery script. In cluster mode, if user still specify > driver resource discovery script, it is ignored with a warning. 
> Supporting resource isolation is tricky because Spark worker doesn't know how > to isolate resources unless we hardcode some resource names like GPU support > in YARN, which is less ideal. Support resource isolation of multiple resource > types is even harder. In the first version, we will implement accelerator > support without resource isolation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
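To make the discovery-script contract above concrete, here is a minimal sketch of what `/path/to/list-gpus.sh` could look like, written as a Python script. It assumes the script must print a single JSON object naming the resource and listing its addresses (the format Spark's resource discovery expects); the two hard-coded GPU indices are hypothetical placeholders for real device enumeration, e.g. via `nvidia-smi`.

```python
#!/usr/bin/env python3
# Sketch of a worker-side resource discovery script. The worker runs it and
# parses its stdout to learn which GPU addresses it can hand to executors.
import json

def discover_gpus():
    # Hypothetical: a real script would enumerate devices, for example by
    # shelling out to `nvidia-smi --query-gpu=index --format=csv,noheader`.
    addresses = ["0", "1"]
    return {"name": "gpu", "addresses": addresses}

if __name__ == "__main__":
    # Prints: {"name": "gpu", "addresses": ["0", "1"]}
    print(json.dumps(discover_gpus()))
```

With `spark.worker.resource.gpu.count=1`, the worker would advertise only one of the discovered addresses; mismatches between the configured count and the script's output are the worker's responsibility to reconcile.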
[jira] [Updated] (SPARK-27368) Design: Standalone supports GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27368:
--
Description:
Design draft:

Scenarios:
* client-mode, the worker might create one or more executor processes, from different Spark applications.
* cluster-mode, the worker might create the driver process as well.
* local-cluster mode, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests.

Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and don't need to specify discovery scripts separately.

> Design: Standalone supports GPU scheduling
> --
>
> Key: SPARK-27368
> URL: https://issues.apache.org/jira/browse/SPARK-27368
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
>
> Design draft:
>
> Scenarios:
> * client-mode, the worker might create one or more executor processes, from different Spark applications.
> * cluster-mode, the worker might create the driver process as well.
> * local-cluster mode, there could be multiple worker processes on the same node. This is an undocumented use of standalone mode, mainly for tests.
>
> Because executor and driver processes on the same node will share the accelerator resources, the worker must take the role of allocating resources. So we will add a spark.worker.resource.[resourceName].discoveryScript conf for workers to discover resources. Users need to match the resourceName in driver and executor requests and don't need to specify discovery scripts separately.
[jira] [Resolved] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27366.
---
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24374
[https://github.com/apache/spark/pull/24374]

> Spark scheduler internal changes to support GPU scheduling
> --
>
> Key: SPARK-27366
> URL: https://issues.apache.org/jira/browse/SPARK-27366
> Project: Spark
> Issue Type: Story
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xingbo Jiang
> Priority: Major
> Fix For: 3.0.0
>
> Update the Spark job scheduler to support accelerator resource requests submitted at the application level.
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855838#comment-16855838 ] Xiangrui Meng commented on SPARK-27888:
---
It would be nice if we could find some PySpark users who have already migrated from Python 2 to 3, so they can tell us what issues users should expect and what benefits the migration brings.

> Python 2->3 migration guide for PySpark users
> --
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It doesn't need to be comprehensive given the many Python 2->3 migration guides around. We just need some pointers and list items that are specific to PySpark.
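As an illustration of the kind of PySpark-specific item such a guide might list (a hypothetical entry, not taken from the issue): the change in integer division can silently alter numeric results in lambdas ported unchanged from Python 2 into `rdd.map`-style transformations.

```python
# Hypothetical guide entry: Python 2's `/` floors when both operands are
# integers, while Python 3's `/` always performs true division. A lambda such
# as rdd.map(lambda x: x / 2), ported unchanged, yields floats under Python 3;
# use `//` to preserve the Python 2 result.

values = [7, 8, 9]
py3_division = [x / 2 for x in values]   # true division: [3.5, 4.0, 4.5]
py2_behavior = [x // 2 for x in values]  # floor division, as Python 2: [3, 4, 4]
```

The same logic applies to any user-defined function shipped to executors, so the change can affect job output without raising any error.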
[jira] [Assigned] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27884:
--
Assignee: Xiangrui Meng

> Deprecate Python 2 support in Spark 3.0
> --
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
> Issue Type: Story
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Labels: release-notes
>
> Officially deprecate Python 2 support in Spark 3.0.
[jira] [Commented] (SPARK-27888) Python 2->3 migration guide for PySpark users
[ https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855832#comment-16855832 ] Xiangrui Meng commented on SPARK-27888:
---
This JIRA is not to inform users that Python 2 is deprecated. It provides info to help users migrate to Python 3, though I don't know if there are things specific to PySpark. I think it should go in the user guide rather than the release notes.

> Python 2->3 migration guide for PySpark users
> --
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It doesn't need to be comprehensive given the many Python 2->3 migration guides around. We just need some pointers and list items that are specific to PySpark.
[jira] [Resolved] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27886.
---
Resolution: Done

> Add Apache Spark project to https://python3statement.org/
> --
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I reviewed the statement at https://python3statement.org/. Most projects listed there will *drop* Python 2 support before 2020/01/01 instead of merely deprecating it, with only [one exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. We certainly cannot drop Python 2 support in 2019 given we haven't deprecated it yet.
>
> Maybe we can add the following timeline:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
>
> The switch would come in the next release after Spark 3.0. If we want to hold another release before then, it would be in September or October.