[jira] [Resolved] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22060.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19278
[https://github.com/apache/spark/pull/19278]

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.






[jira] [Comment Edited] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175629#comment-16175629
 ] 

Joseph K. Bradley edited comment on SPARK-19357 at 9/21/17 11:14 PM:
-

[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit 
foolish; I just realized these changes to support parallel model evaluation are 
going to cause some problems for optimizing multi-model fitting.
* When we originally designed the Pipelines API, we put {{def fit(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class 
Estimator}} for the sake of eventually being able to override that method 
within specific Estimators which can do algorithm-specific optimizations.  
E.g., if you're tuning {{maxIter}}, then you should really only fit once and 
just save the model at various iterations along the way.
* These recent changes in master to CrossValidator and TrainValidationSplit 
have switched from calling fit() with all of the ParamMaps to calling fit() 
with a single ParamMap.  This means that the model-specific optimization is no 
longer possible.

Although we haven't found time yet to do these model-specific optimizations, 
I'd really like for us to be able to do so in the future.  For some models, 
this could lead to huge speedups (N^2 to N for the case of maxIter for linear 
models).  Any ideas for fixing this?  Here are my thoughts:
* To allow model-specific optimization, the implementation for fitting for 
multiple ParamMaps needs to be within models, not within CrossValidator or 
other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], 
paramMaps: Array[ParamMap]): Seq[M]}}.  However, we will need an API which 
takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method as long as 
it has a default implementation, without worrying about breaking APIs across 
Spark versions.  So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: 
Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from 
CrossValidator to disk during tuning.  For that, we would need to be able to 
pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: 
Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something 
like that; a rough sketch follows below).
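
A minimal Scala sketch of that proposed shape (the {{Dataset}} and {{ParamMap}} classes below are simplified stand-ins, and all signatures here are illustrative rather than the actual Spark API):
{code}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Simplified stand-ins for Spark's Dataset and ParamMap, used only to show the
// shape of the proposed API; they are not the real Spark classes.
final case class Dataset(name: String)
final case class ParamMap(settings: Map[String, Any])

abstract class Estimator[M] {
  // Existing-style single-ParamMap fit (simplified).
  def fit(dataset: Dataset, paramMap: ParamMap): M

  // Proposed shape: multi-ParamMap fit taking a parallelism level and a
  // callback invoked as each model finishes (e.g. to dump it to disk). The
  // default implementation just fans out single-ParamMap fits; a concrete
  // Estimator could override it with an algorithm-specific optimization
  // (e.g. fitting once and snapshotting models along the maxIter path).
  def fit(dataset: Dataset,
          paramMaps: Array[ParamMap],
          parallelism: Int,
          callback: M => Unit): Seq[M] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = paramMaps.toSeq.map { pm =>
        Future {
          val model = fit(dataset, pm)
          callback(model)
          model
        }
      }
      futures.map(f => Await.result(f, Duration.Inf))
    } finally {
      pool.shutdown()
    }
  }
}
{code}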

What do you think?


was (Author: josephkb):
[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit 
foolish; I just realized these changes to support parallel model evaluation are 
going to cause some problems for optimizing multi-model fitting.
* When we originally designed the Pipelines API, we put {{def fit(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class 
Estimator}} for the sake of eventually being able to override that method 
within specific Estimators which can do algorithm-specific optimizations.  
E.g., if you're tuning {{maxIter}}, then you should really only fit once and 
just save the model at various iterations along the way.
* These recent changes in master to CrossValidator and TrainValidationSplit 
have switched from calling fit() with all of the ParamMaps to calling fit() 
with a single ParamMap.  This means that the model-specific optimization is no 
longer possible.

Although we haven't found time yet to do these model-specific optimizations, 
I'd really like for us to be able to do so in the future.  Any ideas for fixing 
this?  Here are my thoughts:
* To allow model-specific optimization, the implementation for fitting for 
multiple ParamMaps needs to be within models, not within CrossValidator or 
other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], 
paramMaps: Array[ParamMap]): Seq[M]}}.  However, we will need an API which 
takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method as long as 
it has a default implementation, without worrying about breaking APIs across 
Spark versions.  So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: 
Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from 
CrossValidator to disk during tuning.  For that, we would need to be able to 
pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: 
Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something 
like that).

What do you think?

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0

[jira] [Commented] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175629#comment-16175629
 ] 

Joseph K. Bradley commented on SPARK-19357:
---

[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit 
foolish; I just realized these changes to support parallel model evaluation are 
going to cause some problems for optimizing multi-model fitting.
* When we originally designed the Pipelines API, we put {{def fit(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class 
Estimator}} for the sake of eventually being able to override that method 
within specific Estimators which can do algorithm-specific optimizations.  
E.g., if you're tuning {{maxIter}}, then you should really only fit once and 
just save the model at various iterations along the way.
* These recent changes in master to CrossValidator and TrainValidationSplit 
have switched from calling fit() with all of the ParamMaps to calling fit() 
with a single ParamMap.  This means that the model-specific optimization is no 
longer possible.

Although we haven't found time yet to do these model-specific optimizations, 
I'd really like for us to be able to do so in the future.  Any ideas for fixing 
this?  Here are my thoughts:
* To allow model-specific optimization, the implementation for fitting for 
multiple ParamMaps needs to be within models, not within CrossValidator or 
other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], 
paramMaps: Array[ParamMap]): Seq[M]}}.  However, we will need an API which 
takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method as long as 
it has a default implementation, without worrying about breaking APIs across 
Spark versions.  So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: 
Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from 
CrossValidator to disk during tuning.  For that, we would need to be able to 
pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: 
Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something 
like that).

What do you think?

> Parallel Model Evaluation for ML Tuning: Scala
> --
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline 
> Tuning to perform model evaluation in parallel.  A simple approach is to 
> naively evaluate with a possible parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism.  1 will evaluate 
> all models in serial, as is done currently. Higher values could lead to 
> excessive caching.






[jira] [Updated] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22060:
--
Target Version/s: 2.3.0

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.






[jira] [Assigned] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22060:
-

Assignee: Weichen Xu

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.






[jira] [Updated] (SPARK-22060) CrossValidator/TrainValidationSplit parallelism param persist/load bug

2017-09-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22060:
--
Shepherd: Joseph K. Bradley

> CrossValidator/TrainValidationSplit parallelism param persist/load bug
> --
>
> Key: SPARK-22060
> URL: https://issues.apache.org/jira/browse/SPARK-22060
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> CrossValidator/TrainValidationSplit `parallelism` param cannot be saved, when 
> we save the CrossValidator/TrainValidationSplit object to disk.






[jira] [Updated] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-09-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14371:
--
Shepherd: Joseph K. Bradley

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to 
> driver
> 
>
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> See this line: 
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each 
> document in the mini-batch.  Those are collected to the driver in this line:
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver.  Rather, we should do the 
> necessary maps and aggregations in a distributed manner.  This will involve 
> modifying the Dirichlet expectation implementation.  (This JIRA should be done 
> by someone knowledgeable about online LDA and Spark.)
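
A simplified, hedged illustration of the distributed-aggregation direction (the per-document statistic is assumed to be a dense-matrix contribution here; this is not the actual {{LDAOptimizer}} data structure):
{code}
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// Instead of collecting one entry per document to the driver, sum the
// contributions on the executors and bring back only the aggregated matrix.
def aggregateStats(perDocStats: RDD[DenseMatrix[Double]],
                   rows: Int,
                   cols: Int): DenseMatrix[Double] = {
  perDocStats.treeAggregate(DenseMatrix.zeros[Double](rows, cols))(
    seqOp = (acc, contribution) => acc += contribution,
    combOp = (a, b) => a += b)
}
{code}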






[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-09-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170734#comment-16170734
 ] 

Joseph K. Bradley commented on SPARK-21770:
---

Update from PR discussion: The new plan is to throw an Exception.  We can do an 
actual fix if this method is used by other models in the future.

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries
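
A small, hedged sketch of the case in question (simplified to a plain array; not the actual MLlib code), following the updated plan above of raising an error on all-zero input rather than returning the uniform 1/n fallback described in the summary:
{code}
// All-zero raw predictions cannot be normalized, so fail loudly instead of
// silently producing a uniform distribution.
def normalizeToProbabilities(rawPrediction: Array[Double]): Array[Double] = {
  val sum = rawPrediction.sum
  require(sum != 0.0,
    "Cannot normalize an all-zero raw prediction vector to probabilities.")
  rawPrediction.map(_ / sum)
}
{code}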






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163976#comment-16163976
 ] 

Joseph K. Bradley commented on SPARK-21866:
---

1. For the namespace, here are my thoughts:

I don't feel too strongly about this, but I'd vote for putting it under 
{{org.apache.spark.ml.image}}.
Pros:
* The image package will be in the spark-ml sub-project, and this fits that 
structure.
* This will avoid polluting the o.a.s namespace, and we do not yet have any 
other data types listed under o.a.s.
Cons:
* Images are more general than ML.  We might want to move the image package out 
of spark-ml eventually.

2. For the SQL data source, HUGE +1 for making a data source

I'm glad it's mentioned in the SPIP, but I would really like to see it 
prioritized.  There's no need to make a dependency between SQL and ML by adding 
options to the image data source reader; data sources support optional 
arguments.  E.g., the CSV data source has option "delimiter" but that is wholly 
contained within the data source; it doesn't affect other data sources.  Is 
there an option needed by the image data source which will force us to abuse 
the data source API?
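
To make that concrete, a small hedged illustration (the CSV {{delimiter}} option is real; the {{image}} format name and {{dropInvalid}} option below are hypothetical placeholders):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("datasource-options-example").getOrCreate()

// CSV's "delimiter" option is wholly contained within the CSV data source:
val csvDF = spark.read.option("delimiter", ";").csv("/path/to/data.csv")

// An image data source could take its own options the same way, with no
// dependency between SQL and ML:
val imageDF = spark.read.format("image").option("dropInvalid", "true").load("/path/to/images")
{code}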

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark

[jira] [Resolved] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18608.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 19197
[https://github.com/apache/spark/pull/19197]

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
> Fix For: 2.2.1, 2.3.0
>
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.
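
A minimal sketch of the suggested migration (the helper name is assumed; this is not the actual MLlib code):
{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Check the Dataset's own storage level (available since SPARK-16063) instead
// of dataset.rdd.getStorageLevel, which is NONE even for a cached DataFrame.
def shouldHandlePersistence(dataset: Dataset[_]): Boolean =
  dataset.storageLevel == StorageLevel.NONE
{code}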






[jira] [Updated] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18608:
--
Target Version/s: 2.2.1, 2.3.0

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Assigned] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-18608:
-

Assignee: zhengruifeng

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Assigned] (SPARK-21027) Parallel One vs. Rest Classifier

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21027:
-

Assignee: Ajay Saini

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is currently not tunable. Furthermore, the python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028






[jira] [Resolved] (SPARK-21027) Parallel One vs. Rest Classifier

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21027.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19110
[https://github.com/apache/spark/pull/19110]

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is currently not tunable. Furthermore, the python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028






[jira] [Commented] (SPARK-19422) Cache input data in algorithms

2017-09-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162008#comment-16162008
 ] 

Joseph K. Bradley commented on SPARK-19422:
---

Linking [SPARK-21972], which may interact with this PR.  Let's be very careful 
about changing the behavior, though I agree this will be valuable to improve.

> Cache input data in algorithms
> --
>
> Key: SPARK-19422
> URL: https://issues.apache.org/jira/browse/SPARK-19422
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> Now some algorithms cache the input dataset if it was not cached any more 
> {{StorageLevel.NONE}}:
> {{FeedForwardTrainer}}, {{LogisticRegression}}, {{OneVsRest}}, {{KMeans}}, 
> {{AFTSurvivalRegression}}, {{IsotonicRegression}}, {{LinearRegression}} with 
> non-WSL solver
> It maybe reasonable to cache input for others:
> {{DecisionTreeClassifier}}, {{GBTClassifier}}, {{RandomForestClassifier}}, 
> {{LinearSVC}}
> {{BisectingKMeans}}, {{GaussianMixture}}, {{LDA}}
> {{DecisionTreeRegressor}}, {{GBTRegressor}}, {{GeneralizedLinearRegression}} 
> with IRLS solver, {{RandomForestRegressor}}
> {{NaiveBayes}} is not included since it makes only one pass over the data.
> {{MultilayerPerceptronClassifier}} is not included since the data is cached 
> in {{FeedForwardTrainer.train}}






[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162007#comment-16162007
 ] 

Joseph K. Bradley commented on SPARK-21972:
---

The issue (a) does not really conflict with or affect this JIRA; it can be 
fixed separately.

Also, I recommend we *not* set handlePersistence to true by default in the 
shared param.  Algorithms should set the default individually since the right 
behavior is very algorithm-dependent.

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.
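
A hedged sketch of what such a shared param could look like (assumed trait and param names, not code from the proposal). No default is set here, so each algorithm can choose its own, as recommended in the preceding comment:
{code}
import org.apache.spark.ml.param.{BooleanParam, Params}

// Shared param trait; algorithms mixing this in would call setDefault
// individually, since the right default is algorithm-dependent.
trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(
    this, "handlePersistence",
    "whether the algorithm should cache un-cached input data before fitting")

  final def getHandlePersistence: Boolean = $(handlePersistence)
}
{code}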






[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162004#comment-16162004
 ] 

Joseph K. Bradley commented on SPARK-18608:
---

Hi all, it looks like there has been confusion about what has been agreed on.  
This is my current understanding:

There are 2 issues:
1. This JIRA [SPARK-18608], which discusses the bug of double-caching because 
of misuse of {{dataset.rdd.getStorageLevel}}.  Note that [SPARK-21799] is just 
a special case of this bug.
2. [SPARK-21972], which discusses adding a parameter handlePersistence to allow 
user control over whether to cache the input data.

I recommend:
1. We should fix the current double-caching bug in master and branch-2.2.  
Going from Spark 2.1 to 2.2, I've only seen a performance regression with 
K-Means, but I recommend we fix the bug for all cases.  This fix would be like 
[~podongfeng]'s original PR for https://github.com/apache/spark/pull/17014 
(before adding in handlePersistence).
2. We can work on adding handlePersistence to master.  No backporting there of 
course.  Note that [SPARK-19422] is also related, and it may be blocked by 
decisions on [SPARK-21972].

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Closed] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-09-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-21799.
-
Resolution: Duplicate

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]






[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-09-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161992#comment-16161992
 ] 

Joseph K. Bradley commented on SPARK-21799:
---

Now that I've caught up on these, this is just a special case of the bug in 
[SPARK-18608].  I'm going to close this issue and ask that a PR like 
[~podongfeng]'s original PR be sent for [SPARK-18608], fixing the use of 
{{dataset.rdd.getStorageLevel}}.  I think we should fix it for all algorithms, 
not just K-Means.

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]






[jira] [Resolved] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-09-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21729.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19065
[https://github.com/apache/spark/pull/19065]

> Generic test for ProbabilisticClassifier to ensure consistent output columns
> 
>
> Key: SPARK-21729
> URL: https://issues.apache.org/jira/browse/SPARK-21729
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> One challenge with the ProbabilisticClassifier abstraction is that it 
> introduces different code paths for predictions depending on which output 
> columns are turned on or off: probability, rawPrediction, prediction.  We ran 
> into a bug in MLOR with this.
> This task is for adding a generic test usable in all test suites for 
> ProbabilisticClassifier types which does the following:
> * Take a dataset + Estimator
> * Fit the Estimator
> * Test prediction using the model with all combinations of output columns 
> turned on/off.
> * Make sure the output column values match, presumably by comparing vs. the 
> case with all 3 output columns turned on
> CC [~WeichenXu123] since this came up in 
> https://github.com/apache/spark/pull/17373






[jira] [Assigned] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-09-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21729:
-

Assignee: Weichen Xu

> Generic test for ProbabilisticClassifier to ensure consistent output columns
> 
>
> Key: SPARK-21729
> URL: https://issues.apache.org/jira/browse/SPARK-21729
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>
> One challenge with the ProbabilisticClassifier abstraction is that it 
> introduces different code paths for predictions depending on which output 
> columns are turned on or off: probability, rawPrediction, prediction.  We ran 
> into a bug in MLOR with this.
> This task is for adding a generic test usable in all test suites for 
> ProbabilisticClassifier types which does the following:
> * Take a dataset + Estimator
> * Fit the Estimator
> * Test prediction using the model with all combinations of output columns 
> turned on/off.
> * Make sure the output column values match, presumably by comparing vs. the 
> case with all 3 output columns turned on
> CC [~WeichenXu123] since this came up in 
> https://github.com/apache/spark/pull/17373






[jira] [Issue Comment Deleted] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-09-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21729:
--
Comment: was deleted

(was: User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19065)

> Generic test for ProbabilisticClassifier to ensure consistent output columns
> 
>
> Key: SPARK-21729
> URL: https://issues.apache.org/jira/browse/SPARK-21729
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> One challenge with the ProbabilisticClassifier abstraction is that it 
> introduces different code paths for predictions depending on which output 
> columns are turned on or off: probability, rawPrediction, prediction.  We ran 
> into a bug in MLOR with this.
> This task is for adding a generic test usable in all test suites for 
> ProbabilisticClassifier types which does the following:
> * Take a dataset + Estimator
> * Fit the Estimator
> * Test prediction using the model with all combinations of output columns 
> turned on/off.
> * Make sure the output column values match, presumably by comparing vs. the 
> case with all 3 output columns turned on
> CC [~WeichenXu123] since this came up in 
> https://github.com/apache/spark/pull/17373






[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-09-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150846#comment-16150846
 ] 

Joseph K. Bradley commented on SPARK-21770:
---

Linear models are the most likely to hit this case; if the algorithm has done 0 
iterations, then all coefficients will be 0.  But I agree it's just fixing a 
corner case which few people would ever hit.  OK to fix though IMO.

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries






[jira] [Resolved] (SPARK-21862) Add overflow check in PCA

2017-08-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21862.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19078
[https://github.com/apache/spark/pull/19078]

> Add overflow check in PCA
> -
>
> Key: SPARK-21862
> URL: https://issues.apache.org/jira/browse/SPARK-21862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> We should add overflow check in PCA, otherwise it is possible to throw 
> `NegativeArraySizeException` when `k` and `numFeatures` are too large.
> The overflow checking formula is here:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
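
A hedged sketch of the kind of pre-allocation guard being discussed (assumed helper name and size formulas, not the actual fix or the exact Breeze check):
{code}
// Validate that the intermediate arrays fed to the SVD fit in a Java array
// (Int-indexed) before allocating them, so the user sees a clear error instead
// of a NegativeArraySizeException.
def checkPcaSizes(numFeatures: Int, k: Int): Unit = {
  // An upper-triangular Gramian/covariance buffer has numFeatures * (numFeatures + 1) / 2 entries.
  val covSize = numFeatures.toLong * (numFeatures + 1) / 2
  require(covSize <= Int.MaxValue,
    s"numFeatures = $numFeatures is too large: the covariance buffer would need " +
      s"$covSize entries, which exceeds the maximum Java array length.")
  require(k.toLong * numFeatures <= Int.MaxValue,
    s"k * numFeatures = ${k.toLong * numFeatures} exceeds the maximum Java array length.")
}
{code}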






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-08-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148268#comment-16148268
 ] 

Joseph K. Bradley commented on SPARK-21866:
---

It's a valid question, but overall, I'd support this effort.  My thoughts:

Summary: Image processing use cases have become increasingly important, 
especially because of the rise of interest in deep learning.  It's valuable to 
standardize around a common format, partly for users and partly for developers.

Q: Are images a common data type?  I.e., if we were talking about adding 
support for storing text in Spark DataFrames, there would be no question that 
Spark must be able to handle text since it is such a common data format.  Are 
images common enough to merit inclusion in Spark?
A: I'd argue yes, partly because of the rise in requests around it.  But also, 
if it makes sense for a general purpose language like Java to contain image 
formats, then it likewise makes sense for a general purpose data processing 
library like Spark to contain image formats.  This does not duplicate 
functionality from java.awt (or other libraries) since the key elements being 
added here are Spark-specific: a Spark DataFrame schema and a Spark Data Source.

Q: Will leaving this functionality in a package, rather than putting it in 
Spark, be sufficient?
A: I worry that this will limit adoption, as well as community oversight of 
such a core piece of functionality.  Tooling built upon image formats, 
including image processing algorithms, could live outside of Spark, but basic 
image loading and saving should IMO live in Spark.

Q: Will users really benefit?
A: My main reason to support this is confusion I've heard about the right way 
to handle images in Spark.  They are sometimes handled outside of Spark's data 
model (often giving up proper resilience guarantees), are handled by falling 
back to the RDD API, etc.  I hope that standardization will simplify life for 
users (clarifying and standardizing APIs) and library developers (facilitating 
collaboration on image ETL).

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.

[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21866:
--
Target Version/s:   (was: 2.3.0)

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> ** If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The 
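(The quoted description is cut off by the archive at this point. For orientation only, here is a hedged Scala sketch covering just the two fields quoted above; the field names and nullability follow the text, and nothing beyond the visible description is assumed.)

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch of the schema fields visible in the (truncated) description above.
// Only "mode" and "origin" are covered; the remaining fields are cut off.
val imageSchema = StructType(Seq(
  // OpenCV-style type string, e.g. "CV_8UC3" = 3-channel unsigned bytes;
  // the empty string "" if the image failed to load.
  StructField("mode", StringType, nullable = false),
  // Information about the origin of the image (description truncated above).
  StructField("origin", StringType, nullable = true)
))
{code}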

[jira] [Updated] (SPARK-21862) Add overflow check in PCA

2017-08-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21862:
--
Shepherd: Joseph K. Bradley

> Add overflow check in PCA
> -
>
> Key: SPARK-21862
> URL: https://issues.apache.org/jira/browse/SPARK-21862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> We should add an overflow check in PCA; otherwise it is possible to throw a 
> `NegativeArraySizeException` when `k` and `numFeatures` are too large.
> The overflow-checking formula is here:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
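As a rough illustration of the kind of guard this asks for (a hypothetical pre-check, not the actual patch and not the Breeze formula linked above), one could verify that the k x numFeatures component matrix fits in a single JVM array before any allocation happens:

{code}
// Hypothetical guard, for illustration only: allocating roughly
// k * numFeatures doubles overflows Int and surfaces as a
// NegativeArraySizeException if the product exceeds Int.MaxValue.
def checkPcaSize(k: Int, numFeatures: Int): Unit = {
  val numEntries = k.toLong * numFeatures
  require(numEntries <= Int.MaxValue,
    s"k * numFeatures = $numEntries exceeds Int.MaxValue; " +
      "the principal components cannot be stored in a single array")
}
{code}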



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21862) Add overflow check in PCA

2017-08-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21862:
-

Assignee: Weichen Xu

> Add overflow check in PCA
> -
>
> Key: SPARK-21862
> URL: https://issues.apache.org/jira/browse/SPARK-21862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>
> We should add an overflow check in PCA; otherwise it is possible to throw a 
> `NegativeArraySizeException` when `k` and `numFeatures` are too large.
> The overflow-checking formula is here:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-08-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17139.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 15435
[https://github.com/apache/spark/pull/15435]

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> Add a model summary to multinomial logistic regression, using the same 
> interface as in other ML models.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-08-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17139:
-

Assignee: Weichen Xu

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Weichen Xu
>
> Add a model summary to multinomial logistic regression, using the same 
> interface as in other ML models.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Labels: correctness  (was: )

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>  Labels: correctness
> Fix For: 2.2.1, 2.3.0
>
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21681.
---
   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 19026
[https://github.com/apache/spark/pull/19026]

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.2.1, 2.3.0
>
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12664) Expose probability, rawPrediction in MultilayerPerceptronClassificationModel

2017-08-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12664.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17373
[https://github.com/apache/spark/pull/17373]

> Expose probability, rawPrediction in MultilayerPerceptronClassificationModel
> 
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12664) Expose probability, rawPrediction in MultilayerPerceptronClassificationModel

2017-08-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12664:
--
Summary: Expose probability, rawPrediction in 
MultilayerPerceptronClassificationModel  (was: Expose raw prediction scores in 
MultilayerPerceptronClassificationModel)

> Expose probability, rawPrediction in MultilayerPerceptronClassificationModel
> 
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-08-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137697#comment-16137697
 ] 

Joseph K. Bradley commented on SPARK-21535:
---

[~yuhaoyan] Parallel training of models can be beneficial; we've done tests 
showing decent speedups (2-3x).  But the benefits are generally limited to 
small models or small data, where training a single model does not produce 
enough work to keep the whole cluster busy.  For larger problems, parallel 
training does not help as much.

I agree with you that parallel training & this fix should not conflict too 
much: The memory efficiency issue is a problem for big models; parallel 
training is more useful with smaller models.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, the current implementation 
> consumes extra driver memory to hold all of the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that each used model can be collected by GC, avoiding the unnecessary OOM 
> exceptions.
> E.g., when the grid search space has 12 candidates, the old implementation 
> needs to hold all 12 trained models in driver memory at the same time, while 
> the new implementation only needs to hold 1 trained model at a time; each 
> previous model can be cleared by GC.
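A hedged sketch of the proposed direction (illustrative names, not the eventual patch): fit one ParamMap at a time so that only the current model is strongly referenced, and keep just the metrics.

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Fit and evaluate candidates one at a time; each model goes out of scope
// after its metric is computed, so it becomes eligible for GC.
def sequentialMetrics[M <: Model[M]](
    est: Estimator[M],
    eval: Evaluator,
    epm: Array[ParamMap],
    training: Dataset[_],
    validation: Dataset[_]): Array[Double] = {
  epm.map { paramMap =>
    val model = est.fit(training, paramMap)
    eval.evaluate(model.transform(validation, paramMap))
  }
}
{code}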



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-08-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137690#comment-16137690
 ] 

Joseph K. Bradley commented on SPARK-21086:
---

My understanding is that they actually want these models, and that the reasons 
vary.  Some reasons I've heard include:
* You may decide you want to use a different cross-val score later on, or you 
may want to compute it on a new dataset.
* You may want to do analysis on the model coefficients/data to understand what 
tuning is doing.
* (There's also the issue which can alternatively be solved by [SPARK-18704].)

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Fix Version/s: 2.3.0

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137638#comment-16137638
 ] 

Joseph K. Bradley commented on SPARK-21681:
---

I'll leave this open until it's been backported to 2.2

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.3.0
>
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133483#comment-16133483
 ] 

Joseph K. Bradley commented on SPARK-21770:
---

I vaguely recall discussing this before but forget where that discussion was.  
Overall, I'd vote for the uniform distribution:
* The "probability" column has a clear meaning: It should provide a predicted 
probability distribution over class labels.  An all-0 vector is not a valid 
probability distribution.
* It does not really make sense to say all classes are impossible.  When 
fitting a statistical model to predict from n classes, one makes the implicit 
assumption that there exist "true" classes to be predicted.

However, I can see the argument for not changing current behavior (from a 
software engineering standpoint).  That said, if people are relying on this 
behavior, their application logic is probably incorrect from a statistical 
modeling perspective.

Any opinions [~sethah], [~yanboliang], [~dbtsai] ?

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries
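For illustration only (this is not Spark's normalizeToProbabilitiesInPlace, just a standalone sketch of the proposed behavior): normalize non-negative raw scores, falling back to the uniform distribution when they are all zero.

{code}
import org.apache.spark.ml.linalg.{DenseVector, Vectors}

// Normalize raw scores to probabilities; an all-zero vector maps to 1/n each.
def normalizeOrUniform(raw: DenseVector): DenseVector = {
  val sum = raw.values.sum
  val probs =
    if (sum == 0.0) Array.fill(raw.size)(1.0 / raw.size)
    else raw.values.map(_ / sum)
  new DenseVector(probs)
}

// normalizeOrUniform(Vectors.dense(0.0, 0.0, 0.0).toDense)
//   => [0.3333..., 0.3333..., 0.3333...]
{code}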



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-08-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131340#comment-16131340
 ] 

Joseph K. Bradley commented on SPARK-19747:
---

Just saying: Thanks a lot for doing this reorg!  It's a nice step towards 
having pluggable algorithms.

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.
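A hedged sketch of the shape this consolidation could take (the names here are illustrative and do not match Spark's internal classes): a shared parent owns the bookkeeping that is currently duplicated, so a concrete aggregator only implements add(), and a single cost function can then be written once against the parent type.

{code}
import org.apache.spark.ml.linalg.Vector

trait LossAggregator {
  var weightSum = 0.0
  var lossSum = 0.0
  var gradientSum: Array[Double] = Array.empty

  /** Algorithm-specific seqOp: fold one weighted instance into the sums. */
  def add(features: Vector, label: Double, weight: Double): this.type

  /** Shared combOp: identical across aggregators, implemented once. */
  def merge(other: LossAggregator): this.type = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    if (gradientSum.isEmpty) gradientSum = other.gradientSum.clone()
    else for (i <- gradientSum.indices) gradientSum(i) += other.gradientSum(i)
    this
  }

  def loss: Double = if (weightSum > 0) lossSum / weightSum else 0.0
}
{code}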



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Shepherd: Joseph K. Bradley

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Affects Version/s: 2.3.0

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Target Version/s: 2.2.1, 2.3.0

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21681:
-

Assignee: Weichen Xu

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include a zero-variance 
> column; it produces a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-08-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126583#comment-16126583
 ] 

Joseph K. Bradley commented on SPARK-12664:
---

[~yanboliang] I can take over shepherding this feature, but let me know if 
you'd like to return to it.  I just made an initial review pass.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-08-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12664:
--
Shepherd: Joseph K. Bradley  (was: Yanbo Liang)

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-08-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21729:
-

 Summary: Generic test for ProbabilisticClassifier to ensure 
consistent output columns
 Key: SPARK-21729
 URL: https://issues.apache.org/jira/browse/SPARK-21729
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


One challenge with the ProbabilisticClassifier abstraction is that it 
introduces different code paths for predictions depending on which output 
columns are turned on or off: probability, rawPrediction, prediction.  We ran 
into a bug in MLOR with this.

This task is for adding a generic test usable in all test suites for 
ProbabilisticClassifier types which does the following:
* Take a dataset + Estimator
* Fit the Estimator
* Test prediction using the model with all combinations of output columns 
turned on/off.
* Make sure the output column values match, presumably by comparing vs. the 
case with all 3 output columns turned on

CC [~WeichenXu123] since this came up in 
https://github.com/apache/spark/pull/17373
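A hedged sketch of such a generic check (the helper name and exact comparison are illustrative, not the test that was eventually added); it relies on the existing convention that setting an output column name to "" disables that column in transform():

{code}
import org.apache.spark.ml.classification.ProbabilisticClassificationModel
import org.apache.spark.sql.DataFrame

def checkConsistentOutputColumns[F, M <: ProbabilisticClassificationModel[F, M]](
    model: M,
    dataset: DataFrame): Unit = {
  val allCols = Seq("rawPrediction", "probability", "prediction")

  // Reference run: all three output columns enabled.
  model.setRawPredictionCol("rawPrediction")
  model.setProbabilityCol("probability")
  model.setPredictionCol("prediction")
  val full = model.transform(dataset)
  val expected = allCols.map(c => c -> full.select(c).collect().map(_.get(0))).toMap

  // Every non-empty subset of output columns must reproduce the same values.
  for (enabled <- allCols.toSet.subsets() if enabled.nonEmpty) {
    model.setRawPredictionCol(if (enabled("rawPrediction")) "rawPrediction" else "")
    model.setProbabilityCol(if (enabled("probability")) "probability" else "")
    model.setPredictionCol(if (enabled("prediction")) "prediction" else "")
    val out = model.transform(dataset)
    enabled.foreach { c =>
      val actual = out.select(c).collect().map(_.get(0))
      assert(actual.sameElements(expected(c)),
        s"Column $c differs when only ${enabled.mkString(", ")} are enabled")
    }
  }
}
{code}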



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124493#comment-16124493
 ] 

Joseph K. Bradley commented on SPARK-17025:
---

[~nchammas] I just merged https://github.com/apache/spark/pull/1 which 
should make this work if the custom Transformer uses simple (JSON-serializable) 
Params to store all of its data.  Does it meet your use case?  I'd like to make 
it easier to implement ML persistence for fancier data types in Transformers 
and Models (like Vectors or DataFrames) in the future, but hopefully this 
unblocks some use cases for now.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2017-08-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122492#comment-16122492
 ] 

Joseph K. Bradley commented on SPARK-21685:
---

Could you please point to more info, such as the Python wrappers you are 
calling?  I don't see enough info here to identify the problem.

> Params isSet in scala Transformer triggered by _setDefault in pyspark
> -
>
> Key: SPARK-21685
> URL: https://issues.apache.org/jira/browse/SPARK-21685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ratan Rai Sur
>
> I'm trying to write a PySpark wrapper for a Transformer whose transform 
> method includes the line
> {code:java}
> require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both 
> outputNodeName and outputNodeIndex")
> {code}
> This should only throw an exception when both of these parameters are 
> explicitly set.
> In the PySpark wrapper for the Transformer, there is this line in __init__
> {code:java}
> self._setDefault(outputNodeIndex=0)
> {code}
> Here is the line in the main python script showing how it is being configured
> {code:java}
> cntkModel = 
> CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark,
>  model.uri).setOutputNodeName("z")
> {code}
> As you can see, only setOutputNodeName is being explicitly set but the 
> exception is still being thrown.
> If you need more context, 
> https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the 
> branch with the code, the files I'm referring to here that are tracked are 
> the following:
> src/cntk-model/src/main/scala/CNTKModel.scala
> notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb
> The pyspark wrapper code is autogenerated
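For reference, a minimal Scala sketch (a hypothetical class, not the CNTKModel mentioned above) of the semantics the report relies on: setDefault() alone should not make isSet() return true; only an explicit setter call should.

{code}
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

class OutputNodeParams(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("demo"))

  final val outputNodeIndex = new Param[Int](this, "outputNodeIndex", "index of the output node")
  final val outputNodeName = new Param[String](this, "outputNodeName", "name of the output node")
  setDefault(outputNodeIndex -> 0)

  def setOutputNodeName(value: String): this.type = set(outputNodeName, value)

  override def copy(extra: ParamMap): OutputNodeParams = defaultCopy(extra)
}

// In spark-shell:
//   val p = new OutputNodeParams()
//   p.isSet(p.outputNodeIndex)     // false: only a default was supplied
//   p.isDefined(p.outputNodeIndex) // true: a default counts as "defined"
//   p.setOutputNodeName("z")
//   p.isSet(p.outputNodeName)      // true: explicitly set
{code}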



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21542) Helper functions for custom Python Persistence

2017-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21542.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18742
[https://github.com/apache/spark/pull/18742]

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21542) Helper functions for custom Python Persistence

2017-08-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21542:
-

Assignee: Ajay Saini

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21633) Unary Transformer in Python

2017-08-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21633.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18746
[https://github.com/apache/spark/pull/18746]

> Unary Transformer in Python
> ---
>
> Key: SPARK-21633
> URL: https://issues.apache.org/jira/browse/SPARK-21633
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the abstract class UnaryTransformer is only implemented in Scala. 
> In order to make Pyspark easier to extend with custom transformers, it would 
> be helpful to have the implementation of UnaryTransformer in Python as well.
> This task involves:
> - implementing the class UnaryTransformer in Python
> - testing the transform() functionality of the class to make sure it works



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21633) Unary Transformer in Python

2017-08-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21633:
-

Assignee: Ajay Saini

> Unary Transformer in Python
> ---
>
> Key: SPARK-21633
> URL: https://issues.apache.org/jira/browse/SPARK-21633
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>
> Currently, the abstract class UnaryTransformer is only implemented in Scala. 
> In order to make Pyspark easier to extend with custom transformers, it would 
> be helpful to have the implementation of UnaryTransformer in Python as well.
> This task involves:
> - implementing the class UnaryTransformer in Python
> - testing the transform() functionality of the class to make sure it works



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21633) Unary Transformer in Python

2017-08-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21633:
--
Shepherd: Joseph K. Bradley

> Unary Transformer in Python
> ---
>
> Key: SPARK-21633
> URL: https://issues.apache.org/jira/browse/SPARK-21633
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currently, the abstract class UnaryTransformer is only implemented in Scala. 
> In order to make Pyspark easier to extend with custom transformers, it would 
> be helpful to have the implementation of UnaryTransformer in Python as well.
> This task involves:
> - implementing the class UnaryTransformer in Python
> - testing the transform() functionality of the class to make sure it works



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence

2017-08-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21542:
--
Shepherd: Joseph K. Bradley

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence

2017-07-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21542:
--
Description: 
Currently, there is no way to easily persist Json-serializable parameters in 
Python only. All parameters in Python are persisted by converting them to Java 
objects and using the Java persistence implementation. In order to facilitate 
the creation of custom Python-only pipeline stages, it would be good to have a 
Python-only persistence framework so that these stages do not need to be 
implemented in Scala for persistence. 

This task involves:
- Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
DefaultParamsReader, and DefaultParamsWriter in pyspark.

  was:
Currnetly, there is no way to easily persist Json-serializable parameters in 
Python only. All parameters in Python are persisted by converting them to Java 
objects and using the Java persistence implementation. In order to facilitate 
the creation of custom Python-only pipeline stages, it would be good to have a 
Python-only persistence framework so that these stages do not need to be 
implemented in Scala for persistence. 

This task involves:
- Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
DefaultParamsReader, and DefaultParamsWriter in pyspark.


> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence

2017-07-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21542:
--
Component/s: ML

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currnetly, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13786) Pyspark ml.tuning support export/import

2017-07-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13786.
---
Resolution: Duplicate

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators.  Hopefully that can 
> leverage the Java implementations; there is not a real need to make Python 
> Evaluators be MLWritable, as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import

2017-07-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101057#comment-16101057
 ] 

Joseph K. Bradley commented on SPARK-13786:
---

This has been resolved now via [SPARK-11893], [SPARK-6791], and [SPARK-21221].

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators.  Hopefully that can 
> leverage the Java implementations; there is not a real need to make Python 
> Evaluators be MLWritable, as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099229#comment-16099229
 ] 

Joseph K. Bradley commented on SPARK-21523:
---

CC [~yanboliang] [~yuhaoyan] [~dbtsai] making a few people aware of this

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need to merge this Breeze bugfix into Spark because it affects a series of 
> algorithms in MLlib that use LBFGS.
> https://github.com/scalanlp/breeze/pull/651



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21523:
--
Description: 
We need merge this breeze bugfix into spark because it influence a series of 
algos in MLlib which use LBFGS.
https://github.com/scalanlp/breeze/pull/651

  was:
We need merge this breeze bugfix into spark because it influence a series of 
algos in MLLib which use LBFGS.
https://github.com/scalanlp/breeze/pull/651


> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need to merge this Breeze bugfix into Spark because it affects a series of 
> algorithms in MLlib that use LBFGS.
> https://github.com/scalanlp/breeze/pull/651



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala

2017-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090459#comment-16090459
 ] 

Joseph K. Bradley commented on SPARK-15574:
---

(Just commented on the PR; I'm uncertain about the need for this.)

> Python meta-algorithms in Scala
> ---
>
> Key: SPARK-15574
> URL: https://issues.apache.org/jira/browse/SPARK-15574
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This is an experimental idea for implementing Python ML meta-algorithms 
> (CrossValidator, TrainValidationSplit, Pipeline, OneVsRest, etc.) in Scala.  
> This would require a Scala wrapper for algorithms implemented in Python, 
> somewhat analogous to Python UDFs.
> The benefit of this change would be that we could avoid currently awkward 
> conversions between Scala/Python meta-algorithms required for persistence.  
> It would let us have full support for Python persistence and would generally 
> simplify the implementation within MLlib.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest

2017-07-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21221.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18428
[https://github.com/apache/spark/pull/18428]

> CrossValidator and TrainValidationSplit Persist Nested Estimators such as 
> OneVsRest
> ---
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087666#comment-16087666
 ] 

Joseph K. Bradley commented on SPARK-20090:
---

Sorry for not seeing this.  You're right about there being names.  I'm happy 
with either, but matching Scala would be nice.

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.
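
For reference, the Scala behavior the Python API would mirror (a quick sketch):

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

schema.fieldNames  // Array(id, name)
{code}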



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20099) Add transformSchema to pyspark.ml

2017-07-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086422#comment-16086422
 ] 

Joseph K. Bradley commented on SPARK-20099:
---

[~holdenk] [~yanboliang] [~yuhaoyan] [~mlnick] CCing a few people since 
[~WeichenXu123] is interested in working on this.  Do you think it's reasonable 
to add PipelineStage to Python in order to add transformSchema?

Pro: early schema failure detection in Python

Con: duplication of schema checking logic in Python
* I don't see a good way to do schema checking in Python for Pipelines without 
this duplication.  The only way would be to convert Pipelines to Scala 
equivalents before executing them; i.e., the Pipeline implementation would be 
in Scala only.  The problem is that we need Pipelines implemented in Python as 
well in order to support Python-only implementations of Transformers and 
Estimators (for custom use cases).

A reasonable way to do this in a series of PRs would be to:
* Add PipelineStage abstraction, with an abstract transformSchema method (the existing 
Scala contract is sketched just after this list)
* For each Transformer/Estimator/Model in Python, change it to inherit from 
PipelineStage
* Finally, change Pipeline and PipelineModel to call transformSchema on their 
sequences of stages
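
For reference, a rough Scala sketch of the {{transformSchema}} contract that a Python 
{{PipelineStage}} would mirror. The transformer and column names are arbitrary 
illustrations, not a proposed API:

{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

class DoublingTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("doubler"))

  // Fail fast: validate the input schema and declare the output schema before
  // touching any data.  This is what early failure detection in a Pipeline buys us.
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains("value"), "Input must have a 'value' column")
    require(schema("value").dataType == DoubleType, "'value' must be DoubleType")
    StructType(schema.fields :+ StructField("doubled", DoubleType, nullable = false))
  }

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)  // reuse the same check at execution time
    dataset.withColumn("doubled", col("value") * 2)
  }

  override def copy(extra: ParamMap): DoublingTransformer = defaultCopy(extra)
}
{code}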

> Add transformSchema to pyspark.ml
> -
>
> Key: SPARK-20099
> URL: https://issues.apache.org/jira/browse/SPARK-20099
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>
> Python's ML API currently lacks the PipelineStage abstraction.  This 
> abstraction's main purpose is to provide transformSchema() for checking for 
> early failures in a Pipeline.
> As mentioned in https://github.com/apache/spark/pull/17218 it would also be 
> useful in Python for checking Params in Python wrapper for Scala 
> implementations; in these, transformSchema would involve passing Params in 
> Python to Scala, which would then be able to validate the Param values.  This 
> could prevent late failures from bad Param settings in Pipeline execution, 
> while still allowing us to check Param values on only the Scala side.
> This issue is for adding transformSchema() to pyspark.ml.  If it's 
> reasonable, we could create a PipelineStage abstraction.  But it'd probably 
> be fine to add transformSchema() directly to Transformer and Estimator, 
> rather than creating PipelineStage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-07-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21221:
-

Assignee: Ajay Saini

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-07-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21221:
--
Affects Version/s: (was: 2.1.1)
   2.2.0

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20604) Allow Imputer to handle all numeric types

2017-07-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20604:
--
Issue Type: Improvement  (was: Bug)

> Allow Imputer to handle all numeric types
> -
>
> Key: SPARK-20604
> URL: https://issues.apache.org/jira/browse/SPARK-20604
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> Imputer currently requires input column to be Double or Float, but the logic 
> should work on any numeric data types. Many practical problems have integer  
> data types, and it could get very tedious to manually cast them into Double 
> before calling imputer. This transformer could be extended to handle all 
> numeric types.  
> The example below shows failure of Imputer on integer data. 
> {code}
> import org.apache.spark.ml.feature.Imputer
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.IntegerType
> val df = spark.createDataFrame( Seq(
>   (0, 1.0, 1.0, 1.0),
>   (1, 11.0, 11.0, 11.0),
>   (2, 1.5, 1.5, 1.5),
>   (3, Double.NaN, 4.5, 1.5)
> )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
> val imputer = new Imputer()
>   .setInputCols(Array("value1"))
>   .setOutputCols(Array("out1"))
> imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))
> java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
> of type equal to one of the following types: [DoubleType, FloatType] but was 
> actually of type IntegerType.
> {code}
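
A hedged sketch of the relaxation being proposed: validate that each input column is some 
{{NumericType}} and widen it to double internally before computing the surrogate. The 
helper name and structure are illustrative only, not the actual Imputer code.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, NumericType}

def prepareNumericColumns(df: DataFrame, inputCols: Seq[String]): DataFrame =
  inputCols.foldLeft(df) { (cur, c) =>
    require(cur.schema(c).dataType.isInstanceOf[NumericType],
      s"Column $c must be numeric, but was ${cur.schema(c).dataType}.")
    cur.withColumn(c, col(c).cast(DoubleType))  // IntegerType, LongType, FloatType all widen cleanly
  }
{code}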



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD

2017-07-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21241:
--
Issue Type: New Feature  (was: Bug)

> Add intercept to StreamingLinearRegressionWithSGD
> -
>
> Key: SPARK-21241
> URL: https://issues.apache.org/jira/browse/SPARK-21241
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Soulaimane GUEDRIA
>
> The StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
> method, which makes it possible to turn the intercept on/off. API 
> parity is not respected between Python and Scala.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest

2017-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081503#comment-16081503
 ] 

Joseph K. Bradley commented on SPARK-20133:
---

Sorry for the slow response; please feel free to!

> User guide for spark.ml.stat.ChiSquareTest
> --
>
> Key: SPARK-20133
> URL: https://issues.apache.org/jira/browse/SPARK-20133
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add new user guide section for spark.ml.stat, and document ChiSquareTest.  
> This may involve adding new example scripts.
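
For reference, the kind of snippet the new user guide section would likely center on (a 
sketch assuming a spark-shell session; the toy data is arbitrary):

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ChiSquareTest

val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (1.0, Vectors.dense(3.5, 40.0)))

val df = spark.createDataFrame(data).toDF("label", "features")
// One independence test per feature column: returns pValues, degreesOfFreedom, statistics.
ChiSquareTest.test(df, "features", "label").show(truncate = false)
{code}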



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081500#comment-16081500
 ] 

Joseph K. Bradley commented on SPARK-21086:
---

I like the idea for that path, but it could become really long in some cases, 
so I'd prefer to use indices instead for robustness.

Driver memory shouldn't be a big problem since all models are already collected 
to the driver.

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel

2017-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081483#comment-16081483
 ] 

Joseph K. Bradley commented on SPARK-21341:
---

+1 for the built-in save/load.  Saving as an object file is not something MLlib 
is meant to support.
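
For reference, a minimal sketch of the supported path instead of {{saveAsObjectFile}} 
(assuming {{pipelineModel}} is an already-fitted PipelineModel containing the Word2Vec stage):

{code}
import org.apache.spark.ml.PipelineModel

pipelineModel.write.overwrite().save("/tmp/w2v-pipeline")  // ML-native persistence
val restored = PipelineModel.load("/tmp/w2v-pipeline")     // wordVectors are rebuilt on load
{code}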

> Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel 
> -
>
> Key: SPARK-21341
> URL: https://issues.apache.org/jira/browse/SPARK-21341
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Zied Sellami
>
> I am using sparkContext.saveAsObjectFile to save a complex object containing a 
> pipelineModel with a Word2Vec ML Transformer. When I load the object and call 
> myPipelineModel.transform, Word2VecModel raises a null pointer error at line 
> 292 of Word2Vec.scala ("wordVectors.getVectors"). I resolved the problem by 
> removing the @transient annotation on val wordVectors and the @transient lazy 
> val on the getVectors function.
> - Why are these 2 vals transient?
> - Is there a way to add a boolean option on the Word2Vec Transformer to force 
> the serialization of wordVectors?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-07-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21208:
--
Issue Type: New Feature  (was: Bug)

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> SparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> We need the ability to *setLocalProperty* on the SparkContext (similar to what is 
> available for PySpark and Scala).
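
For reference, what the Scala API (mirrored by PySpark) already offers and SparkR lacks; a 
minimal sketch with an arbitrary pool name:

{code}
// Properties set here are scoped to jobs submitted from the current thread.
sc.setLocalProperty("spark.scheduler.pool", "production")
// ... run jobs that should go to the "production" fair-scheduler pool ...
sc.setLocalProperty("spark.scheduler.pool", null)  // clear the property afterwards
{code}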



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel

2017-07-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21341.
---
Resolution: Not A Problem

> Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel 
> -
>
> Key: SPARK-21341
> URL: https://issues.apache.org/jira/browse/SPARK-21341
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Zied Sellami
>
> I am using sparkContext.saveAsObjectFile to save a complex object containing a 
> pipelineModel with a Word2Vec ML Transformer. When I load the object and call 
> myPipelineModel.transform, Word2VecModel raises a null pointer error at line 
> 292 of Word2Vec.scala ("wordVectors.getVectors"). I resolved the problem by 
> removing the @transient annotation on val wordVectors and the @transient lazy 
> val on the getVectors function.
> - Why are these 2 vals transient?
> - Is there a way to add a boolean option on the Word2Vec Transformer to force 
> the serialization of wordVectors?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-06-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20929:
--
Target Version/s: 2.2.0  (was: 2.2.1, 2.3.0)

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.2.0
>
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.
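
A short sketch of why the shared Param does not fit: LinearSVC's {{threshold}} is compared 
against the raw margin, so its natural default is 0.0 rather than the 0.5 one would use 
for a probability (assuming a spark-shell session):

{code}
import org.apache.spark.ml.classification.LinearSVC

val svc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setThreshold(0.0)  // applied to rawPrediction (the margin), not to a probability
{code}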



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-06-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20929:
--
Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.2.0
>
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21221:
--
Shepherd: Joseph K. Bradley

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21166) Automated ML persistence

2017-06-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21166:
-

 Summary: Automated ML persistence
 Key: SPARK-21166
 URL: https://issues.apache.org/jira/browse/SPARK-21166
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


This JIRA is for discussing the possibility of automating ML persistence.  
Currently, custom save/load methods are written for every Model.  However, we 
could design a mixin which provides automated persistence, inspecting model 
data and Params and reading/writing (known types) automatically.  This was 
brought up in discussions with developers behind 
https://github.com/azure/mmlspark

Some issues we will need to consider:
* Providing generic mixin usable in most or all cases
* Handling corner cases (strange Param types, etc.)
* Backwards compatibility (loading models saved by old Spark versions)

Because of backwards compatibility in particular, it may make sense to 
implement testing for that first, before we try to address automated 
persistence: [SPARK-15573]
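
As a small illustration of the "inspect Params automatically" idea (a hedged sketch, not 
the proposed mixin itself): every {{Params}} instance can already enumerate its parameters 
generically, which is the hook an automated writer would build on.

{code}
import org.apache.spark.ml.feature.Binarizer

val stage = new Binarizer()
  .setThreshold(0.5)
  .setInputCol("feature")
  .setOutputCol("binarized")

// Enumerate the stage's params (defaults plus explicitly set values) generically.
stage.extractParamMap().toSeq.foreach { pair =>
  println(s"${pair.param.name} -> ${pair.value}")
}
{code}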



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20114:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-14501

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this JIRA to track feature parity for PrefixSpan and sequential 
> pattern mining in spark.ml with the DataFrame API. 
> First, a few design issues to be discussed are listed here; then subtasks for the 
> Scala, Python and R APIs will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to being used directly for predicting on new records. Please 
> read 
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which would not add a 
> new column to the input Dataset. The PrefixSpanModel would only be used to provide 
> access to the frequent sequential patterns.
>  #*  Add the ability to extract sequential rules from sequential 
> patterns, then use those sequential rules in transform, as FPGrowthModel does. 
> The rules extracted are of the form X -> Y where X and Y are sequential 
> patterns. But in practice these rules are not very good, as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are X -> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 
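
For reference, the existing RDD-based API that a spark.ml wrapper would build on (a 
minimal sketch assuming a spark-shell session; the toy sequences are arbitrary):

{code}
import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))), 2).cache()

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)

model.freqSequences.collect().foreach { freqSequence =>
  println(
    freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") +
      ", " + freqSequence.freq)
}
{code}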



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-06-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20929.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 18151
[https://github.com/apache/spark/pull/18151]

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20929) LinearSVC should not use shared Param HasThresholds

2017-06-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20929:
--
Target Version/s: 2.2.1, 2.3.0

> LinearSVC should not use shared Param HasThresholds
> ---
>
> Key: SPARK-20929
> URL: https://issues.apache.org/jira/browse/SPARK-20929
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> LinearSVC applies the Param 'threshold' to the rawPrediction, not the 
> probability.  It has different semantics than the shared Param HasThreshold, 
> so it should not use the shared Param.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21088) CrossValidator, TrainValidationSplit should preserve all models after fitting: Python

2017-06-13 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21088:
-

 Summary: CrossValidator, TrainValidationSplit should preserve all 
models after fitting: Python
 Key: SPARK-21088
 URL: https://issues.apache.org/jira/browse/SPARK-21088
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21088) CrossValidator, TrainValidationSplit should preserve all models after fitting: Python

2017-06-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21088:
--
Component/s: PySpark

> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Python
> -
>
> Key: SPARK-21088
> URL: https://issues.apache.org/jira/browse/SPARK-21088
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala

2017-06-13 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21087:
-

 Summary: CrossValidator, TrainValidationSplit should preserve all 
models after fitting: Scala
 Key: SPARK-21087
 URL: https://issues.apache.org/jira/browse/SPARK-21087
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-06-13 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21086:
-

 Summary: CrossValidator, TrainValidationSplit should preserve all 
models after fitting
 Key: SPARK-21086
 URL: https://issues.apache.org/jira/browse/SPARK-21086
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


I've heard multiple requests for having CrossValidatorModel and 
TrainValidationSplitModel preserve the full list of fitted models.  This sounds 
very valuable.

One decision should be made before we do this: Should we save and load the 
models in ML persistence?  That could blow up the size of a saved Pipeline if 
the models are large.
* I suggest *not* saving the models by default but allowing saving if 
specified.  We could specify whether to save the model as an extra Param for 
CrossValidatorModelWriter, but we would have to make sure to expose 
CrossValidatorModelWriter as a public API and modify the return type of 
CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not be 
a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048477#comment-16048477
 ] 

Joseph K. Bradley commented on SPARK-13333:
---

Thanks for explaining!  I just rediscovered this issue (which I'd forgotten): 
[SPARK-21043]

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21027:
--
Shepherd: Joseph K. Bradley

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is currently not tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028
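
As a generic illustration of the per-class parallelism being discussed (a sketch of the 
technique only, not the OneVsRest implementation; a Python version would do the equivalent 
with something like {{concurrent.futures}}):

{code}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext = ExecutionContext.global

// Stand-in for fitting one binary "class k vs. rest" model.
def trainBinaryModel(classLabel: Int): String = s"model-for-class-$classLabel"

val numClasses = 4
// Kick off one training per class concurrently, then wait for all of them.
val futures = (0 until numClasses).map(k => Future(trainBinaryModel(k)))
val models = Await.result(Future.sequence(futures), Duration.Inf)
{code}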



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14450) Python OneVsRest should train multiple models at once

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14450.
-
Resolution: Duplicate

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047245#comment-16047245
 ] 

Joseph K. Bradley commented on SPARK-14450:
---

See linked JIRA for new issue.

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047244#comment-16047244
 ] 

Joseph K. Bradley commented on SPARK-14450:
---

Scala already has parallelization.  I just rediscovered this issue...so I'll 
close this old copy.

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047243#comment-16047243
 ] 

Joseph K. Bradley commented on SPARK-21027:
---

Copying from [ML-14450]:

[SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
related to using existing libraries like {{multiprocessing}}, we are not 
training multiple models in parallel initially.

This issue is for prototyping, testing, and implementing a way to train 
multiple models at once.  Speaking with [~joshrosen], a good option might be 
the concurrent.futures package:
* Python 3.x: 
[https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
* Python 2.x: [https://pypi.python.org/pypi/futures]

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is currently not tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047243#comment-16047243
 ] 

Joseph K. Bradley edited comment on SPARK-21027 at 6/12/17 11:54 PM:
-

Copying from [SPARK-14450]:

[SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
related to using existing libraries like {{multiprocessing}}, we are not 
training multiple models in parallel initially.

This issue is for prototyping, testing, and implementing a way to train 
multiple models at once.  Speaking with [~joshrosen], a good option might be 
the concurrent.futures package:
* Python 3.x: 
[https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
* Python 2.x: [https://pypi.python.org/pypi/futures]


was (Author: josephkb):
Copying from [ML-14450]:

[SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
related to using existing libraries like {{multiprocessing}}, we are not 
training multiple models in parallel initially.

This issue is for prototyping, testing, and implementing a way to train 
multiple models at once.  Speaking with [~joshrosen], a good option might be 
the concurrent.futures package:
* Python 3.x: 
[https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
* Python 2.x: [https://pypi.python.org/pypi/futures]

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is currently not tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047241#comment-16047241
 ] 

Joseph K. Bradley commented on SPARK-21027:
---

Whoops! I realized I'd reported this long ago...and I'd said there might be 
issues with python multiprocessing.  I don't recall what those issues were.  
Let's investigate.

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is currently not tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21050.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 18265
[https://github.com/apache/spark/pull/18265]

> ml word2vec write has overflow issue in calculating numPartitions
> -
>
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.2.1, 2.3.0
>
>
> The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib 
> version), so it is very easy to have an overflow when calculating the number 
> of partitions for ML persistence.
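
A small hedged illustration of the failure mode described (the constants are arbitrary, 
not the actual Word2Vec sizing logic): doing the size arithmetic in {{Int}} silently wraps 
around, while widening to {{Long}} first does not.

{code}
val vectorSize = 1000
val numWords = 10 * 1000 * 1000  // 10 million vocabulary entries
val bytesPerFloat = 4

val sizeInt: Int = vectorSize * numWords * bytesPerFloat           // overflows: silently wrong
val sizeLong: Long = vectorSize.toLong * numWords * bytesPerFloat  // correct: 4e10 bytes
println(s"Int arithmetic: $sizeInt, Long arithmetic: $sizeLong")
{code}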



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20499:
--
Fix Version/s: 2.2.0

> Spark MLlib, GraphX 2.2 QA umbrella
> ---
>
> Key: SPARK-20499
> URL: https://issues.apache.org/jira/browse/SPARK-20499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-20508].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20507) Update MLlib, GraphX websites for 2.2

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20507:
--
Fix Version/s: 2.2.0

> Update MLlib, GraphX websites for 2.2
> -
>
> Key: SPARK-20507
> URL: https://issues.apache.org/jira/browse/SPARK-20507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.2.0
>
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20511:
-

Assignee: Felix Cheung

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18864:
--
Fix Version/s: 2.2.0

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
> Fix For: 2.2.0
>
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20511:
--
Fix Version/s: 2.2.0

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
> Fix For: 2.2.0
>
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20508) Spark R 2.2 QA umbrella

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20508:
-

Assignee: Felix Cheung  (was: Joseph K. Bradley)

> Spark R 2.2 QA umbrella
> ---
>
> Key: SPARK-20508
> URL: https://issues.apache.org/jira/browse/SPARK-20508
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


