[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-27 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260981#comment-15260981
 ] 

Xiangrui Meng commented on SPARK-14831:
---

+1 on `read.ml` and `write.ml`, which are consistent with `read.df` and 
`write.df` and leave space for future features. Putting the discussions 
together, we have:

* read.ml and write.ml for saving/loading ML models
* "spark." prefix to ML algorithms, especially if we cannot closely match 
existing R methods or have to shadow them. This includes:
** spark.glm and glm (which doesn't shadow stats::glm)
** spark.kmeans
** spark.naiveBayes
** spark.survreg

For methods with the `spark.` prefix, I suggest the following signature:

{code:none}
spark.kmeans(df, formula, [required params], [optional params], ...)
{code}
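
For concreteness, here is a hypothetical end-to-end sketch under this proposal (the formula, the `k` argument name, and the paths are illustrative, not a final API):

{code:none}
# a minimal sketch, assuming the proposed spark.kmeans, write.ml, and read.ml
df <- createDataFrame(sqlContext, iris)
model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)
write.ml(model, "/tmp/kmeans.model")
model2 <- read.ml("/tmp/kmeans.model")
{code}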

Sounds good?

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute, but I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]






[jira] [Updated] (SPARK-14315) GLMs model persistence in SparkR

2016-04-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14315:
--
Assignee: Gayathri Murali

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Gayathri Murali
>







[jira] [Updated] (SPARK-14314) K-means model persistence in SparkR

2016-04-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14314:
--
Shepherd: Yanbo Liang
Assignee: Gayathri Murali
Target Version/s: 2.0.0

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Gayathri Murali
>







[jira] [Updated] (SPARK-14315) GLMs model persistence in SparkR

2016-04-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14315:
--
Target Version/s: 2.0.0

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Gayathri Murali
>







[jira] [Updated] (SPARK-14315) GLMs model persistence in SparkR

2016-04-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14315:
--
Shepherd: Yanbo Liang

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Gayathri Murali
>







[jira] [Resolved] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14313.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12685
[https://github.com/apache/spark/pull/12685]

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14313:
--
Assignee: Yanbo Liang

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Updated] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14313:
--
Target Version/s: 2.0.0

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Resolved] (SPARK-14312) NaiveBayes model persistence in SparkR

2016-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14312.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12573
[https://github.com/apache/spark/pull/12573]

> NaiveBayes model persistence in SparkR
> --
>
> Key: SPARK-14312
> URL: https://issues.apache.org/jira/browse/SPARK-14312
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14850:
--
Priority: Blocker  (was: Critical)

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]






[jira] [Updated] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14850:
--
Affects Version/s: 1.5.2

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]






[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254538#comment-15254538
 ] 

Xiangrui Meng commented on SPARK-14850:
---

Ran the following code with different Spark versions:

{code}
// run in spark-shell, where sc and the toDF implicit are already in scope;
// Vectors still needs an explicit import:
import org.apache.spark.mllib.linalg.Vectors

sc.parallelize(0 until 1e4.toInt, 1).map { i =>
  (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
}.toDF.rdd.count()
{code}

Durations:
* 1.4.1: 22s
* 1.5.2: 282s
* 1.6.0: 360s
* 1.6.1: 340s

So it is about a 15x slowdown on serialization.

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]






[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254516#comment-15254516
 ] 

Xiangrui Meng commented on SPARK-14831:
---

1. Please see my reply to Felix above for the issue with methods that are 
similar but slightly different. I totally agree that we should use the same 
method name if we can safely override the base R functions and match their 
features 100%. However, that depends on how the existing R methods are defined 
and on their signatures. Some ML methods in SparkR actually shadow the existing 
ones, so users need to specify the namespace after SparkR is loaded.

2. +1 on the `spark.` prefix. A related task is save/load for MLlib models. If 
we want to call them `spark.save` and `spark.load`, we need to discuss how to 
implement them. It would be nice if save/load worked for both DataFrames and ML 
models.
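
As a hedged illustration of the shadowing point (`stats::filter` is one of the functions SparkR actually masks on load):

{code:none}
library(SparkR)  # masks several base/stats functions, e.g. stats::filter
# an unqualified call now dispatches to SparkR's generic, so reaching the
# original function on plain R objects requires an explicit namespace:
y <- stats::filter(1:10, rep(1/3, 3))  # 3-point moving average from stats
{code}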

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute, but I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]






[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254510#comment-15254510
 ] 

Xiangrui Meng commented on SPARK-14831:
---

We have been trying to mimic existing R APIs in SparkR. That gave users the 
impression that existing R code should work magically after they convert the 
input data.frame to SparkR's DataFrame. However, this is not true for the 
DataFrame APIs, nor for the ML APIs in SparkR. For example, we defined an 
`algorithm` argument in `kmeans` because R's kmeans has that argument, but the 
two actually mean different things: ours selects the initialization algorithm 
while R's selects the training algorithm. This is quite annoying to users when 
methods are similar but have subtle differences. If we don't use the same 
method name, users will probably look at the help first before trying it.
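
A sketch of that subtle difference (the SparkR signature below is the 1.x one under discussion; argument values come from the respective help pages):

{code:none}
# base R: `algorithm` chooses the training algorithm
stats::kmeans(iris[, 1:4], centers = 3, algorithm = "Lloyd")
# SparkR (df is a SparkR DataFrame): `algorithm` chooses the initialization
# mode ("random" or "k-means||"), not the training algorithm
kmeans(df, centers = 3, algorithm = "k-means||")
{code}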

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute, but I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]






[jira] [Created] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14850:
-

 Summary: VectorUDT/MatrixUDT should take primitive arrays without 
boxing
 Key: SPARK-14850
 URL: https://issues.apache.org/jira/browse/SPARK-14850
 Project: Spark
  Issue Type: Improvement
  Components: ML, SQL
Affects Versions: 1.6.1, 2.0.0
Reporter: Xiangrui Meng
Priority: Critical


In SPARK-9390, we switched to using GenericArrayData to store indices and values 
in vector/matrix UDTs. However, GenericArrayData is not specialized for 
primitive types. This might hurt MLlib performance badly. We should consider 
either specializing GenericArrayData or using a different container.






[jira] [Updated] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14850:
--
Description: 
In SPARK-9390, we switched to using GenericArrayData to store indices and values 
in vector/matrix UDTs. However, GenericArrayData is not specialized for 
primitive types. This might hurt MLlib performance badly. We should consider 
either specializing GenericArrayData or using a different container.

cc: [~cloud_fan] [~yhuai]

  was:In SPARK-9390, we switched to using GenericArrayData to store indices and 
values in vector/matrix UDTs. However, GenericArrayData is not specialized for 
primitive types. This might hurt MLlib performance badly. We should consider 
either specializing GenericArrayData or using a different container.


> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]






[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-04-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253140#comment-15253140
 ] 

Xiangrui Meng commented on SPARK-14314:
---

Please hold until the naive Bayes one gets merged.




> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Updated] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14831:
--
Description: 
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(df, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute, but I think it would be 
better to have consistent signatures in SparkR.

cc: [~shivaram] [~josephkb] [~yanboliang]

  was:
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute, but I think it would be 
better to have consistent signatures in SparkR.

cc: [~shivaram] [~josephkb] [~yanboliang]


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute, but I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]






[jira] [Updated] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14831:
--
Description: 
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute, but I think it would be 
better to have consistent signatures in SparkR.

cc: [~shivaram] [~josephkb] [~yanboliang]

  was:
In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute, but I think it would be 
better to have consistent signatures in SparkR.


> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(data, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute, but I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]






[jira] [Created] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14831:
-

 Summary: Make ML APIs in SparkR consistent
 Key: SPARK-14831
 URL: https://issues.apache.org/jira/browse/SPARK-14831
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


In current master, we have 4 ML methods in SparkR:

{code:none}
glm(formula, family, data, ...)
kmeans(data, centers, ...)
naiveBayes(formula, data, ...)
survreg(formula, data, ...)
{code}

We tried to keep the signatures similar to existing ones in R. However, if we 
put them together, they are not consistent. One example is k-means, which 
doesn't accept a formula. Instead of looking at each method independently, we 
might want to update the signature of kmeans to

{code:none}
kmeans(formula, data, centers, ...)
{code}

We can also discuss possible global changes here. For example, `glm` puts 
`family` before `data` while `kmeans` puts `centers` after `data`. This is not 
consistent. And logically, the formula doesn't mean anything without 
associating with a DataFrame. So it makes more sense to me to have the 
following signature:

{code:none}
algorithm(data, formula, [required params], [optional params])
{code}

If we make this change, we might want to avoid name collisions because they 
have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.

Sorry for discussing API changes at the last minute, but I think it would be 
better to have consistent signatures in SparkR.






[jira] [Resolved] (SPARK-14479) GLM supports output link prediction

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14479.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12287
[https://github.com/apache/spark/pull/12287]

> GLM supports output link prediction
> ---
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> In R's glm and glmnet, the default type of predict is "link", which is the 
> linear predictor; users can specify type = "response" to output the response 
> prediction. Currently the ML glm predict outputs the "response" prediction by 
> default, which I think is more reasonable. Should we change the default type 
> of the ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.
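
For reference, the two predict types look like this in base R (a self-contained toy example):

{code:none}
d <- data.frame(counts = c(18, 17, 15, 20, 10, 20), treatment = gl(2, 3))
fit <- stats::glm(counts ~ treatment, family = poisson, data = d)
predict(fit)                     # default type = "link": the linear predictor
predict(fit, type = "response")  # mean scale; what ML glm returns by default
{code}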






[jira] [Updated] (SPARK-14479) GLM supports output link prediction

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14479:
--
Assignee: Yanbo Liang

> GLM supports output link prediction
> ---
>
> Key: SPARK-14479
> URL: https://issues.apache.org/jira/browse/SPARK-14479
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> In R's glm and glmnet, the default type of predict is "link", which is the 
> linear predictor; users can specify type = "response" to output the response 
> prediction. Currently the ML glm predict outputs the "response" prediction by 
> default, which I think is more reasonable. Should we change the default type 
> of the ML glm predict output? 
> R glm: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html
> R glmnet: http://www.inside-r.org/packages/cran/glmnet/docs/predict.glmnet
> Meanwhile, we should decide the default type of glm predict output in SparkR.






[jira] [Commented] (SPARK-7992) Hide private classes/objects in generated Java API doc

2016-04-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253031#comment-15253031
 ] 

Xiangrui Meng commented on SPARK-7992:
--

Thanks for making this work in the official repo! I'm going to close this JIRA.

I posted some issues on https://github.com/typesafehub/genjavadoc/issues/73, 
which we can discuss either there or in SPARK-14511.

> Hide private classes/objects in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.






[jira] [Updated] (SPARK-7992) Hide private classes/objects in generated Java API doc

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7992:
-
Assignee: Jakob Odersky

> Hide private classes/objects in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Jakob Odersky
> Fix For: 2.0.0
>
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.






[jira] [Resolved] (SPARK-7992) Hide private classes/objects in generated Java API doc

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7992.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Hide private classes/objects in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Jakob Odersky
> Fix For: 2.0.0
>
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.






[jira] [Updated] (SPARK-14312) NaiveBayes model persistence in SparkR

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14312:
--
Target Version/s: 2.0.0

> NaiveBayes model persistence in SparkR
> --
>
> Key: SPARK-14312
> URL: https://issues.apache.org/jira/browse/SPARK-14312
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Updated] (SPARK-14312) NaiveBayes model persistence in SparkR

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14312:
--
Assignee: Yanbo Liang

> NaiveBayes model persistence in SparkR
> --
>
> Key: SPARK-14312
> URL: https://issues.apache.org/jira/browse/SPARK-14312
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Updated] (SPARK-14312) NaiveBayes model persistence in SparkR

2016-04-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14312:
--
Shepherd: Xiangrui Meng

> NaiveBayes model persistence in SparkR
> --
>
> Key: SPARK-14312
> URL: https://issues.apache.org/jira/browse/SPARK-14312
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Updated] (SPARK-7264) SparkR API for parallel functions

2016-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7264:
-
Target Version/s: 2.0.0

> SparkR API for parallel functions
> -
>
> Key: SPARK-7264
> URL: https://issues.apache.org/jira/browse/SPARK-7264
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Timothy Hunter
>
> This is a JIRA to discuss design proposals for enabling parallel R 
> computation in SparkR without exposing the entire RDD API. 
> The rationale for this is that the RDD API has a number of low-level 
> functions, and we would like to expose a more lightweight API that is both 
> friendly to R users and easy to maintain.
> http://goo.gl/GLHKZI has a first cut design doc.






[jira] [Updated] (SPARK-7264) SparkR API for parallel functions

2016-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7264:
-
Assignee: Timothy Hunter

> SparkR API for parallel functions
> -
>
> Key: SPARK-7264
> URL: https://issues.apache.org/jira/browse/SPARK-7264
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Timothy Hunter
>
> This is a JIRA to discuss design proposals for enabling parallel R 
> computation in SparkR without exposing the entire RDD API. 
> The rationale for this is that the RDD API has a number of low-level 
> functions, and we would like to expose a more lightweight API that is both 
> friendly to R users and easy to maintain.
> http://goo.gl/GLHKZI has a first cut design doc.






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14299:
--
Assignee: Xusen Yin

> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I deleted it for now because it's only about 
> how to create your own classifier, etc., which can be learned easily from 
> other examples and the ML code.
> ** SimpleParamsExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> ** SimpleTextClassificationPipeline.scala --> 
> ModelSelectionViaCrossValidationExample
> ** DataFrameExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> * Intended to be kept, with command-line support:
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.
> I'll take this one as an example. 






[jira] [Resolved] (SPARK-14299) Scala ML examples code merge and clean up

2016-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14299.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12366
[https://github.com/apache/spark/pull/12366]

> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I deleted it for now because it's only about 
> how to create your own classifier, etc., which can be learned easily from 
> other examples and the ML code.
> ** SimpleParamsExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> ** SimpleTextClassificationPipeline.scala --> 
> ModelSelectionViaCrossValidationExample
> ** DataFrameExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> * Intended to be kept, with command-line support:
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.
> I'll take this one as an example. 






[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14440:
--
Assignee: Xusen Yin

> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Since the 
> PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
> just extend JavaMLWriter and JavaMLReader without modifying any attributes 
> or methods, there is no need to keep them, just as we did for save/load in 
> ml/tuning.py.






[jira] [Resolved] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14440.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12216
[https://github.com/apache/spark/pull/12216]

> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Since the 
> PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
> just extend JavaMLWriter and JavaMLReader without modifying any attributes 
> or methods, there is no need to keep them, just as we did for save/load in 
> ml/tuning.py.






[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243793#comment-15243793
 ] 

Xiangrui Meng commented on SPARK-13944:
---

`mllib-local`, by its name, is not scoped to just local linear algebra. But 
let's talk about the linear algebra library first. MLlib provides implementations 
of standard machine learning algorithms on Spark. Our goal is to cover common 
use cases instead of all that are possible. So some companies and developers 
need to build their own algorithms or modify the implementation in MLlib to 
meet their use cases. To implement algorithms on Spark, a natural choice is to 
use MLlib's linear algebra library, which has good integration with built-in 
MLlib algorithms and DataFrames. But the issue is: what local linear algebra 
library should they use for online serving? MLlib's local linear algebra library 
is not an option because of its dependency on Spark Core. So people have to 
pick another library or maintain a fork. Neither is ideal due to offline/online 
inconsistency. Separating the linear algebra library out is a clear benefit to 
those developers. Btw, we have to provide linear algebra abstractions in MLlib 
because we cannot expose 3rd-party APIs in Spark public APIs. I think we are on 
the same page about it. 

Next, let's talk about other linear algebra libraries. If there existed a good 
Java linear algebra implementation that met our requirements, I would be more 
than happy to use it. For the requirements and the libraries you listed, please 
see my comment at 
https://issues.apache.org/jira/browse/SPARK-6442?focusedCommentId=14629182&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14629182.
 One thing I didn't mention there is what API compatibility those libraries 
promise. We picked breeze in Spark 1.0 because it was the best candidate at 
that time. If MTJ had a compatible license, I would go with it because it is a 
pure Java library. With breeze now we have to face the issue with Scala 2.12 
compatibility. See https://issues.apache.org/jira/browse/SPARK-14438. More 
linear algebra libraries have come out since Spark 1.0. It would be great if 
someone could spend the time to do the comparison and benchmark again.

On the maintenance side, we have been trying to keep the linear algebra library 
lightweight. DB's PR only contains 4000 lines including test code. While we 
have done a good job to keep it thin, we have also received lots of complaints 
for its lack of features. There is always this trade-off; we picked the 
lightweight side in Spark 1.x due to resource limits. Now, with more users and 
contributors, I 
think we should adjust the balance and provide more features and make both 
users and developers happier. This is mainly for the public APIs. Underneath, 
we can use an existing implementation to avoid duplicate work. But there are 
issues with this approach too, as I mentioned in the previous paragraph.

The reasons listed above should justify the motivation of this JIRA. We can 
briefly talk about model serving and what we need for local models. Do we need 
to implement local training? Perhaps not. We need model import, transform, and 
maybe online updates. To support other model-serving systems, we just need to 
open up the format we used for pipeline persistence, so it is readable by other 
systems. There is definitely work to do to stabilize the format we use. But 
making exported MLlib models and pipelines readable by other systems is 
certainly what we want to achieve. PMML is really not a good option here. XML 
doesn't seem to be the right format for this purpose and importing PMML is hard 
(at least there is no easy, Apache-license-compatible way to do it). We use 
Parquet and JSON, both of which are exchange formats. On our side, I still 
think it is valuable 
for us to provide a lightweight solution for online serving, e.g., local models 
or code generation. It would also make it easier for other systems like 
Prediction.IO because they can use the local models from MLlib directly. We can 
discuss ideas when we start 2.1 development. This is beyond the scope of this 
JIRA.

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined 

[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2016-04-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14657:
--
Target Version/s: 2.0.0

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features than R's glm when fitting w/o an 
> intercept on string/category features. In the following example, SparkR 
> outputs three features while native R outputs four.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
> Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269 -16.989  0
> Species_virginica   -1.4708   0.077397-19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> Sepal.Length0.3499 0.0463   7.557 4.19e-12 ***
> Speciessetosa   1.6765 0.2354   7.123 4.46e-11 ***
> Speciesversicolor   0.6931 0.2779   2.494   0.0137 *  
> Speciesvirginica0.6690 0.3078   2.174   0.0313 *  
> {quote}
> The encoder for string/category features is different: R did not drop any 
> category, but SparkR dropped the last one.
> I searched online and tested some other cases, and found that when we fit an R 
> glm model (or other models powered by R formula) w/o intercept on a dataset 
> that includes string/category features, one of the categories of the first 
> category feature is used as the reference category, and no category is dropped 
> for that feature.
> I think we should keep consistent semantics between Spark RFormula and R 
> formula.
> cc [~mengxr] 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2016-04-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14657:
--
Shepherd: Xiangrui Meng

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features than R's glm when fitting without an 
> intercept on data with string/category features. In the following example, 
> SparkR outputs three features while native R outputs four.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
> Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269 -16.989  0
> Species_virginica   -1.4708   0.077397-19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> Sepal.Length0.3499 0.0463   7.557 4.19e-12 ***
> Speciessetosa   1.6765 0.2354   7.123 4.46e-11 ***
> Speciesversicolor   0.6931 0.2779   2.494   0.0137 *  
> Speciesvirginica0.6690 0.3078   2.174   0.0313 *  
> {quote}
> The encoders for string/category features differ: R did not drop any 
> category, but SparkR dropped the last one.
> I searched online and tested some other cases. When we fit an R glm model (or 
> another model powered by R formula) without an intercept on a dataset that 
> includes string/category features, where normally one of the categories of 
> the first category feature would be used as the reference category, R does 
> not drop any category for that feature.
> I think we should keep consistent semantics between Spark RFormula and R 
> formula.
> cc [~mengxr] 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2016-04-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14657:
--
Assignee: Yanbo Liang

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features than R's glm when fitting without an 
> intercept on data with string/category features. In the following example, 
> SparkR outputs three features while native R outputs four.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
> Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269 -16.989  0
> Species_virginica   -1.4708   0.077397-19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> Sepal.Length0.3499 0.0463   7.557 4.19e-12 ***
> Speciessetosa   1.6765 0.2354   7.123 4.46e-11 ***
> Speciesversicolor   0.6931 0.2779   2.494   0.0137 *  
> Speciesvirginica0.6690 0.3078   2.174   0.0313 *  
> {quote}
> The encoders for string/category features differ: R did not drop any 
> category, but SparkR dropped the last one.
> I searched online and tested some other cases. When we fit an R glm model (or 
> another model powered by R formula) without an intercept on a dataset that 
> includes string/category features, where normally one of the categories of 
> the first category feature would be used as the reference category, R does 
> not drop any category for that feature.
> I think we should keep consistent semantics between Spark RFormula and R 
> formula.
> cc [~mengxr] 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13925) Expose R-like summary statistics in SparkR::glm for more family and link functions

2016-04-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13925:
--
Assignee: Yanbo Liang

> Expose R-like summary statistics in SparkR::glm for more family and link 
> functions
> --
>
> Key: SPARK-13925
> URL: https://issues.apache.org/jira/browse/SPARK-13925
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 2.0.0
>
>
> This continues the work of SPARK-11494, SPARK-9837, and SPARK-12566 to expose 
> R-like model summary in more family and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13925) Expose R-like summary statistics in SparkR::glm for more family and link functions

2016-04-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13925.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12393
[https://github.com/apache/spark/pull/12393]

> Expose R-like summary statistics in SparkR::glm for more family and link 
> functions
> --
>
> Key: SPARK-13925
> URL: https://issues.apache.org/jira/browse/SPARK-13925
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
> Fix For: 2.0.0
>
>
> This continues the work of SPARK-11494, SPARK-9837, and SPARK-12566 to expose 
> R-like model summary in more family and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14549) Copy the Vector and Matrix classes from mllib to ml in mllib-local

2016-04-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14549.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12317
[https://github.com/apache/spark/pull/12317]

> Copy the Vector and Matrix classes from mllib to ml in mllib-local
> --
>
> Key: SPARK-14549
> URL: https://issues.apache.org/jira/browse/SPARK-14549
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 2.0.0
>
>
> This task will copy the Vector and Matrix classes from mllib to the ml 
> package in the mllib-local jar. The UDTs and the `Since` annotations in the 
> ml vector and matrix classes will be removed for now; UDTs will be handled by 
> SPARK-14487, and `Since` will be replaced by `/* @since 1.2.0 */`.
> The BLAS implementation will be copied, and some of the test utilities will 
> be copied as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local

2016-04-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14653:
-

 Summary: Remove NumericParser and jackson dependency from 
mllib-local
 Key: SPARK-14653
 URL: https://issues.apache.org/jira/browse/SPARK-14653
 Project: Spark
  Issue Type: Sub-task
Reporter: Xiangrui Meng


After SPARK-14549, we should remove NumericParser and jackson from mllib-local, 
which were introduced very earlier and now replaced by UDTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import

2016-04-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14374.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12383
[https://github.com/apache/spark/pull/12383]

> PySpark ml GBTClassifier, Regressor support export/import
> -
>
> Key: SPARK-14374
> URL: https://issues.apache.org/jira/browse/SPARK-14374
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14646) k-means save/load should put one cluster per row

2016-04-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14646:
-

 Summary: k-means save/load should put one cluster per row
 Key: SPARK-14646
 URL: https://issues.apache.org/jira/browse/SPARK-14646
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.1
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


We store all k-means cluster centers in a single row in the current 
implementation. It would be better to store one cluster per row, so it is 
easier to add more columns such as size and cost. We should consider backward 
compatibility.
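
A minimal sketch of the one-cluster-per-row layout, with toy centers and a 
hypothetical output path (this is not the actual persistence code):

{code}
import org.apache.spark.sql.SparkSession

case class ClusterRow(id: Int, center: Array[Double], size: Long)

object KMeansSaveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kmeans-save-sketch").getOrCreate()
    // One row per cluster makes it easy to add columns (size, cost, ...) later.
    val centers = Seq(Array(0.0, 0.1), Array(1.0, 1.1))
    val sizes = Seq(10L, 12L)
    val rows = centers.zip(sizes).zipWithIndex.map { case ((c, n), i) =>
      ClusterRow(i, c, n)
    }
    spark.createDataFrame(rows).write.parquet("/tmp/kmeans-model/data")
    spark.stop()
  }
}
{code}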



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-04-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12869.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10839
[https://github.com/apache/spark/pull/10839]

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Minor
> Fix For: 2.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes

2016-04-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14565.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12360
[https://github.com/apache/spark/pull/12360]

> RandomForest should use parseInt and parseDouble for feature subset size 
> instead of regexes
> ---
>
> Key: SPARK-14565
> URL: https://issues.apache.org/jira/browse/SPARK-14565
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yong Tang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Using regex is not robust and hard to maintain.
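
One way to read the proposal, as a sketch (the helper name and the Either-based 
result are illustrative, not the issue's API):

{code}
// Try an integer count first, then fall back to a fraction; parse errors
// surface as NumberFormatException instead of a silently non-matching regex.
def parseFeatureSubset(s: String): Either[Int, Double] =
  try Left(s.toInt)
  catch { case _: NumberFormatException => Right(s.toDouble) }

// parseFeatureSubset("3")   => Left(3)
// parseFeatureSubset("0.5") => Right(0.5)
{code}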



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240322#comment-15240322
 ] 

Xiangrui Meng commented on SPARK-13944:
---

There are more production workflows using the RDD-based APIs than the 
DataFrame-based APIs, since many users are still running Spark 1.4 or earlier. 
It would be nice if we could keep binary compatibility for the RDD-based APIs 
in Spark 2.0. Using a type alias is not a good solution because 1) it is not 
Java-compatible, and 2) it introduces a dependency from the RDD-based API on 
mllib-local, which means future development on mllib-local might cause behavior 
changes or breaking changes in the RDD-based API. Since we already decided that 
the RDD-based API will go into maintenance mode in Spark 2.0, leaving some old 
code there won't increase the maintenance cost compared with the type alias.

We can provide a converter that converts all `mllib.linalg` types to 
`ml.linalg` types in Spark 2.0 to help users migrate to `ml.linalg`.
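
A minimal sketch of what such a converter could look like for vectors, assuming 
the 2.0 package layout (the object name is illustrative, not a committed API):

{code}
import org.apache.spark.{ml, mllib}

object LinalgConverter {
  // Both packages keep the same dense/sparse representations, so the
  // conversion is a straight field copy.
  def toML(v: mllib.linalg.Vector): ml.linalg.Vector = v match {
    case d: mllib.linalg.DenseVector  => new ml.linalg.DenseVector(d.values)
    case s: mllib.linalg.SparseVector =>
      new ml.linalg.SparseVector(s.size, s.indices, s.values)
  }
}
{code}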

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in the ml package; however, the existing mllib code 
> will not be touched. As a result, this will potentially break the API. Also, 
> when a vector is loaded from an mllib vector by Spark SQL, it will be 
> automatically converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239531#comment-15239531
 ] 

Xiangrui Meng commented on SPARK-14154:
---

[~yuhaoyan] Thanks for the benchmark! I reverted the change in master.

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-14154.
-
   Resolution: Not A Problem
Fix Version/s: (was: 2.0.0)

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238172#comment-15238172
 ] 

Xiangrui Meng commented on SPARK-14154:
---

Changed the priority to critical since we should decide before the feature 
freeze deadline.

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14154:
--
Priority: Critical  (was: Minor)

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14568:
--
Shepherd: Joseph K. Bradley

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14568:
--
Target Version/s: 2.0.0

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14568:
--
Assignee: Timothy Hunter

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14311) Model persistence in SparkR

2016-04-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237918#comment-15237918
 ] 

Xiangrui Meng commented on SPARK-14311:
---

I think we can implement a generic load in a Scala wrapper that returns 
`Object`. On the R side, we just return that object. It should work for R. Do 
you want to give it a try with naive Bayes?
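
One possible shape for that generic load, sketched below; the `RWrappers` entry 
point, the dispatch key, and the `NaiveBayesWrapper` stub are illustrative, not 
a committed API:

{code}
// Each wrapper type knows how to load itself; a single entry point
// dispatches on a class name stored in the saved metadata and returns Object.
trait RWrapper
case class NaiveBayesWrapper(pipelinePath: String) extends RWrapper

object RWrappers {
  def load(path: String, className: String): Object = className match {
    case "NaiveBayesWrapper" => NaiveBayesWrapper(path)
    case other => throw new UnsupportedOperationException(s"Cannot load $other")
  }
}
{code}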

> Model persistence in SparkR
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined the wrappers as S4 classes, so `ml.save` would be an S4 method and 
> `ml.load` an S3 method (correct me if I'm wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14549) Copy the Vector and Matrix classes from mllib to ml in mllib-local

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14549:
--
Shepherd: Xiangrui Meng

> Copy the Vector and Matrix classes from mllib to ml in mllib-local
> --
>
> Key: SPARK-14549
> URL: https://issues.apache.org/jira/browse/SPARK-14549
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> This task will copy the Vector and Matrix classes from mllib to the ml 
> package in the mllib-local jar. The UDTs and the `Since` annotations in the 
> ml vector and matrix classes will be removed for now; UDTs will be handled by 
> SPARK-14487, and `Since` will be replaced by `/* @since 1.2.0 */`.
> The BLAS implementation will be copied, and some of the test utilities will 
> be copied as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14147.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11958
[https://github.com/apache/spark/pull/11958]

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by the vector datatype; however, SparkR doesn't support 
> that type, so the features are displayed as raw environments.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
>   Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14147:
--
Assignee: Yanbo Liang

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by the vector datatype; however, SparkR doesn't support 
> that type, so the features are displayed as raw environments.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
>   Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14563) SQLTransformer.transformSchema is not implemented correctly

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14563.
---
   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 12330
[https://github.com/apache/spark/pull/12330]

> SQLTransformer.transformSchema is not implemented correctly
> ---
>
> Key: SPARK-14563
> URL: https://issues.apache.org/jira/browse/SPARK-14563
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.0.0, 1.6.2
>
>
> `transformSchema` uses `__THIS__` as a temp table name, which would cause 
> errors under HiveContext (in Spark 1.6):
> {code}
> org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '__THIS__' '' '' in join source; line 1 pos 39
> {code}
> It also exposes race conditions.
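
A sketch of the obvious remedy: generate a unique temp-table name per call 
instead of using the `__THIS__` literal directly (written against the 1.6-era 
`registerTempTable` API; the names are illustrative):

{code}
import java.util.UUID
import org.apache.spark.sql.DataFrame

def applyStatement(df: DataFrame, statement: String): DataFrame = {
  // A unique name avoids the HiveContext parse error on __THIS__ and the
  // race between concurrent transforms sharing one literal table name.
  val tableName = "sql_transformer_" + UUID.randomUUID().toString.replace("-", "")
  df.registerTempTable(tableName)
  df.sqlContext.sql(statement.replace("__THIS__", tableName))
}
{code}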



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13597) Python API for GeneralizedLinearRegression

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13597:
--
Assignee: Kai Jiang

> Python API for GeneralizedLinearRegression
> --
>
> Key: SPARK-13597
> URL: https://issues.apache.org/jira/browse/SPARK-13597
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Kai Jiang
>Priority: Critical
> Fix For: 2.0.0
>
>
> After SPARK-12811, we should add Python API for generalized linear regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13597) Python API for GeneralizedLinearRegression

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13597.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11468
[https://github.com/apache/spark/pull/11468]

> Python API for GeneralizedLinearRegression
> --
>
> Key: SPARK-13597
> URL: https://issues.apache.org/jira/browse/SPARK-13597
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>Priority: Critical
> Fix For: 2.0.0
>
>
> After SPARK-12811, we should add Python API for generalized linear regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13322.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11365
[https://github.com/apache/spark/pull/11365]

> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> This bug was reported by Stuti Awasthi.
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite because we do not standardize the features 
> before fitting the model; we should support feature standardization.
> Another benefit is that standardization will improve the convergence rate.
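
A minimal sketch of the standardization step being proposed, on mllib vectors 
(illustrative only; `featuresStd` would be computed from the training data):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Divide each feature by its training-set standard deviation; constant
// (zero-std) columns map to 0.0 so the loss cannot blow up on them.
def standardize(v: Vector, featuresStd: Array[Double]): Vector = {
  val scaled = v.toArray.zipWithIndex.map { case (x, i) =>
    if (featuresStd(i) != 0.0) x / featuresStd(i) else 0.0
  }
  Vectors.dense(scaled)
}
{code}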



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13590) Document the behavior of spark.ml logistic regression and AFT survival regression when there are constant features

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13590:
--
Summary: Document the behavior of spark.ml logistic regression and AFT 
survival regression when there are constant features  (was: Document the 
behavior of spark.ml logistic regression when there are constant features)

> Document the behavior of spark.ml logistic regression and AFT survival 
> regression when there are constant features
> --
>
> Key: SPARK-13590
> URL: https://issues.apache.org/jira/browse/SPARK-13590
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> As discussed in SPARK-13029, we decided to keep the current behavior that 
> sets all coefficients associated with constant feature columns to zero, 
> regardless of intercept, regularization, and standardization settings. This 
> is the same behavior as in glmnet. Since this is different from LIBSVM, we 
> should document the behavior correctly, add tests, and generate warning 
> messages if there are constant columns and `addIntercept` is false.
> cc [~coderxiang] [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13590) Document the behavior of spark.ml logistic regression when there are constant features

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13590:
--
Assignee: Yanbo Liang

> Document the behavior of spark.ml logistic regression when there are constant 
> features
> --
>
> Key: SPARK-13590
> URL: https://issues.apache.org/jira/browse/SPARK-13590
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> As discussed in SPARK-13029, we decided to keep the current behavior that 
> sets all coefficients associated with constant feature columns to zero, 
> regardless of intercept, regularization, and standardization settings. This 
> is the same behavior as in glmnet. Since this is different from LIBSVM, we 
> should document the behavior correctly, add tests, and generate warning 
> messages if there are constant columns and `addIntercept` is false.
> cc [~coderxiang] [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-14154:
---

Reopening this issue to continue the discussion.

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14565:
--
Description: Using regex is not robust and hard to maintain.

> RandomForest should use parseInt and parseDouble for feature subset size 
> instead of regexes
> ---
>
> Key: SPARK-14565
> URL: https://issues.apache.org/jira/browse/SPARK-14565
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yong Tang
>
> Using regex is not robust and hard to maintain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes

2016-04-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14565:
-

 Summary: RandomForest should use parseInt and parseDouble for 
feature subset size instead of regexes
 Key: SPARK-14565
 URL: https://issues.apache.org/jira/browse/SPARK-14565
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Xiangrui Meng
Assignee: Yong Tang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237657#comment-15237657
 ] 

Xiangrui Meng commented on SPARK-14154:
---

[~yuhaoyan] The main purpose of the initial implementation of the K-S test was 
to avoid the `zipWithIndex` in your implementation, which triggers one more 
Spark job. Did you compare the performance? Please run a benchmark on some 
large dataset and see whether it is worth keeping the initial implementation. 
Thanks!
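
A rough shape for such a benchmark, assuming an existing SparkContext `sc` (the 
dataset size and timing harness are arbitrary):

{code}
// zipWithIndex launches an extra job to count partition sizes, so timing
// it against a plain action on the same cached RDD exposes its cost.
def timeMs[T](f: => T): Long = {
  val t0 = System.nanoTime(); f; (System.nanoTime() - t0) / 1000000
}

val data = sc.parallelize(1 to 10000000, 100).map(_.toDouble)
val sorted = data.sortBy(identity).cache()
sorted.count() // materialize the cache first
println(s"zipWithIndex: ${timeMs(sorted.zipWithIndex().count())} ms")
println(s"plain count:  ${timeMs(sorted.count())} ms")
{code}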

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14324) Refactor GLMs code in SparkRWrappers

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14324.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Refactor GLMs code in SparkRWrappers
> 
>
> Key: SPARK-14324
> URL: https://issues.apache.org/jira/browse/SPARK-14324
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12566.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Critical
> Fix For: 2.0.0
>
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14563) SQLTransformer.transformSchema is not implemented correctly

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14563:
--
Description: 
`transformSchema` uses `__THIS__` as a temp table name, which would cause 
errors under HiveContext (in Spark 1.6):

{code}
org.apache.spark.sql.AnalysisException: cannot recognize input near '__THIS__' 
'' '' in join source; line 1 pos 39
{code}

It also exposes race conditions.

  was:
`transformSchema` uses `__THIS__` as a temp table name, which would cause 
errors in a pipeline.

{code}
org.apache.spark.sql.AnalysisException: cannot recognize input near '__THIS__' 
'' '' in join source; line 1 pos 39
{code}


> SQLTransformer.transformSchema is not implemented correctly
> ---
>
> Key: SPARK-14563
> URL: https://issues.apache.org/jira/browse/SPARK-14563
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> `transformSchema` uses `__THIS__` as a temp table name, which would cause 
> errors under HiveContext (in Spark 1.6):
> {code}
> org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '__THIS__' '' '' in join source; line 1 pos 39
> {code}
> It also exposes race conditions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14563) SQLTransformer.transformSchema is not implemented correctly

2016-04-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14563:
-

 Summary: SQLTransformer.transformSchema is not implemented 
correctly
 Key: SPARK-14563
 URL: https://issues.apache.org/jira/browse/SPARK-14563
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.1, 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


`transformSchema` uses `__THIS__` as a temp table name, which would cause 
errors in a pipeline.

{code}
org.apache.spark.sql.AnalysisException: cannot recognize input near '__THIS__' 
'' '' in join source; line 1 pos 39
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13600) Use approxQuantile from DataFrame stats in QuantileDiscretizer

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13600.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11553
[https://github.com/apache/spark/pull/11553]

> Use approxQuantile from DataFrame stats in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
> Fix For: 2.0.0
>
>
> For consistency and code reuse, QuantileDiscretizer should use approxQuantile 
> to find splits in the data rather than implementing its own method.
> Additionally, making this change should remedy a bug where 
> QuantileDiscretizer fails to calculate the correct splits in certain 
> circumstances, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
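
For reference, the DataFrame-stats call the issue proposes to reuse, as exposed 
on `df.stat` in Spark 2.0 (the relative-error value here is arbitrary):

{code}
// approxQuantile(column, probabilities, relativeError) returns the
// approximate quantiles, which can serve directly as bucket splits.
val splits = df.stat.approxQuantile("x", Array(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), 0.001)
{code}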



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13322:
--
Shepherd: Xiangrui Meng  (was: DB Tsai)

> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> This bug was reported by Stuti Awasthi.
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite because we do not standardize the features 
> before fitting the model; we should support feature standardization.
> Another benefit is that standardization will improve the convergence rate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14324) Refactor GLMs code in SparkRWrappers

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14324:
--
Assignee: Yanbo Liang

> Refactor GLMs code in SparkRWrappers
> 
>
> Key: SPARK-14324
> URL: https://issues.apache.org/jira/browse/SPARK-14324
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12566:
--
Shepherd: Xiangrui Meng  (was: Yanbo Liang)

> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Critical
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14462) Add the mllib-local build to maven pom

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14462.
---
Resolution: Fixed

Issue resolved by pull request 12298
[https://github.com/apache/spark/pull/12298]

> Add the mllib-local build to maven pom
> --
>
> Key: SPARK-14462
> URL: https://issues.apache.org/jira/browse/SPARK-14462
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Blocker
> Fix For: 2.0.0
>
>
> In order to separate the linear algebra and vector/matrix classes into a 
> standalone jar, we need to set up the build first. This task will create a 
> new jar called mllib-local with minimal dependencies. The test scope will 
> still depend on spark-core and spark-core-test in order to use the common 
> utilities, but the runtime will avoid any platform dependency. A couple of 
> platform-independent classes will be moved to this package to demonstrate 
> how this works. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14510) Add args-checking for LDA and StreamingKMeans

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14510:
--
Assignee: zhengruifeng

> Add args-checking for LDA and StreamingKMeans
> -
>
> Key: SPARK-14510
> URL: https://issues.apache.org/jira/browse/SPARK-14510
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Assignee: zhengruifeng
> Fix For: 2.0.0
>
>
> Add args-checking for LDA and StreamingKMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14510) Add args-checking for LDA and StreamingKMeans

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14510.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12062
[https://github.com/apache/spark/pull/12062]

> Add args-checking for LDA and StreamingKMeans
> -
>
> Key: SPARK-14510
> URL: https://issues.apache.org/jira/browse/SPARK-14510
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
> Fix For: 2.0.0
>
>
> Add args-checking for LDA and StreamingKMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14510) Add args-checking for LDA and StreamingKMeans

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14510:
--
Priority: Minor  (was: Major)

> Add args-checking for LDA and StreamingKMeans
> -
>
> Key: SPARK-14510
> URL: https://issues.apache.org/jira/browse/SPARK-14510
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add args-checking for LDA and StreamingKMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14500) Accept Dataset[_] instead of DataFrame in MLlib APIs

2016-04-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14500.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12274
[https://github.com/apache/spark/pull/12274]

> Accept Dataset[_] instead of DataFrame in MLlib APIs
> 
>
> Key: SPARK-14500
> URL: https://issues.apache.org/jira/browse/SPARK-14500
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.0.0
>
>
> In Spark 2.0, `DataFrame` is an alias of `Dataset[Row]`. MLlib API actually 
> works for other types of `Dataset`, so we should accept `Dataset[_]` instead. 
> It maps to `Dataset` in Java. This is a source compatible change.
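
The API change is mechanical on the surface; a sketch of the new-style 
signature (the object and method names are illustrative):

{code}
import org.apache.spark.sql.{DataFrame, Dataset}

object DatasetApiSketch {
  // Accepting Dataset[_] keeps source compatibility for DataFrame callers
  // (DataFrame = Dataset[Row] in 2.0) while also admitting typed Datasets.
  def fitSketch(dataset: Dataset[_]): DataFrame = dataset.toDF()
}
{code}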



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14497) Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer

2016-04-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14497:
--
Assignee: Feng Wang

> Use top instead of sortBy() to get top N frequent words as dict in 
> CountVectorizer
> --
>
> Key: SPARK-14497
> URL: https://issues.apache.org/jira/browse/SPARK-14497
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feng Wang
>Assignee: Feng Wang
> Fix For: 2.0.0
>
>
> It's not necessary to sort the whole RDD to get the top n frequent words.
> // Sort terms to select vocab
> wordCounts.sortBy(_._2, ascending = false).take(vocSize)
> We could use top() instead, since top is O(n) while sortBy is O(n log n).
> A minor side effect of top() with the default implicit Ordering on Tuple2 is 
> that terms with the same TF will be sorted in descending order in the 
> dictionary:
> (a:1), (b:1), (c:1)  => dict: [c, b, a]
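
A sketch of the replacement, assuming `wordCounts: RDD[(String, Long)]` and a 
vocabulary size `vocSize` as in the snippet above:

{code}
// top() keeps a bounded priority queue per partition and merges the
// results (roughly O(n)), instead of fully sorting the RDD (O(n log n)).
val vocab: Array[String] =
  wordCounts.top(vocSize)(Ordering.by[(String, Long), Long](_._2)).map(_._1)
{code}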



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14497) Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer

2016-04-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14497.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12265
[https://github.com/apache/spark/pull/12265]

> Use top instead of sortBy() to get top N frequent words as dict in 
> CountVectorizer
> --
>
> Key: SPARK-14497
> URL: https://issues.apache.org/jira/browse/SPARK-14497
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feng Wang
> Fix For: 2.0.0
>
>
> It's not necessary to sort the whole RDD to get the top n frequent words.
> // Sort terms to select vocab
> wordCounts.sortBy(_._2, ascending = false).take(vocSize)
> We could use top() instead, since top is O(n) while sortBy is O(n log n).
> A minor side effect of top() with the default implicit Ordering on Tuple2 is 
> that terms with the same TF will be sorted in descending order in the 
> dictionary:
> (a:1), (b:1), (c:1)  => dict: [c, b, a]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14339) Add python examples for DCT,MinMaxScaler,MaxAbsScaler

2016-04-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14339.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12063
[https://github.com/apache/spark/pull/12063]

> Add python examples for DCT,MinMaxScaler,MaxAbsScaler
> -
>
> Key: SPARK-14339
> URL: https://issues.apache.org/jira/browse/SPARK-14339
> Project: Spark
>  Issue Type: Improvement
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add three Python examples: DCT, MinMaxScaler, and MaxAbsScaler.






[jira] [Updated] (SPARK-14339) Add Python examples for DCT, MinMaxScaler, MaxAbsScaler

2016-04-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14339:
--
Assignee: zhengruifeng

> Add Python examples for DCT, MinMaxScaler, MaxAbsScaler
> -
>
> Key: SPARK-14339
> URL: https://issues.apache.org/jira/browse/SPARK-14339
> Project: Spark
>  Issue Type: Improvement
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add three Python examples: DCT, MinMaxScaler, and MaxAbsScaler.






[jira] [Resolved] (SPARK-14462) Add the mllib-local build to maven pom

2016-04-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14462.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12241
[https://github.com/apache/spark/pull/12241]

> Add the mllib-local build to maven pom
> --
>
> Key: SPARK-14462
> URL: https://issues.apache.org/jira/browse/SPARK-14462
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Blocker
> Fix For: 2.0.0
>
>
> In order to separate the linear algebra and vector/matrix classes into a 
> standalone jar, we need to set up the build first. This task will create a new 
> jar called mllib-local with minimal dependencies. The test scope will still 
> depend on spark-core and spark-core-test in order to use the common 
> utilities, but the runtime will avoid any platform dependency. A couple of 
> platform-independent classes will be moved to this package to demonstrate how 
> this works. 






[jira] [Created] (SPARK-14500) Accept Dataset[_] instead of DataFrame in MLlib APIs

2016-04-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14500:
-

 Summary: Accept Dataset[_] instead of DataFrame in MLlib APIs
 Key: SPARK-14500
 URL: https://issues.apache.org/jira/browse/SPARK-14500
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


In Spark 2.0, `DataFrame` is an alias of `Dataset[Row]`. The MLlib API actually 
works for other types of `Dataset`, so we should accept `Dataset[_]` instead. 
It maps to `Dataset` in Java. This is a source-compatible change.
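
A minimal sketch of what the widened parameter buys callers; the fit method below is hypothetical, not an actual MLlib signature:

{code:scala}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetParamSketch {
  // Before (hypothetical): def fit(dataset: DataFrame): Unit
  // After: any Dataset is accepted. A DataFrame still type-checks because
  // DataFrame = Dataset[Row], which is why the change is source compatible.
  def fit(dataset: Dataset[_]): Unit = {
    val df: DataFrame = dataset.toDF() // drop to the untyped view internally
    println(s"training on columns: ${df.columns.mkString(", ")}")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    import spark.implicits._
    val typed: Dataset[(Double, String)] = Seq((1.0, "a"), (2.0, "b")).toDS()
    fit(typed)        // a typed Dataset works...
    fit(typed.toDF()) // ...and so does a plain DataFrame
    spark.stop()
  }
}
{code}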






[jira] [Updated] (SPARK-14305) PySpark ml.clustering BisectingKMeans support export/import

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14305:
--
Assignee: Yanbo Liang

> PySpark ml.clustering BisectingKMeans support export/import
> ---
>
> Key: SPARK-14305
> URL: https://issues.apache.org/jira/browse/SPARK-14305
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-14305) PySpark ml.clustering BisectingKMeans support export/import

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14305.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12112
[https://github.com/apache/spark/pull/12112]

> PySpark ml.clustering BisectingKMeans support export/import
> ---
>
> Key: SPARK-14305
> URL: https://issues.apache.org/jira/browse/SPARK-14305
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-14324) Refactor GLMs code in SparkRWrappers

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14324:
--
Assignee: (was: Yanbo Liang)

> Refactor GLMs code in SparkRWrappers
> 
>
> Key: SPARK-14324
> URL: https://issues.apache.org/jira/browse/SPARK-14324
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Resolved] (SPARK-14303) Refactor SparkRWrappers

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14303.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12039
[https://github.com/apache/spark/pull/12039]

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Updated] (SPARK-14324) Refactor GLMs code in SparkRWrappers

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14324:
--
Summary: Refactor GLMs code in SparkRWrappers  (was: Refactor 
SparkRWrappers)

> Refactor GLMs code in SparkRWrappers
> 
>
> Key: SPARK-14324
> URL: https://issues.apache.org/jira/browse/SPARK-14324
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Updated] (SPARK-14303) Refactor k-means code in SparkRWrappers

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14303:
--
Summary: Refactor k-means code in SparkRWrappers  (was: Refactor 
SparkRWrappers)

> Refactor k-means code in SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Created] (SPARK-14324) Refactor SparkRWrappers

2016-04-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14324:
-

 Summary: Refactor SparkRWrappers
 Key: SPARK-14324
 URL: https://issues.apache.org/jira/browse/SPARK-14324
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Reporter: Xiangrui Meng
Assignee: Yanbo Liang


We use a single object `SparkRWrappers` 
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
 to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
maintain. We should refactor them into separate wrappers, like 
`AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.

The package name should be `spark.ml.r` instead of `spark.ml.api.r`.
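
For concreteness, a sketch of the per-algorithm wrapper shape this refactoring points at; the class name, fit signature, and parameters are illustrative assumptions, not the final code:

{code:scala}
package org.apache.spark.ml.r // i.e. spark.ml.r, not spark.ml.api.r

import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.DataFrame

// One small, focused wrapper per algorithm instead of a single catch-all object.
private[r] class NaiveBayesWrapperSketch(val pipeline: PipelineModel)

private[r] object NaiveBayesWrapperSketch {
  def fit(formula: String, data: DataFrame, smoothing: Double): NaiveBayesWrapperSketch = {
    val rFormula = new RFormula().setFormula(formula) // R formula -> features/label
    val naiveBayes = new NaiveBayes().setSmoothing(smoothing)
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](rFormula, naiveBayes))
      .fit(data)
    new NaiveBayesWrapperSketch(pipeline)
  }
}
{code}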






[jira] [Resolved] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11262.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9229
[https://github.com/apache/spark/pull/9229]

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Tests
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
> Fix For: 2.0.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The multilayer perceptron requires more rigorous tests and a refactoring of 
> the layer interfaces to accommodate development of new features:
> 1) Implement unit tests for gradient and loss
> 2) Refactor the internal layer interface to extract a "loss function" 
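
Item 1) typically boils down to comparing analytic gradients against central finite differences. A minimal, self-contained sketch of such a check (toy quadratic loss, not the actual MLP layers):

{code:scala}
object GradientCheck {
  // Toy loss f(w) = sum(w_i^2), whose analytic gradient is 2 * w.
  def loss(w: Array[Double]): Double = w.map(x => x * x).sum
  def grad(w: Array[Double]): Array[Double] = w.map(2 * _)

  def main(args: Array[String]): Unit = {
    val w = Array(0.3, -1.2, 0.7)
    val eps = 1e-6
    val analytic = grad(w)
    val maxErr = w.indices.map { i =>
      // Central difference: (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)
      val wPlus = w.clone();  wPlus(i) += eps
      val wMinus = w.clone(); wMinus(i) -= eps
      val numeric = (loss(wPlus) - loss(wMinus)) / (2 * eps)
      math.abs(numeric - analytic(i))
    }.max
    assert(maxErr < 1e-4, s"gradient mismatch: $maxErr")
    println(s"max |numeric - analytic| = $maxErr")
  }
}
{code}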






[jira] [Resolved] (SPARK-14295) buildReader implementation for LibSVM

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14295.
---
Resolution: Fixed

Issue resolved by pull request 12088
[https://github.com/apache/spark/pull/12088]

> buildReader implementation for LibSVM
> -
>
> Key: SPARK-14295
> URL: https://issues.apache.org/jira/browse/SPARK-14295
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-14274) Add FileFormat.prepareRead to collect necessary global information

2016-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14274.
---
Resolution: Fixed

Issue resolved by pull request 12088
[https://github.com/apache/spark/pull/12088]

> Add FileFormat.prepareRead to collect necessary global information
> --
>
> Key: SPARK-14274
> URL: https://issues.apache.org/jira/browse/SPARK-14274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> One problem with our newly introduced {{FileFormat.buildReader()}} method is 
> that it only sees pieces of input files. On the other hand, data sources like 
> CSV and LibSVM require some sort of global information:
> - CSV: the content of the header line if the {{header}} option is set to 
> true, so that we can filter out header lines within each input file. This is 
> considered global information because it's possible that the header 
> appears in the middle of a file after blocks of comments and empty lines, 
> although this is just a rare/contrived corner case.
> - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset 
> to infer the total number of features to construct the resulting 
> {{LabeledPoint}} instances.
> Unfortunately, with our current API, this kind of global information can't 
> be gathered.
> The solution proposed here is to add a {{prepareRead}} method, which accepts 
> the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which 
> contains an {{Option\[StructType\]}} for the inferred schema and a 
> {{Map\[String, Any\]}} for any gathered global information. This 
> {{ReadContext}} is then passed to {{buildReader()}}. By default, 
> {{prepareRead}} simply calls {{inferSchema}} (indeed, the inferred schema 
> itself can be considered a sort of global information).
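
A sketch of the proposed shape, with the planning types simplified to plain file paths; everything below is an assumption based on the description, not an actual Spark interface:

{code:scala}
import org.apache.spark.sql.types.StructType

// Inferred schema plus any gathered global information, handed to buildReader().
case class ReadContext(
    schema: Option[StructType],
    globalInfo: Map[String, Any]) // e.g. "header" -> <header line>, "numFeature" -> 780

trait FileFormatSketch {
  def inferSchema(files: Seq[String]): Option[StructType]

  // Default: just wrap inferSchema. Formats that need global information
  // (CSV header line, LibSVM feature count) override this and scan the input.
  def prepareRead(files: Seq[String]): ReadContext =
    ReadContext(inferSchema(files), Map.empty)

  // The per-file reader factory now receives the ReadContext gathered once.
  def buildReader(ctx: ReadContext): String => Iterator[Any]
}
{code}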






[jira] [Updated] (SPARK-14303) Refactor SparkRWrappers

2016-03-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14303:
--
Assignee: Yanbo Liang

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spakr.ml.r` instead of `spark.ml.api.r`.






[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220734#comment-15220734
 ] 

Xiangrui Meng commented on SPARK-14313:
---

[~yanboliang] Are you interested in working on this? It should cover the basic 
APIs for ml.save/ml.load in SparkR and the save/load implementation of AFTWrapper.

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220732#comment-15220732
 ] 

Xiangrui Meng commented on SPARK-14314:
---

Hold until SPARK-14303 is done.

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>






