[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-09-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526510#comment-15526510
 ] 

Yanbo Liang commented on SPARK-17692:
-

cc [~mengxr] [~josephkb] [~dbtsai] [~mlnick] [~srowen]

> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  
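For readers tracking this change for the migration guide: SPARK-17389 lowers the default number of k-means|| initialization steps from 5 to 2. A minimal Scala sketch of pinning the old value explicitly if results need to match pre-2.1 behavior (the DataFrame {{training}} with a "features" column is assumed):

{code}
import org.apache.spark.ml.clustering.KMeans

// Explicitly request 5 init steps to reproduce the pre-2.1 default;
// omitting setInitSteps picks up the new default of 2 in Spark 2.1.
val kmeans = new KMeans()
  .setK(10)
  .setInitMode("k-means||")
  .setInitSteps(5)

val model = kmeans.fit(training)  // training: DataFrame with a "features" vector column
{code}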



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-09-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17692:

Description: 
This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
note those changes (if any) in the user guide's Migration Guide section. If you 
found one, please comment below and link the corresponding JIRA here.
* SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  

  was:
This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
note those changes (if any) in the user guide's Migration Guide section. If you 
found one, please comment below and link the corresponding JIRA here.
* SPARK-17389 Reduce KMeans default k-means|| init steps to 2 from 5.  


> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-09-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17692:

Description: 
This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
note those changes (if any) in the user guide's Migration Guide section. If you 
found one, please comment below and link the corresponding JIRA here.
* SPARK-17389 Reduce KMeans default k-means|| init steps to 2 from 5.  

  was:
This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
note those changes (if any) in the user guide's Migration Guide section. If you 
found one, please comment below and link the corresponding JIRA here.
* 


> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389 Reduce KMeans default k-means|| init steps to 2 from 5.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-09-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17692:

Description: 
This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
note those changes (if any) in the user guide's Migration Guide section. If you 
found one, please comment below and link the corresponding JIRA here.
* 

  was:This JIRA keeps a list of MLlib behavior changes in Spark 2.1. So we can 
remember to add them to the migration guide / release notes.


> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-09-27 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17692:
---

 Summary: Document ML/MLlib behavior changes in Spark 2.1
 Key: SPARK-17692
 URL: https://issues.apache.org/jira/browse/SPARK-17692
 Project: Spark
  Issue Type: Documentation
  Components: ML, MLlib
Reporter: Yanbo Liang
Assignee: Yanbo Liang


This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-09-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17692:

Description: This JIRA keeps a list of MLlib behavior changes in Spark 2.1. 
So we can remember to add them to the migration guide / release notes.  (was: 
This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can 
remember to add them to the migration guide / release notes.)

> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> This JIRA keeps a list of MLlib behavior changes in Spark 2.1. So we can 
> remember to add them to the migration guide / release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17428.
-
Resolution: Done
  Assignee: Yanbo Liang

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a 
> virtualenv, as Python does with conda. I have investigated and found that 
> packrat (http://rstudio.github.io/packrat/) is one of the candidates for 
> supporting a virtualenv for R. Packrat is a dependency management system for 
> R that can isolate the dependent R packages in its own private package space. 
> SparkR users can then install third-party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16356) Add testImplicits for ML unit tests and promote toDF()

2016-09-26 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-16356.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Add testImplicits for ML unit tests and promote toDF()
> --
>
> Key: SPARK-16356
> URL: https://issues.apache.org/jira/browse/SPARK-16356
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This was suggested in 
> https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968
> Currently, implicits such as {{toDF()}} are not available in 
> {{MLlibTestSparkContext}}. 
> It would be great if this class provided them so that {{toDF()}} could be used.
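As an illustration only (the names here are hypothetical and the actual patch may differ), such a helper could expose the SQL implicits through a small mixin trait backed by the suite's SparkSession:

{code}
import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

// Illustrative sketch: a mixin that exposes Dataset/DataFrame implicits
// (including toDF()) to ML test suites. Names are hypothetical.
trait TestImplicits {
  def spark: SparkSession  // provided by the test context, e.g. MLlibTestSparkContext

  protected object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}

// Inside a suite mixing in the trait:
//   import testImplicits._
//   val df = Seq((1.0, "a"), (2.0, "b")).toDF("value", "name")
{code}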



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression

2016-09-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17281.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Add treeAggregateDepth parameter for AFTSurvivalRegression
> --
>
> Key: SPARK-17281
> URL: https://issues.apache.org/jira/browse/SPARK-17281
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add treeAggregateDepth parameter for AFTSurvivalRegression.
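A minimal usage sketch, assuming the parameter is exposed through a setAggregationDepth setter following the pattern of other spark.ml estimators (the DataFrame {{training}} is assumed):

{code}
import org.apache.spark.ml.regression.AFTSurvivalRegression

// A larger treeAggregate depth can help with many partitions or very wide
// feature vectors, at the cost of extra aggregation stages.
val aft = new AFTSurvivalRegression()
  .setCensorCol("censor")
  .setAggregationDepth(3)   // assumed default is 2

val model = aft.fit(training)  // training: DataFrame with "features", "label", "censor"
{code}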



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression

2016-09-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17281:

Assignee: Weichen Xu

> Add treeAggregateDepth parameter for AFTSurvivalRegression
> --
>
> Key: SPARK-17281
> URL: https://issues.apache.org/jira/browse/SPARK-17281
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add treeAggregateDepth parameter for AFTSurvivalRegression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16356) Add testImplicits for ML unit tests and promote toDF()

2016-09-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-16356:

Shepherd: Yanbo Liang

> Add testImplicits for ML unit tests and promote toDF()
> --
>
> Key: SPARK-16356
> URL: https://issues.apache.org/jira/browse/SPARK-16356
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> This was suggested in 
> https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968
> Currently, implicits such as {{toDF()}} are not available in 
> {{MLlibTestSparkContext}}. 
> It would be great if this class provided them so that {{toDF()}} could be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16356) Add testImplicits for ML unit tests and promote toDF()

2016-09-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-16356:

Assignee: Hyukjin Kwon

> Add testImplicits for ML unit tests and promote toDF()
> --
>
> Key: SPARK-16356
> URL: https://issues.apache.org/jira/browse/SPARK-16356
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> This was suggested in 
> https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968
> Currently, implicits such as {{toDF()}} are not available in 
> {{MLlibTestSparkContext}}. 
> It would be great if this class provided them so that {{toDF()}} could be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14709) spark.ml API for linear SVM

2016-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512085#comment-15512085
 ] 

Yanbo Liang edited comment on SPARK-14709 at 9/22/16 4:27 AM:
--

[~yuhaoyan] Any update on this? I think providing a DataFrame-based SVM 
algorithm is very important to users, so it would be good to get it in as soon 
as possible. I'd like to get the implementation with OWLQN and hinge loss in 
first, and discuss an SMO version later. As [~mlnick] said, it would be better 
to gather more performance numbers and use cases for an SMO implementation, and 
it's not very hard to add a new internal implementation once we have the basic 
SVM API. I saw you already have an implementation with OWLQN and hinge loss; 
could you send the PR? If you are busy with other things, I can help, and you 
will still be the primary author of the PR. Thanks!


was (Author: yanboliang):
[~yuhaoyan] Any update about this? I think providing DataFrame-based SVM 
algorithm is very important to users, so it's better we can get it in ASAP. I'd 
like to get in the implementation with OWLQN and Hinge loss firstly, and to 
discuss SMO version later. Like [~mlnick] said, it's better to get more 
performance number and user case of SMO impl. And it's not very hard to add a 
new internal implementation after we have the basic SVM API. I saw you have a 
implementation with OWLQN and Hinge loss already, could you send the PR? If you 
are busy with other things, I can help and you are still the primary author of 
this PR. Thanks!

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM

2016-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512085#comment-15512085
 ] 

Yanbo Liang commented on SPARK-14709:
-

[~yuhaoyan] Any update on this? I think providing a DataFrame-based SVM 
algorithm is very important to users, so it would be good to get it in as soon 
as possible. I'd like to get the implementation with OWLQN and hinge loss in 
first, and discuss an SMO version later. As [~mlnick] said, it would be better 
to gather more performance numbers and use cases for an SMO implementation, and 
it's not very hard to add a new internal implementation once we have the basic 
SVM API. I saw you already have an implementation with OWLQN and hinge loss; 
could you send the PR? If you are busy with other things, I can help, and you 
will still be the primary author of the PR. Thanks!

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.
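For context on the "OWLQN + hinge loss" approach discussed in the comment above, this is an illustrative, self-contained sketch of the hinge loss and its subgradient for a single example (labels encoded as y in {-1, +1}); it is not the actual Spark implementation:

{code}
// Illustrative only: hinge loss max(0, 1 - y * w.x) and a subgradient,
// which an OWL-QN style optimizer would aggregate over the dataset.
def hingeLossAndGradient(
    weights: Array[Double],
    x: Array[Double],
    y: Double): (Double, Array[Double]) = {
  val margin = y * weights.zip(x).map { case (w, xi) => w * xi }.sum
  if (margin >= 1.0) {
    (0.0, Array.fill(weights.length)(0.0))   // correctly classified with margin
  } else {
    (1.0 - margin, x.map(xi => -y * xi))     // subgradient of max(0, 1 - y * w.x)
  }
}
{code}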



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17577) SparkR support add files to Spark job and get by executors

2016-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17577.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> SparkR support add files to Spark job and get by executors
> --
>
> Key: SPARK-17577
> URL: https://issues.apache.org/jira/browse/SPARK-17577
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> Scala/Python users can add files to a Spark job with the submit option 
> {{--files}} or with {{SparkContext.addFile()}}, and can retrieve an added 
> file with {{SparkFiles.get(filename)}}.
> We should also support this for SparkR users, since they have the same need 
> for shared dependency files. For example, SparkR users can download 
> third-party R packages to the driver first, add these files to the Spark job 
> as dependencies via this API, and then each executor can install these 
> packages with {{install.packages}}.
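For reference, this is the existing Scala/JVM side of the workflow the description refers to (the file path is illustrative and {{sc}} is an active SparkContext); the SparkR equivalent is what this ticket adds:

{code}
import org.apache.spark.SparkFiles

sc.addFile("/tmp/deps/my_pkg.tar.gz")        // ship a dependency with the job

// On the driver or any executor, resolve the local copy of the shipped file:
val localPath = SparkFiles.get("my_pkg.tar.gz")
{code}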



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-17585:
---

Assignee: Yanbo Liang

> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> In some cases users would like to add a directory as a dependency. In Scala 
> they can use {{SparkContext.addFile}} with the argument {{recursive=true}} to 
> recursively add all files under the directory, but Python users can only add 
> a single file, not a directory; we should support this as well.
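For reference, the existing Scala behavior referred to above looks like this (the directory path is illustrative and {{sc}} is an active SparkContext); the goal of this ticket is to expose the same option in PySpark:

{code}
import org.apache.spark.SparkFiles

// Ship a whole directory and everything under it to the executors.
sc.addFile("/tmp/shared-resources", recursive = true)

// On executors, the shipped files live under the SparkFiles root directory:
val root = SparkFiles.getRootDirectory()
{code}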



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17585.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> In some cases users would like to add a directory as a dependency. In Scala 
> they can use {{SparkContext.addFile}} with the argument {{recursive=true}} to 
> recursively add all files under the directory, but Python users can only add 
> a single file, not a directory; we should support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17588) java.lang.AssertionError: assertion failed: lapack.dppsv returned 105. when running glm using gaussian link function.

2016-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509001#comment-15509001
 ] 

Yanbo Liang commented on SPARK-17588:
-

[~sowen] See my comments at SPARK-11918. Thanks.

> java.lang.AssertionError: assertion failed: lapack.dppsv returned 105. when 
> running glm using gaussian link function.
> -
>
> Key: SPARK-17588
> URL: https://issues.apache.org/jira/browse/SPARK-17588
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: sai pavan kumar chitti
>Assignee: Sean Owen
>Priority: Minor
>
> Hi,
> I am getting a java.lang.AssertionError when running glm with the gaussian 
> link function on a dataset with 109 columns and 81318461 rows.
> Below is the call trace. Can someone please tell me what the issue is related 
> to and how to go about resolving it? Is it because native acceleration is not 
> working, since I am also seeing the following warning messages?
> WARN netlib.BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeRefBLAS
> WARN netlib.LAPACK: Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemLAPACK
> WARN netlib.LAPACK: Failed to load implementation from: 
> com.github.fommil.netlib.NativeRefLAPACK
> 16/09/17 13:08:13 ERROR r.RBackendHandler: fit on 
> org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   java.lang.AssertionError: assertion failed: lapack.dppsv returned 105.
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
> at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:265)
> at 
> org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:139)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> at 
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
> at 
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.sc
> thanks,
> pavan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation

2016-09-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508996#comment-15508996
 ] 

Yanbo Liang commented on SPARK-11918:
-

Cholesky decomposition is unstable for near-singular and rank-deficient 
matrices, but it is often used when the matrix A is very large and sparse 
because it is faster to compute. QR decomposition is more stable than Cholesky, 
so I think we should switch to it in the future. I will take a look at this 
issue. As a temporary fix, I think throwing a better exception to let users 
know the cause of the failure is fine. Thanks.

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>  Labels: starter
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (for example, a 0/1 label used for classification together 
> with an underdetermined system), WLS fails, while the "l-bfgs" solver can 
> still train the model. The failure is caused by the underlying LAPACK library 
> returning an error value during Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the 
> "normal" solver on the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
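Based on the note above that "l-bfgs" can still train the model, a minimal workaround sketch is to force the iterative solver instead of the "normal" (WLS/Cholesky) path (the DataFrame {{training}} is assumed):

{code}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setSolver("l-bfgs")   // "normal" triggers the WLS path that fails here

val model = lr.fit(training)  // training: DataFrame with "features" and "label"
{code}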



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17585:

Description: Users would like to add a directory as dependency in some 
cases, they can use {{SparkContext.addFile}} with argument {{recursive=true}} 
to recursively add all files under the directory by using Scala. But Python 
users can only add file not directory, we should also make it supported.  (was: 
PySpark {{SparkContext.addFile}} should support adding files recursively under 
a directory similar with Scala.
Users would like to add a directory as dependency in some cases, they can use 
{{SparkContext.addFile}} with argument {{recursive=true}} to recursively add 
all files under the directory by using Scala. But Python users can only add 
file not directory, we should also make it supported.)

> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Priority: Minor
>
> In some cases users would like to add a directory as a dependency. In Scala 
> they can use {{SparkContext.addFile}} with the argument {{recursive=true}} to 
> recursively add all files under the directory, but Python users can only add 
> a single file, not a directory; we should support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17585:

Description: 
PySpark {{SparkContext.addFile}} should support adding files recursively under 
a directory similar with Scala.
Users would like to add a directory as dependency in some cases, they can use 
{{SparkContext.addFile}} with argument {{recursive=true}} to recursively add 
all files under the directory by using Scala. But Python users can only add 
file not directory, we should also make it supported.

  was:PySpark {{SparkContext.addFile}} should support adding files recursively 
under a directory similar with Scala.


> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Priority: Minor
>
> PySpark {{SparkContext.addFile}} should support adding files recursively 
> under a directory, as Scala does.
> In some cases users would like to add a directory as a dependency. In Scala 
> they can use {{SparkContext.addFile}} with the argument {{recursive=true}} to 
> recursively add all files under the directory, but Python users can only add 
> a single file, not a directory; we should support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17577) SparkR support add files to Spark job and get by executors

2016-09-18 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17577:

Description: 
Scala/Python users can add files to Spark job by submit options {{--files}} or 
{{SparkContext.addFile()}}. Meanwhile, users can get the added file by 
{{SparkFiles.get(filename)}}.
We should also support this function for SparkR users, since they also have the 
requirements for some shared dependency files. For example, SparkR users can 
download third party R packages to driver firstly, add these files to the Spark 
job as dependency by this API and then each executor can install these packages 
by {{install.packages}}.

  was:
Scala/Python users can add files to Spark job by submit options {{--files}} or 
{{SparkContext.addFile()}}. Meanwhile, users can get the added file by 
{{SparkFiles.get(filename)}}.
We should also support this function for SparkR users, since SparkR users 
should can use shared files for each executors. For examples, SparkR users can 
download third party R packages to driver firstly, add these files to the Spark 
job by this API and then each executor can install these packages by 
{{install.packages}}.


> SparkR support add files to Spark job and get by executors
> --
>
> Key: SPARK-17577
> URL: https://issues.apache.org/jira/browse/SPARK-17577
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Scala/Python users can add files to a Spark job with the submit option 
> {{--files}} or with {{SparkContext.addFile()}}, and can retrieve an added 
> file with {{SparkFiles.get(filename)}}.
> We should also support this for SparkR users, since they have the same need 
> for shared dependency files. For example, SparkR users can download 
> third-party R packages to the driver first, add these files to the Spark job 
> as dependencies via this API, and then each executor can install these 
> packages with {{install.packages}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17577) SparkR support add files to Spark job and get by executors

2016-09-18 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17577:

Description: 
Scala/Python users can add files to Spark job by submit options {{--files}} or 
{{SparkContext.addFile()}}. Meanwhile, users can get the added file by 
{{SparkFiles.get(filename)}}.
We should also support this function for SparkR users, since SparkR users 
should can use shared files for each executors. For examples, SparkR users can 
download third party R packages to driver firstly, add these files to the Spark 
job by this API and then each executor can install these packages by 
{{install.packages}}.

  was:
Scala/Python users can add files to Spark job by submit options {{--files}} or 
{{SparkContext.addFile()}}. Meanwhile, users can get the added file by 
{{SparkFiles.get(filename)}}.
We should also support this function for SparkR users, since SparkR users may 
install third party R packages on each executors. For examples, SparkR users 
can download third party R packages to driver firstly, add these files to the 
Spark job by this API and each executor can install these packages by 
{{install.packages}}.


> SparkR support add files to Spark job and get by executors
> --
>
> Key: SPARK-17577
> URL: https://issues.apache.org/jira/browse/SPARK-17577
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Scala/Python users can add files to a Spark job with the submit option 
> {{--files}} or with {{SparkContext.addFile()}}, and can retrieve an added 
> file with {{SparkFiles.get(filename)}}.
> We should also support this for SparkR users, since SparkR users should be 
> able to use shared files on each executor. For example, SparkR users can 
> download third-party R packages to the driver first, add these files to the 
> Spark job via this API, and then each executor can install these packages 
> with {{install.packages}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-18 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17585:

Component/s: Spark Core

> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Priority: Minor
>
> PySpark {{SparkContext.addFile}} should support adding files recursively 
> under a directory, as Scala does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-18 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17585:
---

 Summary: PySpark SparkContext.addFile supports adding files 
recursively
 Key: SPARK-17585
 URL: https://issues.apache.org/jira/browse/SPARK-17585
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Yanbo Liang
Priority: Minor


PySpark {{SparkContext.addFile}} should support adding files recursively under 
a directory, as Scala does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17577) SparkR support add files to Spark job and get by executors

2016-09-17 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17577:
---

 Summary: SparkR support add files to Spark job and get by executors
 Key: SPARK-17577
 URL: https://issues.apache.org/jira/browse/SPARK-17577
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Yanbo Liang


Scala/Python users can add files to a Spark job with the submit option 
{{--files}} or with {{SparkContext.addFile()}}, and can retrieve an added file 
with {{SparkFiles.get(filename)}}.
We should also support this for SparkR users, since SparkR users may need to 
install third-party R packages on each executor. For example, SparkR users can 
download third-party R packages to the driver first, add these files to the 
Spark job via this API, and each executor can install these packages with 
{{install.packages}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-09-13 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487035#comment-15487035
 ] 

Yanbo Liang commented on SPARK-17471:
-

[~sethah] I'm sorry, I have some urgent matters to deal with these days, so 
please feel free to take over this task. Thanks!

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (which is sparse).
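An illustrative sketch of the selection heuristic requested above, modeled on the existing Vector.compressed logic and assuming the DenseMatrix.toSparse / SparseMatrix.toDense converters; the storage formulas and the CSC-only assumption are simplifications, not the final design:

{code}
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix, SparseMatrix}

// Pick whichever representation needs less storage (sketch only).
def compressed(m: Matrix): Matrix = {
  val nnz = m.numNonzeros
  val denseCost  = 8L * m.numRows * m.numCols        // numRows * numCols doubles
  val sparseCost = 12L * nnz + 4L * (m.numCols + 1)  // values + row indices + column pointers
  m match {
    case dm: DenseMatrix if sparseCost < denseCost   => dm.toSparse
    case sm: SparseMatrix if sparseCost >= denseCost => sm.toDense
    case other                                       => other  // already the cheaper form
  }
}
{code}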



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17471) Add compressed method for Matrix class

2016-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477376#comment-15477376
 ] 

Yanbo Liang edited comment on SPARK-17471 at 9/9/16 3:46 PM:
-

[~sethah] I think this task duplicates SPARK-17137, which will add compressed 
support for multinomial logistic regression coefficients. I'm working on that 
one and have some {{Matrix}} compression performance test results. I will post 
them here for discussion as soon as possible. Thanks!


was (Author: yanboliang):
[~sethah] I think this task is duplicated with SPARK-17137 which will add 
compressed support for multinomial logistic regression coefficients. I'm 
working on that one and have some {{Matrix}} compression performance test 
result. I will post them here for discussion as soon as possible. Thanks!

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477376#comment-15477376
 ] 

Yanbo Liang commented on SPARK-17471:
-

[~sethah] I think this task duplicates SPARK-17137, which will add compressed 
support for multinomial logistic regression coefficients. I'm working on that 
one and have some {{Matrix}} compression performance test results. I will post 
them here for discussion as soon as possible. Thanks!

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477150#comment-15477150
 ] 

Yanbo Liang edited comment on SPARK-17428 at 9/9/16 2:14 PM:
-

Yeah, I agree to start with something simple and iterate later. I will do some 
experiments to verify whether it works well for my use case. Thanks for all 
your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu]


was (Author: yanboliang):
Yeah, I agree to start with something simple and iterate later. I will do some 
experiments to verify whether it works well for the my use case. Thanks for all 
your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu]

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a 
> virtualenv, as Python does with conda. I have investigated and found that 
> packrat (http://rstudio.github.io/packrat/) is one of the candidates for 
> supporting a virtualenv for R. Packrat is a dependency management system for 
> R that can isolate the dependent R packages in its own private package space. 
> SparkR users can then install third-party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477150#comment-15477150
 ] 

Yanbo Liang commented on SPARK-17428:
-

Yeah, I agree to start with something simple and iterate later. I will do some 
experiments to verify whether it works well for my use case. Thanks for all 
your help! [~shivaram] [~sunrui] [~felixcheung] [~zjffdu]

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a 
> virtualenv, as Python does with conda. I have investigated and found that 
> packrat (http://rstudio.github.io/packrat/) is one of the candidates for 
> supporting a virtualenv for R. Packrat is a dependency management system for 
> R that can isolate the dependent R packages in its own private package space. 
> SparkR users can then install third-party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default

2016-09-09 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17464.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> SparkR spark.als arguments reg should be 0.1 by default
> ---
>
> Key: SPARK-17464
> URL: https://issues.apache.org/jira/browse/SPARK-17464
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> The SparkR spark.als argument {{reg}} should be 0.1 by default, to be 
> consistent with ML.
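For reference, the spark.ml default that SparkR's {{spark.als}} should match (a minimal Scala snippet):

{code}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
als.getRegParam          // 0.1 by default in spark.ml
als.setRegParam(0.1)     // equivalent explicit setting
{code}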



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17456) Utility for parsing Spark versions

2016-09-09 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17456.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Utility for parsing Spark versions
> --
>
> Key: SPARK-17456
> URL: https://issues.apache.org/jira/browse/SPARK-17456
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.1.0
>
>
> There are many hacks within Spark's codebase to identify and compare Spark 
> versions.  We should add a simple utility to standardize these code paths, 
> especially since there have been mistakes made in the past.  This will let us 
> add unit tests as well.  This initial patch will only add methods for 
> extracting major and minor versions as Int types in Scala.
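A self-contained sketch of the kind of utility described above (the actual Spark helper may use a different name or signature):

{code}
// Extract (major, minor) as Ints from a Spark version string such as "2.1.0".
def majorMinorVersion(sparkVersion: String): (Int, Int) = {
  val VersionPattern = """(\d+)\.(\d+).*""".r
  sparkVersion match {
    case VersionPattern(major, minor) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(
        s"Could not extract major/minor version from '$sparkVersion'")
  }
}

// majorMinorVersion("2.1.0-SNAPSHOT") == (2, 1)
{code}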



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17464) SparkR spark.als arguments reg should be 0.1 by default

2016-09-08 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17464:
---

 Summary: SparkR spark.als arguments reg should be 0.1 by default
 Key: SPARK-17464
 URL: https://issues.apache.org/jira/browse/SPARK-17464
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Reporter: Yanbo Liang
Priority: Minor


The SparkR spark.als argument {{reg}} should be 0.1 by default, to be 
consistent with ML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643
 ] 

Yanbo Liang edited comment on SPARK-17428 at 9/8/16 11:46 AM:
--

[~sunrui] [~shivaram] [~felixcheung] Thanks for your reply.
Yes, we can compile packages on the driver and send them to the executors, but 
this involves some issues:
* Usually the Spark job is not run as root, but we would need root privileges 
to install R packages on the executors, which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted, and 
when another job runs on that executor it may fail due to a conflict.
* The architectures of the driver and the executors may differ, so packages 
compiled on the driver may not work when shipped to the executors if they 
depend on architecture-specific code.

These issues cannot be solved by SparkR currently. I investigated and found 
that packrat can help us in this direction, but it may need more experiments 
and study to verify. If this proposal makes sense, I can work on this feature. 
Please feel free to let me know your concerns. Thanks!


was (Author: yanboliang):
[~sunrui] [~shivaram] [~felixcheung] Thanks for your reply.
Yes, we can compile packages at driver and send them to executors. But it 
involves some issues:
* Usually the Spark job is not run as root, but we need root privilege to 
install R packages on executors which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted. And 
when another job was running on that executor, it may failed due to some 
conflict.
* The architecture of driver and executor may different, so the packages 
compiled on driver may not work well when it was sending to executors if it 
dependent on some architecture-related code.

These issues can not solved by SparkR currently. I investigated and found 
packrat can help us on this direction, but may be need more experiments and 
study. If this proposal make sense, I can work on this feature. Please feel 
free to let me know what you concern about. Thanks!

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a 
> virtualenv, as Python does with conda. I have investigated and found that 
> packrat (http://rstudio.github.io/packrat/) is one of the candidates for 
> supporting a virtualenv for R. Packrat is a dependency management system for 
> R that can isolate the dependent R packages in its own private package space. 
> SparkR users can then install third-party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643
 ] 

Yanbo Liang edited comment on SPARK-17428 at 9/8/16 11:45 AM:
--

[~sunrui] [~shivaram] [~felixcheung] Thanks for your reply.
Yes, we can compile packages on the driver and send them to the executors, but 
this involves some issues:
* Usually the Spark job is not run as root, but we would need root privileges 
to install R packages on the executors, which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted, and 
when another job runs on that executor it may fail due to a conflict.
* The architectures of the driver and the executors may differ, so packages 
compiled on the driver may not work when shipped to the executors if they 
depend on architecture-specific code.

These issues cannot be solved by SparkR currently. I investigated and found 
that packrat can help us in this direction, but it may need more experiments 
and study. If this proposal makes sense, I can work on this feature. Please 
feel free to let me know your concerns. Thanks!


was (Author: yanboliang):
[~sunrui] [~shivaram] [~felixcheung] Thanks for your reply.
Yes, we can compile packages at driver and send them to executors. But it 
involves some issues:
* Usually the Spark job is not run as root, but we need root privilege to 
install R packages on executors which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted. And 
when another job was running on that executor, it may failed due to some 
conflict.
* The architecture of driver and executor may different, so the packages 
compiled on driver may not work well when it was sending to executors if it 
dependent on some architecture-related code.

These issues can not solved by SparkR currently. I investigated and found 
packrat can help us on this direction, but may be need more experiments. If 
this proposal make sense, I can work on this feature. Please feel free to let 
me know what you concern about. Thanks!

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a 
> virtualenv, as Python does with conda. I have investigated and found that 
> packrat (http://rstudio.github.io/packrat/) is one of the candidates for 
> supporting a virtualenv for R. Packrat is a dependency management system for 
> R that can isolate the dependent R packages in its own private package space. 
> SparkR users can then install third-party packages in the application scope 
> (destroyed after the application exits) and don't need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473643#comment-15473643
 ] 

Yanbo Liang commented on SPARK-17428:
-

[~sunrui] [~shivaram] [~felixcheung] Thanks for your reply.
Yes, we can compile packages at the driver and send them to the executors, but 
this involves some issues:
* Usually the Spark job is not run as root, but installing R packages on the 
executors requires root privileges, which is not permitted.
* After we run a SparkR job, the executors' R libraries will be polluted, and 
when another job runs on that executor it may fail due to conflicts.
* The architectures of the driver and executors may differ, so packages 
compiled on the driver may not work when shipped to the executors if they 
depend on architecture-specific code.

These issues cannot be solved by SparkR currently. I investigated and found 
that packrat can help in this direction, but it may need more experiments. If 
this proposal makes sense, I can work on this feature. Please feel free to let 
me know any concerns. Thanks!

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages on executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1, Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2, Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R 
> that can isolate the dependent R packages in its own private package space. 
> SparkR users could then install third-party packages in the application scope 
> (destroyed after the application exits) and would not need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469736#comment-15469736
 ] 

Yanbo Liang edited comment on SPARK-17428 at 9/7/16 6:40 AM:
-

cc [~shivaram] [~felixcheung] [~sunrui]


was (Author: yanboliang):
cc [~shivaram] [~felixcheung]

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages on executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1, Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2, Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R 
> that can isolate the dependent R packages in its own private package space. 
> SparkR users could then install third-party packages in the application scope 
> (destroyed after the application exits) and would not need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469736#comment-15469736
 ] 

Yanbo Liang commented on SPARK-17428:
-

cc [~shivaram] [~felixcheung]

> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages on executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1, Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2, Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R 
> that can isolate the dependent R packages in its own private package space. 
> SparkR users could then install third-party packages in the application scope 
> (destroyed after the application exits) and would not need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17428:

Description: 
Many users need to use third-party R packages on executors/workers, but SparkR 
cannot satisfy this requirement elegantly. For example, you have to ask the 
IT/administrators of the cluster to deploy these R packages on each 
executor/worker node, which is very inflexible.

I think we should support third-party R packages for SparkR users, as we do for 
jar packages, in the following two scenarios:
1, Users can install R packages from CRAN or a custom CRAN-like repository on 
each executor.
2, Users can load their local R packages and install them on each executor.

To achieve this goal, the first thing is to make SparkR executors support a 
virtualenv-like mechanism, similar to Python's conda. I have investigated and 
found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
support virtualenv for R. Packrat is a dependency management system for R that 
can isolate the dependent R packages in its own private package space. SparkR 
users could then install third-party packages in the application scope 
(destroyed after the application exits) and would not need to bother 
IT/administrators to install these packages manually.

I would like to know whether this makes sense.

  was:
Many users need to use third-party R packages on executors/workers, but SparkR 
cannot satisfy this requirement elegantly. For example, you have to ask the 
IT/administrators of the cluster to deploy these R packages on each 
executor/worker node, which is very inflexible.

I think we should support third-party R packages for SparkR users, as we do for 
jar packages, in the following two scenarios:
1, Users can install R packages from CRAN or a custom CRAN-like repository on 
each executor.
2, Users can load their local R packages and install them on each executor.

To achieve this goal, the first thing is to make SparkR executors support a 
virtualenv-like mechanism, similar to Python's conda. I have investigated and 
found packrat is one of the candidates to support virtualenv for R. Packrat is 
a dependency management system for R that can isolate the dependent R packages 
in its own private package space. SparkR users could then install third-party 
packages in the application scope (destroyed after the application exits) and 
would not need to bother IT/administrators to install these packages manually.


> SparkR executors/workers support virtualenv
> ---
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Many users need to use third-party R packages on executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> ask the IT/administrators of the cluster to deploy these R packages on each 
> executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1, Users can install R packages from CRAN or a custom CRAN-like repository on 
> each executor.
> 2, Users can load their local R packages and install them on each executor.
> To achieve this goal, the first thing is to make SparkR executors support a 
> virtualenv-like mechanism, similar to Python's conda. I have investigated and 
> found packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R 
> that can isolate the dependent R packages in its own private package space. 
> SparkR users could then install third-party packages in the application scope 
> (destroyed after the application exits) and would not need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17428) SparkR executors/workers support virtualenv

2016-09-07 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17428:
---

 Summary: SparkR executors/workers support virtualenv
 Key: SPARK-17428
 URL: https://issues.apache.org/jira/browse/SPARK-17428
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Yanbo Liang


Many users need to use third-party R packages on executors/workers, but SparkR 
cannot satisfy this requirement elegantly. For example, you have to ask the 
IT/administrators of the cluster to deploy these R packages on each 
executor/worker node, which is very inflexible.

I think we should support third-party R packages for SparkR users, as we do for 
jar packages, in the following two scenarios:
1, Users can install R packages from CRAN or a custom CRAN-like repository on 
each executor.
2, Users can load their local R packages and install them on each executor.

To achieve this goal, the first thing is to make SparkR executors support a 
virtualenv-like mechanism, similar to Python's conda. I have investigated and 
found packrat is one of the candidates to support virtualenv for R. Packrat is 
a dependency management system for R that can isolate the dependent R packages 
in its own private package space. SparkR users could then install third-party 
packages in the application scope (destroyed after the application exits) and 
would not need to bother IT/administrators to install these packages manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17197.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> PySpark LiR/LoR supports tree aggregation level configurable
> 
>
> Key: SPARK-17197
> URL: https://issues.apache.org/jira/browse/SPARK-17197
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA 
> makes PySpark support the same feature.
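
For illustration, a minimal PySpark sketch (not taken from the patch itself) of 
how the configurable aggregation depth is expected to look; the parameter name 
{{aggregationDepth}} follows the Scala API from SPARK-17090, and the toy data 
below is made up:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("aggregation-depth-sketch").getOrCreate()

# Tiny made-up dataset, just so the estimator has something to fit.
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)),
     (0.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(0.2, 1.3)),
     (0.0, Vectors.dense(3.0, 0.5))],
    ["label", "features"])

# aggregationDepth controls the depth of the treeAggregate used to sum gradient
# and loss contributions across partitions; the default is 2, and larger values
# can help with very many partitions or very wide feature vectors.
lr = LogisticRegression(maxIter=10, aggregationDepth=3)
model = lr.fit(df)
print(model.coefficients)
{code}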



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14378.
-
Resolution: Done

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436529#comment-15436529
 ] 

Yanbo Liang commented on SPARK-14378:
-

Yes, I think we can resolve this as DONE. Thanks!

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8519) Blockify distance computation in k-means

2016-08-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-8519:
---
Comment: was deleted

(was: User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306)

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. This requires updating the 
> implementation to compute distances in blocks. Even for sparse data, we might 
> see some performance gain.
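
To make the idea concrete, here is a small NumPy sketch (illustrative only, not 
Spark's implementation) of how all point-to-center distances for one block of 
points collapse into a single matrix-matrix product, using 
||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2:

{code:python}
import numpy as np

def block_squared_distances(points, centers):
    # points: (n, d) block of points; centers: (k, d) cluster centers
    p_sq = np.sum(points * points, axis=1)[:, np.newaxis]    # (n, 1)
    c_sq = np.sum(centers * centers, axis=1)[np.newaxis, :]  # (1, k)
    cross = points @ centers.T                                # BLAS level-3 GEMM, (n, k)
    return p_sq - 2.0 * cross + c_sq                          # (n, k) squared distances

points = np.random.rand(1000, 16)
centers = np.random.rand(8, 16)
assignments = np.argmin(block_squared_distances(points, centers), axis=1)
{code}

The heavy lifting is the single GEMM call, which is exactly where BLAS Level 3 
routines shine compared with computing each pairwise distance separately.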



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14381:

Fix Version/s: (was: 2.1.0)

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436268#comment-15436268
 ] 

Yanbo Liang commented on SPARK-14381:
-

Resolved this, thanks for working on it.

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14381.
-
   Resolution: Done
 Assignee: Xusen Yin
Fix Version/s: 2.1.0

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
> Fix For: 2.1.0
>
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:30 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** PMML SPARK-11239
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413
* StreamingLinearRegression


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** PMML SPARK-11237
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413
* StreamingLinearRegression

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:29 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** PMML SPARK-11237
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413
* StreamingLinearRegression


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:26 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-14378:
---

Assignee: Yanbo Liang

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:25 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel 
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang commented on SPARK-14378:
-

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel 
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM:
--

Exposing a {{family}} or similar parameter sounds good to me.
One question:
{quote}
When the family is set to "binomial" we produce normal logistic regression with 
pivoting and when it is set to "multinomial" (default) it produces logistic 
regression with pivoting. 
{quote}
Should it be {{when it is set to "multinomial" (default) it produces logistic 
regression {color:red}without{color} pivoting}} ? Thanks!
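
For reference, the two parameterizations being discussed, in standard notation 
(a generic definition, not Spark-specific code):

{noformat}
Un-pivoted ("multinomial") softmax over K classes:
  P(y = k | x) = \exp(\beta_{0k} + \beta_k^\top x) / \sum_{j=1}^{K} \exp(\beta_{0j} + \beta_j^\top x)

Pivoted parameterization: one class (say K) is the reference, with \beta_{0K} = 0
and \beta_K = 0. For K = 2 this reduces to ordinary binary logistic regression:
  P(y = 1 | x) = 1 / (1 + \exp(-(\beta_{01} + \beta_1^\top x)))
{noformat}

The un-pivoted form is over-parameterized (shifting every \beta_j by the same 
vector leaves the probabilities unchanged), which is why the pivoted and 
un-pivoted fits can give different coefficients for the same binary problem 
once regularization is applied.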



was (Author: yanboliang):
Exposing a {{family}} or similar parameter sounds good to me.

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM:
--

Exposing a {{family}} or similar parameter sounds good to me.
One more question:
{quote}
When the family is set to "binomial" we produce normal logistic regression with 
pivoting and when it is set to "multinomial" (default) it produces logistic 
regression with pivoting. 
{quote}
Should it be {{when it is set to "multinomial" (default) it produces logistic 
regression {color:red}without{color} pivoting}} ? Thanks!



was (Author: yanboliang):
Exposing a {{family}} or similar parameter sounds good to me.
One question:
{quote}
When the family is set to "binomial" we produce normal logistic regression with 
pivoting and when it is set to "multinomial" (default) it produces logistic 
regression with pivoting. 
{quote}
Should it be {{when it is set to "multinomial" (default) it produces logistic 
regression {color:red}without{color} pivoting}} ? Thanks!


> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:14 AM:
--

Exposing a {{family}} or similar parameter sounds good to me.


was (Author: yanboliang):
Exposing a {{family}} or similar parameter to control pivoting sounds good to 
me.

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:12 AM:
--

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But this 
will introduce a breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!-


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But this 
will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!-

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang commented on SPARK-17163:
-

Exposing a {{family}} or similar parameter to control pivoting sounds good to 
me.
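
A rough sketch of how such a parameter could look from PySpark once the unified 
API lands; the name {{family}} and its values here are assumptions taken from 
this discussion, not the finalized API:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lor-family-sketch").getOrCreate()

# Hypothetical once SPARK-17163 lands:
#   "binomial"    -> pivoted binary logistic regression (single coefficient vector)
#   "multinomial" -> un-pivoted softmax regression (one coefficient vector per class)
blor = LogisticRegression(family="binomial")
mlor = LogisticRegression(family="multinomial")
print(blor.getOrDefault("family"), mlor.getOrDefault("family"))
{code}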

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434798#comment-15434798
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 12:12 PM:
---

Thinking more about this problem, I have changed my mind and now support 
consolidating MLOR and LOR into one, since I saw there is a lot of duplicated 
code between them. I think it's worth making the breaking change; otherwise it 
will require extra effort to maintain both. Thanks!


was (Author: yanboliang):
Thinking more about this problem, I have changed my mind and now support 
consolidating MLOR and LOR into one, since I saw there is a lot of duplicated 
code between them. I think it's worth making the breaking change; otherwise it 
will require effort to maintain both. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 12:11 PM:
---

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But this 
will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!-


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!-

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 12:10 PM:
---

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!-


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434798#comment-15434798
 ] 

Yanbo Liang commented on SPARK-17163:
-

Thinking more about this problem, I have changed my mind and now support 
consolidating MLOR and LOR into one, since I saw there is a lot of duplicated 
code between them. I think it's worth making the breaking change; otherwise it 
will require effort to maintain both. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 7:54 AM:
--

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs.
* Model store/load compatibility.

I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 7:52 AM:
--

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs.
* Model store/load compatibility.

I would prefer to keep LOR and MLOR as different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This will be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this will introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, it will also 
introduce breaking changes for these APIs.
* Model store/load compatibility.

I would prefer to keep LOR and MLOR in different APIs, but I do not hold this 
opinion very strongly if you have a better proposal. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 7:50 AM:
--

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be more or less consistent with other ML 
models such as {{NaiveBayesModel}}, which also supports multi-class 
classification. But this would introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and {{setThresholds}} 
for binary logistic regression, and they have some interactions. If we make MLOR 
and LOR share the old LOR code base, it will also introduce breaking changes for 
these APIs.
* Model store/load compatibility.

I prefer to keep LOR and MLOR in different APIs, but I don't hold that opinion 
very strongly if you have a better proposal. Thanks!


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But this 
would introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and {{setThresholds}} 
for binary logistic regression, and they have some interactions. If we make MLOR 
and LOR share the old LOR code base, it will also introduce breaking changes for 
these APIs.
* Model store/load compatibility.

I prefer to keep LOR and MLOR in different APIs, but I don't hold that opinion 
very strongly if you have a better proposal. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/24/16 7:49 AM:
--

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But this 
would introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and {{setThresholds}} 
for binary logistic regression, and they have some interactions. If we make MLOR 
and LOR share the old LOR code base, it will also introduce breaking changes for 
these APIs.
* Model store/load compatibility.

I prefer to keep LOR and MLOR in different APIs, but I don't hold that opinion 
very strongly if you have a better proposal. Thanks!


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. 
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and {{setThresholds}} 
for binary logistic regression, and they have some interactions. If we make MLOR 
and LOR share the old LOR code base, it will also introduce breaking changes for 
these APIs.
* Model store/load compatibility.

I prefer to keep LOR and MLOR in different APIs, but I don't hold that opinion 
very strongly if you have a better proposal. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang commented on SPARK-17163:
-

I think it's hard to unify binary and multinomial logistic regression if we do 
not make any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. 
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and {{setThresholds}} 
for binary logistic regression, and they have some interactions. If we make MLOR 
and LOR share the old LOR code base, it will also introduce breaking changes for 
these APIs.
* Model store/load compatibility.

I prefer to keep LOR and MLOR in different APIs, but I don't hold that opinion 
very strongly if you have a better proposal. Thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17197:

Priority: Minor  (was: Major)

> PySpark LiR/LoR supports tree aggregation level configurable
> 
>
> Key: SPARK-17197
> URL: https://issues.apache.org/jira/browse/SPARK-17197
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA 
> is to make PySpark support the same feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17197:
---

 Summary: PySpark LiR/LoR supports tree aggregation level 
configurable
 Key: SPARK-17197
 URL: https://issues.apache.org/jira/browse/SPARK-17197
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang


SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA 
is to make PySpark support the same feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11215) Add multiple columns support to StringIndexer

2016-08-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-11215:
---

Assignee: Yanbo Liang

> Add multiple columns support to StringIndexer
> -
>
> Key: SPARK-11215
> URL: https://issues.apache.org/jira/browse/SPARK-11215
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add multiple-column support to StringIndexer so that users can transform 
> multiple input columns to multiple output columns simultaneously. See the 
> discussion in SPARK-8418.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17169) To use scala macros to update code when SharedParamsCodeGen.scala changed

2016-08-22 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430282#comment-15430282
 ] 

Yanbo Liang commented on SPARK-17169:
-

Meanwhile, it would be better if we could also do compile-time code generation 
for the Python params, that is, run 
{{python _shared_params_code_gen.py > shared.py}} automatically.

> To use scala macros to update code when SharedParamsCodeGen.scala changed
> -
>
> Key: SPARK-17169
> URL: https://issues.apache.org/jira/browse/SPARK-17169
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Qian Huang
>Priority: Minor
>
> As commented in the file SharedParamsCodeGen.scala, we have to manually run
> build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
> to generate and update it.
> It would be better to do compile-time code generation for this using Scala 
> macros rather than running the script as described above. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-21 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430009#comment-15430009
 ] 

Yanbo Liang commented on SPARK-17086:
-

We should not throw an exception in this case. If the number of distinct input 
values is less than {{numBuckets}}, we should simply return an array of the 
distinct elements as splits. But we should not actually compute the number of 
distinct input elements, which is very expensive; instead we can collapse 
adjacent splits produced by {{approxQuantile}} that are equal.
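
As a minimal illustrative sketch of the collapsing idea (not the actual 
{{QuantileDiscretizer}} code): keep a candidate split only when it differs from 
the previous one, so the final splits stay strictly increasing.
{code}
// Candidate splits as approxQuantile might return them for low-cardinality data.
val candidates = Array(Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0,
  Double.PositiveInfinity)

// Keep a candidate only if it differs from its predecessor.
val splits = candidates.zipWithIndex.collect {
  case (s, i) if i == 0 || s != candidates(i - 1) => s
}
// splits: Array(-Infinity, 1.0, 2.0, 3.0, Infinity)
{code}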

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set

2016-08-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-15018.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> PySpark ML Pipeline raises unclear error when no stages set
> ---
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.1.0
>
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} needs to be an empty list and {{getStages}} should call 
> {{getOrDefault}}.
> Also, since the default value of {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if passed in as a keyword argument. 
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}
> However, since there is no default value in the Scala implementation, 
> assigning a default here is not needed and should be cleaned up.  The pydocs 
> should better indicate that stages is required to be a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17138) Python API for multinomial logistic regression

2016-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429258#comment-15429258
 ] 

Yanbo Liang commented on SPARK-17138:
-

[~WeichenXu123] Please hold off on this task, since SPARK-17163 discusses 
unifying the multinomial and binary logistic regression interfaces, which may 
affect the Python API. Please wait for SPARK-17163 to get merged first. Thanks!

> Python API for multinomial logistic regression
> --
>
> Key: SPARK-17138
> URL: https://issues.apache.org/jira/browse/SPARK-17138
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, 
> we should make a Python API for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2016-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429255#comment-15429255
 ] 

Yanbo Liang commented on SPARK-17137:
-

Yes, I will do some performance tests to weigh the trade-off. Thanks.

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization is used, 
> it may be more efficient to store coefficients in compressed format. We can 
> add this option to MLOR and perhaps do some performance tests to verify 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2016-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429253#comment-15429253
 ] 

Yanbo Liang commented on SPARK-17136:
-

Yes, only first-order optimizers can scale well in the number of features, so 
only this case needs to be taken into consideration. I recently worked on 
SPARK-10078 to support vector-free L-BFGS as an optimizer for Spark, which also 
involves the design of an optimizer interface, so I can give this issue a try 
too.
I will first investigate how other packages in Python/R/Matlab define such an 
interface, post the findings here, and then we can discuss how to design the 
optimizer interface for Spark. Thanks! 
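
As a purely hypothetical sketch of what such a pluggable interface could look 
like (the trait and method names below are illustrative, not an existing Spark 
or breeze API): the algorithm supplies a differentiable loss and an initial 
point, and the optimizer returns the solution.
{code}
// Hypothetical interface sketch; not part of Spark ML.
trait DifferentiableLoss {
  def loss(coefficients: Array[Double]): Double
  def gradient(coefficients: Array[Double]): Array[Double]
}

trait MLOptimizer {
  def optimize(loss: DifferentiableLoss, initial: Array[Double]): Array[Double]
}
{code}
Comparing a shape like this against how the Python/R/Matlab packages define 
their optimizer callbacks should make the investigation concrete.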

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-08-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429196#comment-15429196
 ] 

Yanbo Liang commented on SPARK-17134:
-

[~qhuang] Please feel free to take this task and do the performance 
investigation. Thanks! 

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN

2016-08-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17141.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> MinMaxScaler behaves weird when min and max have the same value and some 
> values are NaN
> ---
>
> Key: SPARK-17141
> URL: https://issues.apache.org/jira/browse/SPARK-17141
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2, 2.0.0
> Environment: Databrick's Community, Spark 2.0 + Scala 2.10
>Reporter: Alberto Bonsanto
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> When you have a {{DataFrame}} with a column named {{features}}, which is a 
> {{DenseVector}}, and the *maximum* and *minimum* are the same value and some 
> values are {{Double.NaN}}, the NaN values get replaced by 0.5, but they should 
> keep their original value, I believe.
> I know how to fix it, but I haven't ever made a pull request. You can check 
> the bug in this 
> [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN

2016-08-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-17141:
---

Assignee: Yanbo Liang

> MinMaxScaler behaves weird when min and max have the same value and some 
> values are NaN
> ---
>
> Key: SPARK-17141
> URL: https://issues.apache.org/jira/browse/SPARK-17141
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2, 2.0.0
> Environment: Databrick's Community, Spark 2.0 + Scala 2.10
>Reporter: Alberto Bonsanto
>Assignee: Yanbo Liang
>Priority: Minor
>
> When you have a {{DataFrame}} with a column named {{features}}, which is a 
> {{DenseVector}}, and the *maximum* and *minimum* are the same value and some 
> values are {{Double.NaN}}, the NaN values get replaced by 0.5, but they should 
> keep their original value, I believe.
> I know how to fix it, but I haven't ever made a pull request. You can check 
> the bug in this 
> [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN

2016-08-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427860#comment-15427860
 ] 

Yanbo Liang commented on SPARK-17141:
-

In the existing code, {{MinMaxScaler}} handles NaN values inconsistently.
* If a column has a constant value, that is max == min, the {{MinMaxScalerModel}} 
transformation will output 0.5 for all rows, even when the original value is NaN.
* Otherwise, the value will remain NaN after transformation.
I think we should unify the behavior by keeping the NaN value in all conditions, 
since we don't know how to transform a NaN value. In Python's sklearn, an 
exception is thrown when there is NaN in the dataset.
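
As a minimal sketch of the proposed unified behavior (illustrative only, 
assuming the default [0, 1] output range; not the actual {{MinMaxScalerModel}} 
code):
{code}
// Rescale a single value given the column's observed min/max.
def rescale(x: Double, colMin: Double, colMax: Double): Double = {
  if (x.isNaN) Double.NaN                    // proposed: keep NaN in all cases
  else if (colMax == colMin) 0.5             // constant column: midpoint of [0, 1]
  else (x - colMin) / (colMax - colMin)      // standard min-max scaling to [0, 1]
}
{code}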

> MinMaxScaler behaves weird when min and max have the same value and some 
> values are NaN
> ---
>
> Key: SPARK-17141
> URL: https://issues.apache.org/jira/browse/SPARK-17141
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2, 2.0.0
> Environment: Databrick's Community, Spark 2.0 + Scala 2.10
>Reporter: Alberto Bonsanto
>Priority: Minor
>
> When you have a {{DataFrame}} with a column named {{features}}, which is a 
> {{DenseVector}}, and the *maximum* and *minimum* are the same value and some 
> values are {{Double.NaN}}, the NaN values get replaced by 0.5, but they should 
> keep their original value, I believe.
> I know how to fix it, but I haven't ever made a pull request. You can check 
> the bug in this 
> [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN

2016-08-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427860#comment-15427860
 ] 

Yanbo Liang edited comment on SPARK-17141 at 8/19/16 9:01 AM:
--

In the existing code, {{MinMaxScaler}} handles NaN values inconsistently.
* If a column has a constant value, that is max == min, the {{MinMaxScalerModel}} 
transformation will output 0.5 for all rows, even when the original value is NaN.
* Otherwise, the value will remain NaN after transformation.

I think we should unify the behavior by keeping the NaN value in all conditions, 
since we don't know how to transform a NaN value. In Python's sklearn, an 
exception is thrown when there is NaN in the dataset.


was (Author: yanboliang):
In the existing code, {{MinMaxScaler}} handles NaN values inconsistently.
* If a column has a constant value, that is max == min, the {{MinMaxScalerModel}} 
transformation will output 0.5 for all rows, even when the original value is NaN.
* Otherwise, the value will remain NaN after transformation.
I think we should unify the behavior by keeping the NaN value in all conditions, 
since we don't know how to transform a NaN value. In Python's sklearn, an 
exception is thrown when there is NaN in the dataset.

> MinMaxScaler behaves weird when min and max have the same value and some 
> values are NaN
> ---
>
> Key: SPARK-17141
> URL: https://issues.apache.org/jira/browse/SPARK-17141
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2, 2.0.0
> Environment: Databrick's Community, Spark 2.0 + Scala 2.10
>Reporter: Alberto Bonsanto
>Priority: Minor
>
> When you have a {{DataFrame}} with a column named {{features}}, which is a 
> {{DenseVector}}, and the *maximum* and *minimum* are the same value and some 
> values are {{Double.NaN}}, the NaN values get replaced by 0.5, but they should 
> keep their original value, I believe.
> I know how to fix it, but I haven't ever made a pull request. You can check 
> the bug in this 
> [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN

2016-08-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17141:

Priority: Minor  (was: Trivial)

> MinMaxScaler behaves weird when min and max have the same value and some 
> values are NaN
> ---
>
> Key: SPARK-17141
> URL: https://issues.apache.org/jira/browse/SPARK-17141
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2, 2.0.0
> Environment: Databrick's Community, Spark 2.0 + Scala 2.10
>Reporter: Alberto Bonsanto
>Priority: Minor
>
> When you have a {{DataFrame}} with a column named {{features}}, which is a 
> {{DenseVector}}, and the *maximum* and *minimum* are the same value and some 
> values are {{Double.NaN}}, the NaN values get replaced by 0.5, but they should 
> keep their original value, I believe.
> I know how to fix it, but I haven't ever made a pull request. You can check 
> the bug in this 
> [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15018) PySpark ML Pipeline fails when no stages set

2016-08-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15018:

Shepherd: Yanbo Liang
Assignee: Bryan Cutler

> PySpark ML Pipeline fails when no stages set
> 
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} should be added to the default param list and 
> {{getStages}} should call {{getOrDefault}}.
> Also, since the default value of {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if passed in as a keyword argument. 
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427563#comment-15427563
 ] 

Yanbo Liang commented on SPARK-17137:
-

I think we should provide a transparent interface to users rather than exposing 
a param to control whether to output dense or sparse coefficients. Spark MLlib's 
{{Vector.compressed}} returns a vector in either dense or sparse format, 
whichever uses less storage. I would like to do the performance tests for this 
issue. Thanks!
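
A small sketch of that behavior with {{Vector.compressed}} (the output types 
noted in the comments are what I'd expect; worth verifying):
{code}
import org.apache.spark.ml.linalg.Vectors

// Mostly zeros: the sparse encoding is smaller, so compressed yields a SparseVector.
val sparse = Vectors.dense(0.0, 0.0, 0.0, 0.0, 5.0).compressed

// Mostly non-zeros: the dense encoding is smaller, so compressed yields a DenseVector.
val dense = Vectors.dense(1.0, 2.0, 3.0, 4.0, 5.0).compressed
{code}
If MLOR stores its coefficients the same way, callers get the smaller 
representation automatically, with no extra dense/sparse param.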

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization is used, 
> it may be more efficient to store coefficients in compressed format. We can 
> add this option to MLOR and perhaps do some performance tests to verify 
> improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427539#comment-15427539
 ] 

Yanbo Liang commented on SPARK-17136:
-

I would like to know whether users' own optimizers would have a standard API 
similar to breeze's {{LBFGS}}, or something else?

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427529#comment-15427529
 ] 

Yanbo Liang edited comment on SPARK-17134 at 8/19/16 3:04 AM:
--

This is interesting. We are also trying to use BLAS to accelerate linear 
algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some 
basic performance test results. I would like to contribute to this task after 
SPARK-7159 is finished. Thanks!


was (Author: yanboliang):
This is interesting. We are also trying to use BLAS to accelerate linear 
algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some 
basic performance test results. I would like to contribute to this task. Thanks!

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427529#comment-15427529
 ] 

Yanbo Liang commented on SPARK-17134:
-

This is interesting. We are also trying to use BLAS to accelerate linear 
algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some 
basic performance test results. I would like to contribute to this task. Thanks!

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426254#comment-15426254
 ] 

Yanbo Liang edited comment on SPARK-17086 at 8/18/16 10:49 AM:
---

[~sowen] 
The bucket defined by [1.0, 1.0) will only receive the value 1.0; I think this 
scenario is OK. But if we provide the splits as {{[-Infinity, 1.0, 1.0, 1.0, 
2.0, 2.0, 2.0, 3.0, 3.0, Infinity]}}, it will output {{[-Infinity, 1.0), 1.0, 
1.0, [1.0, 2.0), 2.0, 2.0, [2.0, 3.0), 3.0, [3.0, Infinity]}}. 
From the documentation, {{QuantileDiscretizer}} takes a column with continuous 
features and outputs a column with binned categorical features. So I think it 
does not make sense to put the same continuous value into different 
categories. Thanks.


was (Author: yanboliang):
[~sowen] 
The bucket defined by [1.0, 1.0) will only receive the value 1.0; I think this 
scenario is OK. But if we provide the splits as {{[-Infinity, 1.0, 1.0, 1.0, 
2.0, 2.0, 2.0, 3.0, 3.0, Infinity]}}, it will output {{[-Infinity, 1.0), 1.0, 
1.0, [1.0, 2.0), 2.0, 2.0, [2.0, 3.0), [3.0, 3.0), [3.0, Infinity]}}. 
From the documentation, {{QuantileDiscretizer}} takes a column with continuous 
features and outputs a column with binned categorical features. So I think it 
does not make sense to put the same continuous value into different 
categories. Thanks.

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426254#comment-15426254
 ] 

Yanbo Liang commented on SPARK-17086:
-

[~sowen] 
The bucket defined by [1.0, 1.0) will only receive the value 1.0; I think this 
scenario is OK. But if we provide the splits as {{[-Infinity, 1.0, 1.0, 1.0, 
2.0, 2.0, 2.0, 3.0, 3.0, Infinity]}}, it will output {{[-Infinity, 1.0), 1.0, 
1.0, [1.0, 2.0), 2.0, 2.0, [2.0, 3.0), [3.0, 3.0), [3.0, Infinity]}}. 
From the documentation, {{QuantileDiscretizer}} takes a column with continuous 
features and outputs a column with binned categorical features. So I think it 
does not make sense to put the same continuous value into different 
categories. Thanks.

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425949#comment-15425949
 ] 

Yanbo Liang commented on SPARK-17086:
-

If the number of distinct input values is less than {{numBuckets}}, the data 
should not be split into that many buckets. We should figure out a proper way 
to identify this condition and throw a corresponding exception.

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> 
>
> Key: SPARK-17086
> URL: https://issues.apache.org/jira/browse/SPARK-17086
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
> .setInputCol("intData")
> .setOutputCol("intData_bin")
> .setNumBuckets(10)
> .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable

2016-08-17 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425913#comment-15425913
 ] 

Yanbo Liang commented on SPARK-17090:
-

Making the aggregation depth configurable is necessary when scaling 
Linear/Logistic Regression to high dimensions. I vote to expose an expert param 
to make it configurable.
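
The knob in question is the {{depth}} argument of {{treeAggregate}}; a toy 
sketch of the kind of aggregation LiR/LoR perform, assuming {{sc}} is an 
existing {{SparkContext}}:
{code}
// Sum one million "gradient" values with a deeper aggregation tree, so fewer
// partial results are sent straight to the driver.
val gradients = sc.parallelize(1 to 1000000).map(_.toDouble)
val total = gradients.treeAggregate(0.0)(
  seqOp = (acc, g) => acc + g,
  combOp = (a, b) => a + b,
  depth = 4)  // default depth is 2; an expert param would expose this
{code}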

> Make tree aggregation level in linear/logistic regression configurable
> --
>
> Key: SPARK-17090
> URL: https://issues.apache.org/jira/browse/SPARK-17090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Linear/logistic regression use treeAggregate with default aggregation depth 
> for collecting coefficient gradient updates to the driver. For high 
> dimensional problems, this can cause OOM errors on the driver. We should make 
> it configurable, perhaps via an expert param, so that users can avoid this 
> problem if their data has many features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16993) model.transform without label column in random forest regression

2016-08-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422384#comment-15422384
 ] 

Yanbo Liang commented on SPARK-16993:
-

[~dulajrajitha] I cannot reproduce your reported issue; the following code 
works well.
{code}
val data = 
spark.read.format("libsvm").load("/Users/yliang/data/trunk0/spark/data/mllib/sample_libsvm_data.txt")

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

val trainingData = data
val testData = data.drop("label")

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

val model = pipeline.fit(trainingData)

val predictions = model.transform(testData)

predictions.select("prediction", "features").show(5)
{code}
Could you tell me whether this code snippet matches your issue? If so, I think 
it's not a bug. Thanks!

> model.transform without label column in random forest regression
> 
>
> Key: SPARK-16993
> URL: https://issues.apache.org/jira/browse/SPARK-16993
> Project: Spark
>  Issue Type: Question
>  Components: Java API, ML
>Reporter: Dulaj Rajitha
>
> I need to use a separate data set for prediction (not as shown in the 
> example's training data split).
> But that data does not have the label column (since it is the data whose 
> labels need to be predicted).
> However, model.transform reports that the label column is missing:
> org.apache.spark.sql.AnalysisException: cannot resolve 'label' given input 
> columns: [id,features,prediction]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17048) ML model read for custom transformers in a pipeline does not work

2016-08-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422365#comment-15422365
 ] 

Yanbo Liang commented on SPARK-17048:
-

[~taras.matyashov...@gmail.com] Would you mind sharing your code or providing a 
simple example so that others can help you diagnose this issue? Thanks!

> ML model read for custom transformers in a pipeline does not work 
> --
>
> Key: SPARK-17048
> URL: https://issues.apache.org/jira/browse/SPARK-17048
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
> Java API
>Reporter: Taras Matyashovskyy
>  Labels: easyfix, features
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> 0. Use Java API :( 
> 1. Create any custom ML transformer
> 2. Make it MLReadable and MLWritable
> 3. Add to pipeline
> 4. Evaluate model, e.g. CrossValidationModel, and save results to disk
> 5. For custom transformer you can use DefaultParamsReader and 
> DefaultParamsWriter, for instance 
> 6. Load model from saved directory
> 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
> Evaluator, etc.
> 8. Your custom transformer will fail with NPE
> Reason:
> ReadWrite.scala:447
> cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
> In Java this only works for static methods.
> As we are implementing MLReadable or MLWritable, this call should be an 
> instance method call. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17033.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test on a dataset with 200 features 
> and 1M instances, I found a 20% performance improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16934) Update LogisticCostAggregator serialization code to make it consistent with LinearRegression

2016-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-16934.
-
  Resolution: Fixed
Assignee: Weichen Xu
   Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> Update LogisticCostAggregator serialization code to make it consistent with 
> LinearRegression
> 
>
> Key: SPARK-16934
> URL: https://issues.apache.org/jira/browse/SPARK-16934
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Update LogisticCostAggregator serialization code to make it consistent with 
> LinearRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Description: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 20% increased performance. 
 (was: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 15% increased performance.)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there are 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Description: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 15% increased performance. 
 (was: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 20% increased performance.)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there are 15% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Description: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there is 20% increased performance.  
(was: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 20% increased performance.)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there is 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17033:
---

 Summary: GaussianMixture should use treeAggregate to improve 
performance
 Key: SPARK-17033
 URL: https://issues.apache.org/jira/browse/SPARK-17033
 Project: Spark
  Issue Type: Improvement
Reporter: Yanbo Liang
Priority: Minor


{{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
improve performance and scalability. In my test on a dataset with 200 features 
and 1M instances, I found a 20% performance improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


