[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r61462095
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -379,15 +379,21 @@ class RowMatrix @Since("1.0.0") (
    *
    * Note that this cannot be computed on matrices with more than 65535 columns.
    *
-   * @param k number of top principal components.
+   * @param filter either the number of top principal components or variance
+   *   retained by the minimal set of principal components.
    * @return a matrix of size n-by-k, whose columns are principal components, and
    * a vector of values which indicate how much variance each principal component
    * explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
--- End diff --

Aha, you're right, it wasn't in 1.6. This is my fault: 
https://github.com/apache/spark/commit/21b3d2a75f679b252e293000d706741dca33624a
It was never added to branch-1.6, despite the apparent intention. At this 
point I think it should be considered 2.0+, and you can fix that annotation 
here. So yes, this method was never 'released'. Still, I think we want to do 
something different with the argument anyway.





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-28 Thread psuszyns
Github user psuszyns commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r61456369
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -379,15 +379,21 @@ class RowMatrix @Since("1.0.0") (
    *
    * Note that this cannot be computed on matrices with more than 65535 columns.
    *
-   * @param k number of top principal components.
+   * @param filter either the number of top principal components or variance
+   *   retained by the minimal set of principal components.
    * @return a matrix of size n-by-k, whose columns are principal components, and
    * a vector of values which indicate how much variance each principal component
    * explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
--- End diff --

This is RowMatrix as of the 1.6.1 release: 
https://github.com/apache/spark/blob/15de51c238a7340fa81cb0b80d029a05d97bfc5c/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
Am I correct? If so, can you find a method named 
computePrincipalComponentsAndExplainedVariance there? I can't, yet on master it 
is annotated with Since("1.6.0"). Isn't that annotation false?





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r61395945
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -379,15 +379,21 @@ class RowMatrix @Since("1.0.0") (
    *
    * Note that this cannot be computed on matrices with more than 65535 columns.
    *
-   * @param k number of top principal components.
+   * @param filter either the number of top principal components or variance
+   *   retained by the minimal set of principal components.
    * @return a matrix of size n-by-k, whose columns are principal components, and
    * a vector of values which indicate how much variance each principal component
    * explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
--- End diff --

Yeah, having one method mean two things via an `Either` is too strange. At 
a minimum you would provide two overloads; and then there is no reason to 
overload versus giving them distinct, descriptive names.
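
A sketch of what two separately named entry points could look like (the trait 
wrapper and the second method name are illustrative, not from the PR):

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector}

trait PcaComputations {
  // Existing, released signature stays exactly as it is.
  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector)

  // Separately named method for the variance-threshold variant.
  def computePrincipalComponentsByVarianceRetained(
      requiredVarianceRetained: Double): (Matrix, Vector)
}
```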

I don't understand the question about unreleased APIs -- 1.6.0 was released 
a while ago, and this method takes an Int parameter there. We certainly want to 
keep the ability to set a fixed number of principal components.





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-27 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r61335359
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -379,15 +379,21 @@ class RowMatrix @Since("1.0.0") (
    *
    * Note that this cannot be computed on matrices with more than 65535 columns.
    *
-   * @param k number of top principal components.
+   * @param filter either the number of top principal components or variance
+   *   retained by the minimal set of principal components.
    * @return a matrix of size n-by-k, whose columns are principal components, and
    * a vector of values which indicate how much variance each principal component
    * explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
--- End diff --

I'm not sure about the breakage; nevertheless, I would recommend implementing a 
new method regardless. I find the method's parameter type `Either[Int, Double]` 
quite confusing.
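
For illustration, here is how calls against the proposed `Either`-based 
signature would read (hypothetical calls, assuming a `mat: RowMatrix` in scope):

```scala
// Nothing in the types says Left means "component count" and Right means
// "variance retained"; the reader has to consult the scaladoc every time.
val byCount    = mat.computePrincipalComponentsAndExplainedVariance(Left(5))
val byVariance = mat.computePrincipalComponentsAndExplainedVariance(Right(0.95))
```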





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-27 Thread psuszyns
Github user psuszyns commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r61326101
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -379,15 +379,21 @@ class RowMatrix @Since("1.0.0") (
    *
    * Note that this cannot be computed on matrices with more than 65535 columns.
    *
-   * @param k number of top principal components.
+   * @param filter either the number of top principal components or variance
+   *   retained by the minimal set of principal components.
    * @return a matrix of size n-by-k, whose columns are principal components, and
    * a vector of values which indicate how much variance each principal component
    * explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
--- End diff --

@sethah @jodersky It looks like the Since("1.6.0") annotation is false because 
this method is not available in Spark 1.6; the change was merged to master 
instead of the 1.6 branch. Do you still consider this change API-breaking given 
that it modifies an API that was never released? If so, I'll do as @jodersky 
said and introduce a new method, moving the common code into a new private one. 
I'd really like to have this feature in the MLlib version because I use it.





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-27 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r61313031
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -379,15 +379,21 @@ class RowMatrix @Since("1.0.0") (
    *
    * Note that this cannot be computed on matrices with more than 65535 columns.
    *
-   * @param k number of top principal components.
+   * @param filter either the number of top principal components or variance
+   *   retained by the minimal set of principal components.
    * @return a matrix of size n-by-k, whose columns are principal components, and
    * a vector of values which indicate how much variance each principal component
    * explains
    */
   @Since("1.6.0")
-  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+  def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
--- End diff --

I'm no expert in the ML domain, but from a user's perspective this breaks 
backward API compatibility. An alternative could be to create a new method and 
factor the behaviour shared with the current 
`computePrincipalComponentsAndExplainedVariance` out into a private utility 
method.
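
A minimal sketch of that factoring, assuming a shared private helper (all names 
are illustrative and the bodies are elided):

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector}

class RowMatrixSketch {
  // Released API, signature unchanged.
  def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) =
    computePrincipalComponentsImpl(k)

  // New entry point for the variance-threshold case.
  def computePrincipalComponentsByVarianceRetained(retained: Double): (Matrix, Vector) = {
    val (pc, variance) = computePrincipalComponentsImpl(maxComponents)
    // ... trim (pc, variance) to the minimal prefix meeting `retained` ...
    (pc, variance)
  }

  // Shared private utility holding the actual computation.
  private def computePrincipalComponentsImpl(k: Int): (Matrix, Vector) = ???
  private def maxComponents: Int = ???
}
```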





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-27 Thread sethah
Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-215110515
  
@psuszyns This introduces a breaking change to the MLlib API, which we 
should avoid since it is not strictly necessary. Looking at this more 
carefully, the simplest approach seems to be to add this only for spark.ml: 
request the full PCA from MLlib, then trim according to retained variance in 
the spark.ml fit method; a sketch follows below. I'm not sure we ought to make 
this available in MLlib at all, given that we could avoid some of the 
complexity. If we do, we need to do it in a way that does not break the APIs.
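
A sketch of the trimming step that would live in spark.ml's fit under this 
approach (names are illustrative; it keeps the minimal prefix of components 
whose cumulative explained variance reaches the threshold):

```scala
// pc(i)(j) is entry i of principal component j; explained(j) is the variance
// fraction of component j, in decreasing order.
def trimByVariance(
    pc: Array[Array[Double]],
    explained: Array[Double],
    retained: Double): (Array[Array[Double]], Array[Double]) = {
  val cumulative = explained.scanLeft(0.0)(_ + _).tail  // running totals
  val k = cumulative.indexWhere(_ >= retained) + 1      // minimal component count
  (pc.map(_.take(k)), explained.take(k))
}
```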

Also, please run the style checker, and see [Contributing to 
Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) 
for Spark-specific style guidelines.

@srowen @mengxr What do you think about this change?





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-26 Thread psuszyns
Github user psuszyns commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-214748975
  
@sethah please review my latest commit: is it close to what you had in mind?





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread sethah
Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-210622479
  
I don't believe this will break the API. You can get away without even 
changing the MLlib API by adding a private constructor, or a private call to 
the fit method, that passes in a retained-variance parameter. Also, it looks 
like it would be good to update the `computePrincipalComponentsAndExplainedVariance` 
method to alternatively accept a retained-variance parameter. Otherwise, you'd 
have to pass it a value of `k` equal to the full number of principal components 
and then trim the result manually (see the sketch below). Thoughts?
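
The manual workaround would look roughly like this (a sketch, assuming an 
existing `mat: RowMatrix`):

```scala
// Ask for every component, then discard the tail by hand.
val nFeatures = mat.numCols().toInt
val (fullPc, fullVariance) =
  mat.computePrincipalComponentsAndExplainedVariance(nFeatures)
// ... trim fullPc/fullVariance to the minimal prefix whose cumulative
// explained variance meets the required threshold ...
```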





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread psuszyns
Github user psuszyns commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-210616221
  
Well, I wanted to add this option without breaking or amending the current 
API. In my app I use it by first training the PCA with k = the number of 
features and then calling the method I added (see the sketch below). But I 
agree that it would be nicer to have the variance retained as an input 
parameter of the PCA. I'll add appropriate setters to the 'ml' version and 
another constructor to the 'mllib' version, ok?
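
That usage pattern, sketched (assuming `data: RDD[Vector]` and `numFeatures` in 
scope; `minimalByVarianceExplained` is the method this PR adds):

```scala
import org.apache.spark.mllib.feature.PCA

// Train on all components first, then trim by retained variance.
val fullModel = new PCA(numFeatures).fit(data)
val trimmed = fullModel.minimalByVarianceExplained(0.95)
```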





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread sethah
Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-210570393
  
@psuszyns I have some high-level comments. To me, it does not make sense to 
train a PCA model keeping k components and then trim by variance explained. If 
I have a dataset with 10 columns and I train a PCA model with k = 6 components, 
I retain some fraction of the variance. If I then request to trim the model by 
some fraction that is _greater_ than the variance I originally retained, the 
request is impossible to satisfy.

I think this should be implemented with two parameters, `k` and 
`retainedVariance`, where the full PCA is trained and the model is then trimmed 
by one of the two possible methods. When you set one of the params, you can 
automatically unset the other, since it doesn't make sense to use them both 
(this is done, for example, in Logistic Regression with `threshold` and 
`thresholds`); a sketch follows below. This would require changing ML _and_ 
MLlib, which is ok. Perhaps @srowen could provide some thoughts.
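
A minimal sketch of that mutual exclusion (plain vars instead of spark.ml's 
Param machinery; the class and setter names are illustrative):

```scala
class PCASketch {
  private var kOpt: Option[Int] = None
  private var retainedVarianceOpt: Option[Double] = None

  def setK(value: Int): this.type = {
    kOpt = Some(value)
    retainedVarianceOpt = None // setting k clears the competing param
    this
  }

  def setRetainedVariance(value: Double): this.type = {
    require(value > 0.0 && value <= 1.0, "retained variance must be in (0, 1]")
    retainedVarianceOpt = Some(value)
    kOpt = None // setting retainedVariance clears k
    this
  }
}
```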





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread psuszyns
Github user psuszyns commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-210553807
  
@srowen, @sethah: I've pushed a commit that's supposed to address your 
concerns





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread sethah
Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-210514637
  
It looks like this could just as well be implemented in ML instead of 
MLlib. It is my understanding that we should avoid adding new features to MLlib 
unless it's blocking an improvement in ML. That doesn't seem to be the case 
here.





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12419#discussion_r59891984
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -109,4 +111,21 @@ class PCAModel private[spark] (
         s"SparseVector or DenseVector. Instead got: ${vector.getClass}")
     }
   }
+
+  def minimalByVarianceExplained(requiredVarianceRetained: Double): PCAModel = {
+    val minFeaturesNum = explainedVariance
--- End diff --

How about `explainedVariance.values.scanLeft(0.0)(_ + _).indexWhere(_ >= requiredVarianceRetained) + 1`
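
A worked example of that expression with hypothetical variance fractions:

```scala
val explained = Array(0.5, 0.3, 0.2)
val cumulative = explained.scanLeft(0.0)(_ + _)
// cumulative == Array(0.0, 0.5, 0.8, 1.0); the leading 0.0 is scanLeft's seed.
cumulative.indexWhere(_ >= 0.7) // == 2: two components retain >= 70%
// Because of the seed element, the index already equals the component count,
// so the trailing "+ 1" above would over-count by one unless the seed is
// dropped first: cumulative.tail.indexWhere(_ >= 0.7) + 1 == 2.
```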





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12419#issuecomment-210479410
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-14661] [MLlib] trim PCAModel by require...

2016-04-15 Thread psuszyns
GitHub user psuszyns opened a pull request:

https://github.com/apache/spark/pull/12419

[SPARK-14661] [MLlib] trim PCAModel by required explained variance

## What changes were proposed in this pull request?

A new method in PCAModel that auto-trims the model to the minimal number of 
leading principal components needed to retain a required fraction of explained 
variance.


## How was this patch tested?

Unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/psuszyns/spark PCA_trimmed_by_variance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12419


commit ac1b9fe9c920171b6baf1a3bb12a4c7ea7d79ec1
Author: Piotr Suszyński 
Date:   2016-04-15T11:55:41Z

SPARK-14661: PCA trimmed by variance



