[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 Thanks a lot @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22784 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98178/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98178/testReport)** for PR 22784 at commit [`0effc85`](https://github.com/apache/spark/commit/0effc85ccfc831bcc4c469b4a4c1d8db26fab72e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98178/testReport)** for PR 22784 at commit [`0effc85`](https://github.com/apache/spark/commit/0effc85ccfc831bcc4c469b4a4c1d8db26fab72e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98164/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98164/testReport)** for PR 22784 at commit [`2b7ee7b`](https://github.com/apache/spark/commit/2b7ee7b0a6d2cbcc159826d8dbe286a4a144d463). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98161/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98161/testReport)** for PR 22784 at commit [`0d9eea8`](https://github.com/apache/spark/commit/0d9eea8fcffbdd72bdb8dd8b93de3ac9a782fc85). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98164/testReport)** for PR 22784 at commit [`2b7ee7b`](https://github.com/apache/spark/commit/2b7ee7b0a6d2cbcc159826d8dbe286a4a144d463). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98161/testReport)** for PR 22784 at commit [`0d9eea8`](https://github.com/apache/spark/commit/0d9eea8fcffbdd72bdb8dd8b93de3ac9a782fc85). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98145/testReport)** for PR 22784 at commit [`18af032`](https://github.com/apache/spark/commit/18af0325e95552a00983983224795e71f2e66204). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98145/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98144/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98144/testReport)** for PR 22784 at commit [`094594b`](https://github.com/apache/spark/commit/094594bf63a22be65bac7b31932d5d870f1142d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98145/testReport)** for PR 22784 at commit [`18af032`](https://github.com/apache/spark/commit/18af0325e95552a00983983224795e71f2e66204). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98144 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98144/testReport)** for PR 22784 at commit [`094594b`](https://github.com/apache/spark/commit/094594bf63a22be65bac7b31932d5d870f1142d3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98141/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98141 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98141/testReport)** for PR 22784 at commit [`3cbe017`](https://github.com/apache/spark/commit/3cbe017c640764db0fe95bcc2a820917bbc5fb3e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98140/testReport)** for PR 22784 at commit [`5674e17`](https://github.com/apache/spark/commit/5674e177b7894d61904c6748dbf7721359163938). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98140/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98141/testReport)** for PR 22784 at commit [`3cbe017`](https://github.com/apache/spark/commit/3cbe017c640764db0fe95bcc2a820917bbc5fb3e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98140/testReport)** for PR 22784 at commit [`5674e17`](https://github.com/apache/spark/commit/5674e177b7894d61904c6748dbf7721359163938). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98135/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98135/testReport)** for PR 22784 at commit [`a8c4391`](https://github.com/apache/spark/commit/a8c43919a5d8624a5a5ddf7ea862a93f2db098c6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98134/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98134/testReport)** for PR 22784 at commit [`b1789d7`](https://github.com/apache/spark/commit/b1789d7a2305c53b463960e1d60f85abde5934ad). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98135/testReport)** for PR 22784 at commit [`a8c4391`](https://github.com/apache/spark/commit/a8c43919a5d8624a5a5ddf7ea862a93f2db098c6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784

Thank you @srowen for the review. I have addressed the comments.

> I wonder if the SVD should be used at even smaller scales? as you point out, it's pretty hard to compute a gramian on even a 40k x 40k matrix.

Yes. We can compute the PCA using SVD even at smaller scales. In fact, when the number of columns is small, Spark SVD performs the eigendecomposition by computing the Gramian matrix first, which is the same approach as in PCA. The condition that decides whether Spark SVD computes the Gramian matrix first is here:

https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L232-L243

So, for a smaller number of columns (< 15000), Spark SVD prefers computing the Gramian matrix first and then computing the SVD, which is the same as the current implementation of PCA.
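The rule of thumb in the comment above can be sketched as a small decision function. This is a deliberately simplified illustration and not the code at the linked lines: the names `SvdStrategySketch`, `GramianFirst`, `DistributedMultiply` and `choose` are hypothetical, and the only input taken from the thread is the 15000-column threshold. The real condition in `RowMatrix.scala` may consider more than the column count.

```scala
// Simplified sketch of "small column count => compute the Gramian first".
// All names here are hypothetical; only the 15000 threshold comes from the thread.
object SvdStrategySketch {
  sealed trait Strategy
  case object GramianFirst extends Strategy        // build the n x n Gramian, then decompose it locally
  case object DistributedMultiply extends Strategy // avoid materialising the Gramian on the driver

  // Threshold quoted in the comment above (assumed to apply to the column count alone).
  val ColumnThreshold: Int = 15000

  def choose(numCols: Int): Strategy =
    if (numCols < ColumnThreshold) GramianFirst else DistributedMultiply
}
```

Under this simplification, `SvdStrategySketch.choose(40000)` selects the distributed path, which is the case the PR is concerned with.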
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #98134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98134/testReport)** for PR 22784 at commit [`b1789d7`](https://github.com/apache/spark/commit/b1789d7a2305c53b463960e1d60f85abde5934ad). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22784

Sorry for my mistake. The '4' key on my keyboard sometimes misbehaves.

> I think, INT_MAX is 2147483647, so n ~= sqrt(2*2147483647) = 65536.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784

Hi @srowen, thanks for the comment. To my knowledge, PCA/SVD is not limited by the number of rows.

1) Currently the number of rows is not a constraint. Ultimately we need to compute the Gramian matrix, or a Gramian matrix-vector product, to compute the SVD. So the SVD computation is limited only by the number of columns.
5) Sparsity only comes into play when computing the Gramian matrix or the Gramian matrix-vector product, in both PCA and Spark SVD. The mean-centered vector will always be dense. Currently PCA is computed with a dense matrix and SVD uses a dense vector, so the only density constraint arises in the matrix-vector product computation.
6) In this PR, if the limit is exceeded, the computation runs in a distributed manner, which the current PCA doesn't support.
2) Currently PCA does not scale with the number of columns. With 40 GB of driver memory, 40,000 columns, and 100,000 (1 lakh) rows, I get the following error:

```
scala> val pca = new PCA(k).fit(rad)
2018-10-22 22:44:23,128 | WARN | main | 4 columns will require at least 12800 megabytes of memory! | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)
2018-10-22 22:47:02,836 | WARN | main | 4 columns will require at least 12800 megabytes of memory! | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)
java.lang.OutOfMemoryError
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
```
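The 12800-megabyte figure in the warning above is consistent with a dense 40,000 x 40,000 matrix of 8-byte doubles, counted in decimal megabytes. The sketch below reproduces that arithmetic. It assumes this is how the estimate is derived; the object and method names are hypothetical and not part of Spark.

```scala
// Back-of-the-envelope memory estimate for a dense n x n Gramian matrix.
// ASSUMPTION: 8 bytes per entry (Double) and 1 MB = 1,000,000 bytes.
object GramianMemory {
  def denseGramianMegabytes(numCols: Int): Long = {
    val n = numCols.toLong // widen first so n * n cannot overflow Int
    n * n * 8L / 1000000L
  }
}
```

For example, `GramianMemory.denseGramianMegabytes(40000)` evaluates to 12800, matching the warning. That is well within a 40 GB driver heap, so the failure in the stack trace arises while serializing into a single byte array rather than from the heap size itself.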
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22784 Hm, as a general comment, is this going to scale? This is making a potentially huge sparse data set dense, and computing a PCA via SVD. I get the idea that it's better to have some option than none, but I wonder if this approach is realistic for a data set with even 100K rows, and if not, is it going to confuse people. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 Hi @kiszk, the maximum is bounded by the following limit: https://github.com/apache/spark/blob/23cfda1547355a823a3b2b2d374e64608c9ce175/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala#L78-L79 where ncv = min(n, 2*k), and normally k << n. For example, with n = 1 million features we can compute the top 100 principal components. The number of principal components to compute is configurable.
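To make the example in the comment above concrete, here is a rough feasibility check. The definition `ncv = min(n, 2*k)` comes from the comment; the bound itself (that an `n x ncv` workspace must be addressable with an `Int` index) is an assumption about what the linked lines enforce, and `ArpackSizeSketch` and its methods are hypothetical names.

```scala
// Rough check of whether (n, k) stays under an Int-indexed workspace limit.
// ASSUMPTION: the limiting quantity is n * ncv <= Int.MaxValue; the linked
// source lines may enforce additional or different conditions.
object ArpackSizeSketch {
  // ncv as stated in the comment: min(n, 2 * k)
  def ncv(n: Int, k: Int): Int = math.min(n, 2 * k)

  def fits(n: Int, k: Int): Boolean =
    n.toLong * ncv(n, k) <= Int.MaxValue // widen to Long before multiplying
}
```

With n = 1,000,000 and k = 100, `ncv` is 200 and the product is 200,000,000, so `fits` returns true. Under the same assumption, asking for k = 2000 components at that width would not fit.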
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22784 One question: after this PR, what is the maximum number of columns that we can accept?
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22784

Can I clarify the description?

> Because we are passing an array of size n*(n+1)/2 to the breeze library and the size cannot be more than INT_MAX. so, the maximum column size we can give is 65,500.

If n > 20726, `n*(n+1)/2` > 214783647 (= INT_MAX). Where does this limitation `65,500` come from?
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 Hi @kiszk , I think, INT_MAX is 2147483647, so n ~= sqrt(2*2147483647) = 65536. Thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
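The estimate in the comment above can be checked exactly. A packed upper-triangular array for an n x n symmetric matrix has n*(n+1)/2 entries, and the sketch below finds the largest n for which that count still fits in an `Int`. The object and method names are hypothetical, chosen only for this illustration.

```scala
// Largest column count n such that n * (n + 1) / 2 <= Int.MaxValue.
object PackedGramianLimit {
  // Entries in the packed upper triangle of an n x n symmetric matrix.
  def packedSize(n: Long): Long = n * (n + 1) / 2

  def maxColumns: Int = {
    // Start from the closed-form estimate sqrt(2 * INT_MAX), then correct
    // in either direction to guard against floating-point rounding.
    var n = math.sqrt(2.0 * Int.MaxValue).toLong
    while (packedSize(n) > Int.MaxValue) n -= 1
    while (packedSize(n + 1) <= Int.MaxValue) n += 1
    n.toInt
  }
}
```

`PackedGramianLimit.maxColumns` evaluates to 65535: the packed size is 2,147,450,880 at n = 65535 and 2,147,516,416 at n = 65536, which exceeds INT_MAX (2,147,483,647). This agrees with the 65535-column figure in the PR title.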
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22784

Can I clarify the description?

> Because we are passing an array of size n*(n+1)/2 to the breeze library and the size cannot be more than INT_MAX. so, the maximum column size we can give is 65,500.

If n > 20726, `n*(n+1)/2` > 214783647 (= INT_MAX). Where does `65,500` come from?
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97683/testReport)** for PR 22784 at commit [`23cfda1`](https://github.com/apache/spark/commit/23cfda1547355a823a3b2b2d374e64608c9ce175). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97683/ Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97683/testReport)** for PR 22784 at commit [`23cfda1`](https://github.com/apache/spark/commit/23cfda1547355a823a3b2b2d374e64608c9ce175).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 All the UTs pass locally, so this seems to be a random error. retest this please.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97682/ Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97682/testReport)** for PR 22784 at commit [`9aff54f`](https://github.com/apache/spark/commit/9aff54fecc530c77e5f97941e15a478a421827d0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97682/testReport)** for PR 22784 at commit [`9aff54f`](https://github.com/apache/spark/commit/9aff54fecc530c77e5f97941e15a478a421827d0).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22784 retest this please
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97680/ Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97680 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97680/testReport)** for PR 22784 at commit [`9aff54f`](https://github.com/apache/spark/commit/9aff54fecc530c77e5f97941e15a478a421827d0). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97679/ Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97679/testReport)** for PR 22784 at commit [`4c1776f`](https://github.com/apache/spark/commit/4c1776f14f1453c2f64350a58cab7209764f826c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97680/testReport)** for PR 22784 at commit [`9aff54f`](https://github.com/apache/spark/commit/9aff54fecc530c77e5f97941e15a478a421827d0).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97677/ Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97677/testReport)** for PR 22784 at commit [`e6cf661`](https://github.com/apache/spark/commit/e6cf6612ee488ace7bfc11db26ee4d6cc72e3368). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test PASSed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97679/testReport)** for PR 22784 at commit [`4c1776f`](https://github.com/apache/spark/commit/4c1776f14f1453c2f64350a58cab7209764f826c).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97678/ Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97678 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97678/testReport)** for PR 22784 at commit [`9111fca`](https://github.com/apache/spark/commit/9111fcab296ca71bfa280010d60aa803ece69509). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97678 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97678/testReport)** for PR 22784 at commit [`9111fca`](https://github.com/apache/spark/commit/9111fcab296ca71bfa280010d60aa803ece69509).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97677/testReport)** for PR 22784 at commit [`e6cf661`](https://github.com/apache/spark/commit/e6cf6612ee488ace7bfc11db26ee4d6cc72e3368).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 retest this please
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97673/ Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97673/testReport)** for PR 22784 at commit [`1252526`](https://github.com/apache/spark/commit/12525266fe76b767974e5ba94cd131251bc7ed3e). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Merged build finished. Test FAILed.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22784 **[Test build #97673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97673/testReport)** for PR 22784 at commit [`1252526`](https://github.com/apache/spark/commit/12525266fe76b767974e5ba94cd131251bc7ed3e).
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22784 ok to test.
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 cc @srowen
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22784 Test results comparing the existing PCA with PCA computed using SVD, without computing the covariance matrix. Input: val data = Array( Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0), Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)) 1) PCA using the covariance matrix: explained variance = [0.7943932532, 0.2056067468, 1.26E-16] Top 2 principal components: [[-0.44859172075072673 -0.28423808214073987 0.13301985745398526 -0.05621155904253121 -0.1252315635978212 0.7636264774662965 0.21650756651919933 -0.5652958773533949 -0.8476512931126826 -0.11560340501314653]] 2) PCA using SVD, without computing the covariance matrix: explained variance = [0.7943932532, 0.2056067468, 5.55E-17] Top 2 principal components: [[-0.44859172075072673 -0.2842380821407399 0.13301985745398529 -0.056211559042531424 -0.12523156359782125 0.7636264774662964 0.21650756651919945 -0.5652958773533953 -0.8476512931126826 -0.11560340501314664]] **Leading eigenvalues MSE = 0.0. Leading eigenvectors MSE = 0.0.**
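The reported MSE of 0.0 can be sanity-checked against the figures in the comment above. The comment does not say how the MSE was computed, so this minimal Scala sketch assumes the plain mean of squared element-wise differences; `PcaMseCheck` and `mse` are illustrative names, not part of the PR.

```scala
object PcaMseCheck {
  // Top 2 principal components as reported for the covariance-based PCA.
  val covariancePc: Array[Double] = Array(
    -0.44859172075072673, -0.28423808214073987,
    0.13301985745398526, -0.05621155904253121,
    -0.1252315635978212, 0.7636264774662965,
    0.21650756651919933, -0.5652958773533949,
    -0.8476512931126826, -0.11560340501314653)

  // Top 2 principal components as reported for the SVD-based PCA.
  val svdPc: Array[Double] = Array(
    -0.44859172075072673, -0.2842380821407399,
    0.13301985745398529, -0.056211559042531424,
    -0.12523156359782125, 0.7636264774662964,
    0.21650756651919945, -0.5652958773533953,
    -0.8476512931126826, -0.11560340501314664)

  // Mean squared error between two equally sized arrays (assumed definition).
  def mse(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "arrays must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum / a.length
  }

  def main(args: Array[String]): Unit = {
    println(mse(covariancePc, svdPc))
  }
}
```

The two sets of components differ only in the last few digits, so under this definition the error is far below any practical tolerance and rounds to 0.0.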
[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22784 Can one of the admins verify this patch?