[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73463364 [Test build #27087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27087/consoleFull) for PR 4459 at commit [`c235bd0`](https://github.com/apache/spark/commit/c235bd0f787e39bfd032db45e5ea2da2329c671a). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24310793 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -40,10 +41,15 @@ class GaussianMixtureSuite extends FunSuite with MLlibTestSparkContext { val seeds = Array(314589, 29032897, 50181, 494821, 4660) seeds.foreach { seed => val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data) + val sparseGMM = new GaussianMixture().setK(1).setSeed(seed).run(sparseData) assert(gmm.weights(0) ~== Ew absTol 1E-5) assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5) assert(gmm.gaussians(0).sigma ~== Esigma absTol 1E-5) + assert(sparseGMM.weights(0) ~== Ew absTol 1E-5) + assert(sparseGMM.gaussians(0).mu ~== Emu absTol 1E-5) + assert(sparseGMM.gaussians(0).sigma ~== Esigma absTol 1E-5) --- End diff -- I was thinking that there would be something more to this data, than randomly generated numbers. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24310707 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -215,20 +217,29 @@ private object ExpectationSum { def add( weights: Array[Double], dists: Array[MultivariateGaussian]) - (sums: ExpectationSum, x: BreezeVector[Double]): ExpectationSum = { + (sums: ExpectationSum, x: BV[Double]): ExpectationSum = { val p = weights.zip(dists).map { case (weight, dist) => MLUtils.EPSILON + weight * dist.pdf(x) } val pSum = p.sum sums.logLikelihood += math.log(pSum) -val xxt = x * new Transpose(x) var i = 0 +val isSparse = x match { + case _: BSV[Double] => true + case _: BDV[Double] => false +} while (i < sums.k) { p(i) /= pSum sums.weights(i) += p(i) sums.means(i) += x * p(i) - BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector], -Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix]) + if (isSparse){ +BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[SparseVector], + Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix]) + } + else { +BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector], + Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix]) + } --- End diff -- If I do that, I get a type check error, saying value values is not a member of org.apache.spark.mllib.linalg.Vector which might explain why there are separate dot methods for Sparse and Dense Vectors. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73456587 @tgaloppo Thanks. I shall fix them in a while. However, does the general code look good to you? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24308815 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -40,10 +41,15 @@ class GaussianMixtureSuite extends FunSuite with MLlibTestSparkContext { val seeds = Array(314589, 29032897, 50181, 494821, 4660) seeds.foreach { seed => val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data) + val sparseGMM = new GaussianMixture().setK(1).setSeed(seed).run(sparseData) assert(gmm.weights(0) ~== Ew absTol 1E-5) assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5) assert(gmm.gaussians(0).sigma ~== Esigma absTol 1E-5) + assert(sparseGMM.weights(0) ~== Ew absTol 1E-5) + assert(sparseGMM.gaussians(0).mu ~== Emu absTol 1E-5) + assert(sparseGMM.gaussians(0).sigma ~== Esigma absTol 1E-5) --- End diff -- How do I come up with the data? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73434003 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27049/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73434001 [Test build #27049 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27049/consoleFull) for PR 4459 at commit [`6d8d4bb`](https://github.com/apache/spark/commit/6d8d4bbcadde1fecb12989ae49c4ffe975979aa1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302764 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala --- @@ -255,6 +255,20 @@ private[spark] object BLAS extends Serializable with Logging { } } + // Workaround for SparseVectors. + def syr(alpha: Double, x: SparseVector, A: DenseMatrix) { +val mA = A.numRows +val nA = A.numCols +require(mA == nA, s"A is not a symmetric matrix. A: $mA x $nA") --- End diff -- Just a general comment -- I find this message a little misleading, since it really only checks for squareness; I realize this is copied from the dense syr() [I found it misleading there, too]. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302677 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala --- @@ -40,10 +41,15 @@ class GaussianMixtureSuite extends FunSuite with MLlibTestSparkContext { val seeds = Array(314589, 29032897, 50181, 494821, 4660) seeds.foreach { seed => val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data) + val sparseGMM = new GaussianMixture().setK(1).setSeed(seed).run(sparseData) assert(gmm.weights(0) ~== Ew absTol 1E-5) assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5) assert(gmm.gaussians(0).sigma ~== Esigma absTol 1E-5) + assert(sparseGMM.weights(0) ~== Ew absTol 1E-5) + assert(sparseGMM.gaussians(0).mu ~== Emu absTol 1E-5) + assert(sparseGMM.gaussians(0).sigma ~== Esigma absTol 1E-5) --- End diff -- It might be good to create separate test cases for sparse inputs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302656 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -215,20 +217,29 @@ private object ExpectationSum { def add( weights: Array[Double], dists: Array[MultivariateGaussian]) - (sums: ExpectationSum, x: BreezeVector[Double]): ExpectationSum = { + (sums: ExpectationSum, x: BV[Double]): ExpectationSum = { val p = weights.zip(dists).map { case (weight, dist) => MLUtils.EPSILON + weight * dist.pdf(x) } val pSum = p.sum sums.logLikelihood += math.log(pSum) -val xxt = x * new Transpose(x) var i = 0 +val isSparse = x match { + case _: BSV[Double] => true + case _: BDV[Double] => false +} while (i < sums.k) { p(i) /= pSum sums.weights(i) += p(i) sums.means(i) += x * p(i) - BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector], -Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix]) + if (isSparse){ +BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[SparseVector], + Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix]) + } + else { +BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector], + Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix]) + } --- End diff -- I would push the logic of checking for sparsity into the BLAS.syr method... (many other BLAS routines already do this... see BLAS.{axpy, copy, dot, gemm, scal, ...}) ... then this could simply be BLAS.syr(p(i), Vectors.fromBreeze(x), Matrices.fromBreeze(sums.sigmas(i).asInstanceOf[DenseMatrix]) (obviating the need for the isSparse value) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user tgaloppo commented on a diff in the pull request: https://github.com/apache/spark/pull/4459#discussion_r24302522 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -19,10 +19,12 @@ package org.apache.spark.mllib.clustering import scala.collection.mutable.IndexedSeq -import breeze.linalg.{DenseMatrix => BreezeMatrix, DenseVector => BreezeVector, Transpose, diag} +import breeze.linalg.{DenseVector => BDV, Vector => BV, SparseVector => BSV, + DenseMatrix => BreezeMatrix, diag, Transpose} import org.apache.spark.annotation.Experimental -import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, Matrices, Vector, Vectors} +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors, DenseVector, --- End diff -- alphabetize these --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73430021 [Test build #27049 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27049/consoleFull) for PR 4459 at commit [`6d8d4bb`](https://github.com/apache/spark/commit/6d8d4bbcadde1fecb12989ae49c4ffe975979aa1). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4459#issuecomment-73429922 ping @tgaloppo and @jkbradley (whenever you are back!) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/4459 [SPARK-5021] Gaussian Mixture now supports Sparse Input Following discussion in the Jira. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MechCoder/spark sparse_gmm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4459.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4459 commit 6d8d4bbcadde1fecb12989ae49c4ffe975979aa1 Author: MechCoder Date: 2015-02-08T20:07:06Z [SPARK-5021] Gaussian Mixture now supports Sparse Input --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org