[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24302522
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -19,10 +19,12 @@ package org.apache.spark.mllib.clustering
 
 import scala.collection.mutable.IndexedSeq
 
-import breeze.linalg.{DenseMatrix = BreezeMatrix, DenseVector = 
BreezeVector, Transpose, diag}
+import breeze.linalg.{DenseVector = BDV, Vector = BV, SparseVector = 
BSV,
+  DenseMatrix = BreezeMatrix, diag, Transpose}
 
 import org.apache.spark.annotation.Experimental
-import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, 
Matrices, Vector, Vectors}
+import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors, 
DenseVector,
--- End diff --

alphabetize these


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24302656
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -215,20 +217,29 @@ private object ExpectationSum {
   def add(
   weights: Array[Double], 
   dists: Array[MultivariateGaussian])
-  (sums: ExpectationSum, x: BreezeVector[Double]): ExpectationSum = {
+  (sums: ExpectationSum, x: BV[Double]): ExpectationSum = {
 val p = weights.zip(dists).map {
   case (weight, dist) = MLUtils.EPSILON + weight * dist.pdf(x)
 }
 val pSum = p.sum
 sums.logLikelihood += math.log(pSum)
-val xxt = x * new Transpose(x)
 var i = 0
+val isSparse = x match {
+  case _: BSV[Double] = true
+  case _: BDV[Double] = false
+}
 while (i  sums.k) {
   p(i) /= pSum
   sums.weights(i) += p(i)
   sums.means(i) += x * p(i)
-  BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector],
-Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
+  if (isSparse){
+BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[SparseVector],
+  Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
+  }
+  else {
+BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector],
+  Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
+  }
--- End diff --

I would push the logic of checking for sparsity into the BLAS.syr method... 
(many other BLAS routines already do this... see BLAS.{axpy, copy, dot, gemm, 
scal, ...}) ... then this could simply be

BLAS.syr(p(i), Vectors.fromBreeze(x), 
Matrices.fromBreeze(sums.sigmas(i).asInstanceOf[DenseMatrix])

(obviating the need for the isSparse value)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24302677
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
 ---
@@ -40,10 +41,15 @@ class GaussianMixtureSuite extends FunSuite with 
MLlibTestSparkContext {
 val seeds = Array(314589, 29032897, 50181, 494821, 4660)
 seeds.foreach { seed =
   val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data)
+  val sparseGMM = new 
GaussianMixture().setK(1).setSeed(seed).run(sparseData)
   assert(gmm.weights(0) ~== Ew absTol 1E-5)
   assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5)
   assert(gmm.gaussians(0).sigma ~== Esigma absTol 1E-5)
+  assert(sparseGMM.weights(0) ~== Ew absTol 1E-5)
+  assert(sparseGMM.gaussians(0).mu ~== Emu absTol 1E-5)
+  assert(sparseGMM.gaussians(0).sigma ~== Esigma absTol 1E-5)
--- End diff --

It might be good to create separate test cases for sparse inputs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread tgaloppo
Github user tgaloppo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24302764
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -255,6 +255,20 @@ private[spark] object BLAS extends Serializable with 
Logging {
 }
   }
 
+  // Workaround for SparseVectors.
+  def syr(alpha: Double, x: SparseVector, A: DenseMatrix) {
+val mA = A.numRows
+val nA = A.numCols
+require(mA == nA, sA is not a symmetric matrix. A: $mA x $nA)
--- End diff --

Just a general comment -- I find this message a little misleading, since it 
really only checks for squareness; I realize this is copied from the dense 
syr() [I found it misleading there, too].


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4459#issuecomment-73434003
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27049/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4459#issuecomment-73434001
  
  [Test build #27049 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27049/consoleFull)
 for   PR 4459 at commit 
[`6d8d4bb`](https://github.com/apache/spark/commit/6d8d4bbcadde1fecb12989ae49c4ffe975979aa1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4459#issuecomment-73430021
  
  [Test build #27049 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27049/consoleFull)
 for   PR 4459 at commit 
[`6d8d4bb`](https://github.com/apache/spark/commit/6d8d4bbcadde1fecb12989ae49c4ffe975979aa1).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread MechCoder
GitHub user MechCoder opened a pull request:

https://github.com/apache/spark/pull/4459

[SPARK-5021] Gaussian Mixture now supports Sparse Input

Following discussion in the Jira.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MechCoder/spark sparse_gmm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4459


commit 6d8d4bbcadde1fecb12989ae49c4ffe975979aa1
Author: MechCoder manojkumarsivaraj...@gmail.com
Date:   2015-02-08T20:07:06Z

[SPARK-5021] Gaussian Mixture now supports Sparse Input




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/4459#issuecomment-73429922
  
ping @tgaloppo and @jkbradley (whenever you are back!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24310707
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
@@ -215,20 +217,29 @@ private object ExpectationSum {
   def add(
   weights: Array[Double], 
   dists: Array[MultivariateGaussian])
-  (sums: ExpectationSum, x: BreezeVector[Double]): ExpectationSum = {
+  (sums: ExpectationSum, x: BV[Double]): ExpectationSum = {
 val p = weights.zip(dists).map {
   case (weight, dist) = MLUtils.EPSILON + weight * dist.pdf(x)
 }
 val pSum = p.sum
 sums.logLikelihood += math.log(pSum)
-val xxt = x * new Transpose(x)
 var i = 0
+val isSparse = x match {
+  case _: BSV[Double] = true
+  case _: BDV[Double] = false
+}
 while (i  sums.k) {
   p(i) /= pSum
   sums.weights(i) += p(i)
   sums.means(i) += x * p(i)
-  BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector],
-Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
+  if (isSparse){
+BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[SparseVector],
+  Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
+  }
+  else {
+BLAS.syr(p(i), Vectors.fromBreeze(x).asInstanceOf[DenseVector],
+  Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
+  }
--- End diff --

If I do that, I get a type check error, saying value values is not a member 
of org.apache.spark.mllib.linalg.Vector which might explain why there are 
separate dot methods for Sparse and Dense Vectors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4459#issuecomment-73463364
  
  [Test build #27087 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27087/consoleFull)
 for   PR 4459 at commit 
[`c235bd0`](https://github.com/apache/spark/commit/c235bd0f787e39bfd032db45e5ea2da2329c671a).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24310793
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
 ---
@@ -40,10 +41,15 @@ class GaussianMixtureSuite extends FunSuite with 
MLlibTestSparkContext {
 val seeds = Array(314589, 29032897, 50181, 494821, 4660)
 seeds.foreach { seed =
   val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data)
+  val sparseGMM = new 
GaussianMixture().setK(1).setSeed(seed).run(sparseData)
   assert(gmm.weights(0) ~== Ew absTol 1E-5)
   assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5)
   assert(gmm.gaussians(0).sigma ~== Esigma absTol 1E-5)
+  assert(sparseGMM.weights(0) ~== Ew absTol 1E-5)
+  assert(sparseGMM.gaussians(0).mu ~== Emu absTol 1E-5)
+  assert(sparseGMM.gaussians(0).sigma ~== Esigma absTol 1E-5)
--- End diff --

I was thinking that there would be something more to this data, than 
randomly generated numbers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/4459#discussion_r24308815
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala
 ---
@@ -40,10 +41,15 @@ class GaussianMixtureSuite extends FunSuite with 
MLlibTestSparkContext {
 val seeds = Array(314589, 29032897, 50181, 494821, 4660)
 seeds.foreach { seed =
   val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data)
+  val sparseGMM = new 
GaussianMixture().setK(1).setSeed(seed).run(sparseData)
   assert(gmm.weights(0) ~== Ew absTol 1E-5)
   assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5)
   assert(gmm.gaussians(0).sigma ~== Esigma absTol 1E-5)
+  assert(sparseGMM.weights(0) ~== Ew absTol 1E-5)
+  assert(sparseGMM.gaussians(0).mu ~== Emu absTol 1E-5)
+  assert(sparseGMM.gaussians(0).sigma ~== Esigma absTol 1E-5)
--- End diff --

How do I come up with the data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5021] Gaussian Mixture now supports Spa...

2015-02-08 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/4459#issuecomment-73456587
  
@tgaloppo Thanks. I shall fix them in a while. However, does the general 
code look good to you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org