[jira] [Commented] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data

2015-03-18 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367028#comment-14367028
 ] 

Travis Galoppo commented on SPARK-6398:
---

Please assign to me

 Improve utility of GaussianMixture for higher dimensional data
 -

 Key: SPARK-6398
 URL: https://issues.apache.org/jira/browse/SPARK-6398
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Travis Galoppo

 The current EM implementation for GaussianMixture protects itself from 
 numerical instability at the expense of utility in high dimensions.  A few 
 options exist for extending utility into higher dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data

2015-03-18 Thread Travis Galoppo (JIRA)
Travis Galoppo created SPARK-6398:
-

 Summary: Improve utility of GaussianMixture for higher dimensional 
data
 Key: SPARK-6398
 URL: https://issues.apache.org/jira/browse/SPARK-6398
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Travis Galoppo


The current EM implementation for GaussianMixture protects itself from 
numerical instability at the expense of utility in high dimensions.  A few 
options exist for extending utility into higher dimensions.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-20 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328868#comment-14328868
 ] 

Travis Galoppo commented on SPARK-5016:
---

@mechcoder This falls directly from the (2*pi)^-(k/2) term in the pdf... 

(2*pi)^(-k/2) < eps
log_(2*pi)((2*pi)^(-k/2)) < log_(2*pi)(eps)
-k/2 < log(eps) / log(2*pi)
k > -2 * log(eps) / log(2*pi)
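The resulting threshold is easy to check numerically; a quick sketch (the eps values match the figures quoted elsewhere on this ticket, and the helper name is illustrative):

{code}
object UnderflowBound {
  // Smallest k for which (2*pi)^(-k/2) < eps, i.e. k > -2 * log(eps) / log(2*pi)
  def minUnderflowDim(eps: Double): Int =
    math.ceil(-2.0 * math.log(eps) / math.log(2.0 * math.Pi)).toInt

  def main(args: Array[String]): Unit = {
    println(minUnderflowDim(2.2204e-16)) // 40
    println(minUnderflowDim(1e-52))      // 131
  }
}
{code}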



 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-18 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326229#comment-14326229
 ] 

Travis Galoppo commented on SPARK-5016:
---

[~josephkb] My previous comment got me thinking about how to make the algorithm 
usable in higher dimensions... the underflow problem is caused by the addition 
of EPSILON to every likelihood value computed; this is done to avoid some 
numerical gotchas... but EPSILON is determined such that 1.0 + (EPSILON / 2) == 
1.0, which dominates the densities as dimension increases.  We could derive a 
smaller epsilon value based on the maximum density that we expect to see, e.g., 
such that x + (EPSILON / 2) == x, where x = (2 * pi)^-(k/2) ... this, of 
course, is somewhat simplified because it assumes the covariance matrix has 
determinant of 1, but it would lead to a lower epsilon value and likely extend 
the utility of the algorithm into higher dimensions ... and likely make this 
ticket more relevant.
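A minimal sketch of that idea (EPSILON computed the same way MLlib's MLUtils does; the object and method names here are illustrative, not an existing API):

{code}
object ScaledEpsilon {
  // Machine epsilon: halve until 1.0 + eps/2 == 1.0, as MLlib computes EPSILON.
  lazy val EPSILON: Double = {
    var eps = 1.0
    while ((1.0 + (eps / 2.0)) != 1.0) eps /= 2.0
    eps
  }

  // Scale by the peak density x = (2*pi)^(-k/2) of a k-dimensional standard
  // Gaussian, so that x + (scaled / 2) == x at the mode; this assumes a
  // covariance determinant of 1, as noted above.
  def forDimension(k: Int): Double = math.pow(2.0 * math.Pi, -k / 2.0) * EPSILON
}
{code}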


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-16 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323412#comment-14323412
 ] 

Travis Galoppo commented on SPARK-5016:
---

Realistically, I think it will be very difficult to realize any performance 
increase from this modification.  In particular, the algorithm simply will not 
work well in high enough dimension to make it worthwhile (from the numFeatures 
perspective, anyway) ... consider that the density of a Multivariate Gaussian 
will underflow EPSILON *at the mean* when numFeatures > -2 * log(EPSILON) / 
log(2*pi) ... this means 40 features will underflow 2.2204e-16 (eps in Octave 
on my laptop), and 131 features would underflow 1e-52; as the pdf approaches 
EPS, it will assign points uniformly to all clusters... so it breaks.  These 
are not particularly large matrices ... I'm guessing the SVD time is too small 
to make the extra communication worthwhile.  At a minimum, I would suggest some 
solid benchmarking to make sure this is a real improvement.


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-14 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321538#comment-14321538
 ] 

Travis Galoppo commented on SPARK-5016:
---

@mechcoder I may well be missing something simple here... but the sums for each 
cluster are not independent... you need the sums of the likelihoods from each 
to compute the partial assignments (see 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L217)...
 so it seems to me there would be an additional communication step involved in 
this.

Again, I may be missing something simple.


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-14 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321554#comment-14321554
 ] 

Travis Galoppo commented on SPARK-5016:
---

Right, unless each reducer is computing the likelihood for all clusters (just 
to update a single cluster)... essentially doing k times as much work as is 
currently done... which brings me back to my feeling of awkwardness.


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-13 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321099#comment-14321099
 ] 

Travis Galoppo commented on SPARK-5016:
---

Hmm. I'm having trouble conceptualizing how to use aggregateByKey here; the 
breezeData RDD is not keyed.  We could have a keyed RDD of expectation sums 
(with a little rework), but each entry in the breezeData RDD would need to be 
operated on by each reducer (which seems awkward?)... or I'm way off?  


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-09 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312269#comment-14312269
 ] 

Travis Galoppo commented on SPARK-5016:
---

The k Gaussians are updated with code that right now looks like

{code}
var i = 0
while (i < k) {
  val mu = sums.means(i) / sums.weights(i)
  BLAS.syr(-sums.weights(i),
    Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
    Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
  weights(i) = sums.weights(i) / sumWeights
  gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
  i = i + 1
}
{code}

... the matrix inversion (or, in reality, partial inversion... the inverse is 
not explicitly calculated) occurs during the creation of the 
MultivariateGaussian objects...  this code could be parallelized something like:

{code}
val (ws, gs) = sc.parallelize(0 until k).map { i =>
  val mu = sums.means(i) / sums.weights(i)
  BLAS.syr(-sums.weights(i),
    Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
    Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
  val weight = sums.weights(i) / sumWeights
  val gaussian = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
  (weight, gaussian)
}.collect.unzip

(0 until k).foreach { i =>
  weights(i) = ws(i)
  gaussians(i) = gs(i)
}
{code}

... effectively distributing the computation of the k MultivariateGaussians 
(and their weights).  

As for the threshold values for k / numFeatures... this is probably a function 
of cluster size and interconnect speed.  These thresholds should probably be 
optional parameters to GaussianMixture.  Personally, I would vote for the 
default behavior to not perform this parallelization, and let the user decide 
when the time is right to allow it.


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-05 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307196#comment-14307196
 ] 

Travis Galoppo commented on SPARK-5021:
---

[~MechCoder] It is probably better to get something working, submit a PR 
(perhaps mark it [WIP]) and work out the kinks in the review process.

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305658#comment-14305658
 ] 

Travis Galoppo commented on SPARK-5021:
---

Why not something like:
{code}
private def vectorMean(x: IndexedSeq[BV[Double]]): BV[Double] = {
  val v = x(0) match {
    case _: BSV[Double] => BSV.zeros[Double](x(0).length)
    case _: BDV[Double] => BDV.zeros[Double](x(0).length)
  }
  x.foreach(xi => v += xi)
  v / x.length.toDouble
}
{code}

...where BV, BSV, BDV are breeze vector, sparse vector, and dense vector, 
respectively...


 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305775#comment-14305775
 ] 

Travis Galoppo commented on SPARK-5021:
---

For the vectorMean function, the resulting vector may well be considerably more 
dense than the input vectors; however, the computed means may become more 
sparse with each iteration if the clusters are represented through density in 
different regions of the input vector.  Although this does have me thinking... 
since the assignments are soft, it is likely that very few vector entries will 
become zero... I'm not sure what the tolerance is for zero entries, but the 
soft nature of the assignments may undermine the performance benefit of working 
with sparse vectors.



 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305775#comment-14305775
 ] 

Travis Galoppo edited comment on SPARK-5021 at 2/4/15 7:23 PM:
---

For the vectorMean function, the resulting vector may well be considerably more 
dense than the input vectors (it is called only once, with a set of random 
vectors); however, the computed means may become more sparse with each 
iteration if the clusters are represented through density in different regions 
of the input vector.  Although this does have me thinking... since the 
assignments are soft, it is likely that very few vector entries will become 
zero... I'm not sure what the tolerance is for zero entries, but the soft 
nature of the assignments may undermine the performance benefit of working with 
sparse vectors.







 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-04 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305171#comment-14305171
 ] 

Travis Galoppo commented on SPARK-5021:
---

[~MechCoder] You may be making things harder on yourself than necessary.  The 
current code maps the incoming vectors to dense breeze vectors, but you can 
simply map them to generic breeze vectors, i.e.

(GaussianMixture.scala, line 126) val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
becomes
val breezeData = data.map(_.toBreeze).cache()

then genericize everything expecting a dense breeze vector/matrix to expect 
just a generic vector/matrix... when the time finally arrives where the cases 
must be separated, you can match on the variable, i.e.:

def foo(x: BreezeVector) = {
  x match {
    case dx: DenseBreezeVector => // do dense vector calculation
    case sx: SparseBreezeVector => // do sparse vector calculation
  }
}
...

I know this is kind of high level... but it could avoid a lot of dual-path code.


 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Commented] (SPARK-5013) User guide for Gaussian Mixture Model

2015-02-03 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14303522#comment-14303522
 ] 

Travis Galoppo commented on SPARK-5013:
---

Great! I will submit a PR soon.

 User guide for Gaussian Mixture Model
 -

 Key: SPARK-5013
 URL: https://issues.apache.org/jira/browse/SPARK-5013
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Travis Galoppo

 Add GMM user guide with code examples in Scala/Java/Python.






[jira] [Commented] (SPARK-5013) User guide for Gaussian Mixture Model

2015-01-31 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299907#comment-14299907
 ] 

Travis Galoppo commented on SPARK-5013:
---

Does this amount to adding a description and code examples to 
docs/mllib-clustering.md ?
Please assign to me and I will get started on this.  I can finalize when the 
python API is merged.


 User guide for Gaussian Mixture Model
 -

 Key: SPARK-5013
 URL: https://issues.apache.org/jira/browse/SPARK-5013
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Xiangrui Meng

 Add GMM user guide with code examples in Scala/Java/Python.






[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297634#comment-14297634
 ] 

Travis Galoppo commented on SPARK-5021:
---

[~josephkb] This ticket is marked as affecting version 1.2.0 ... should this be 
1.3.0?

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-29 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297613#comment-14297613
 ] 

Travis Galoppo commented on SPARK-5400:
---

Please assign to me and I will make the name change


 Rename GaussianMixtureEM to GaussianMixture
 ---

 Key: SPARK-5400
 URL: https://issues.apache.org/jira/browse/SPARK-5400
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 GaussianMixtureEM is following the old naming convention of including the 
 optimization algorithm name in the class title.  We should probably rename it 
 to GaussianMixture so that it can use other optimization algorithms in the 
 future.






[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297622#comment-14297622
 ] 

Travis Galoppo commented on SPARK-5021:
---

[~MechCoder] The documentation for GMM is not yet completed (see SPARK-5013) 
... the python interface is still being completed (SPARK-5012) and then the 
documentation can be completed.  In the meantime, I might be able to answer 
your questions around the GMM code...


 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.






[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-24 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290971#comment-14290971
 ] 

Travis Galoppo commented on SPARK-5400:
---

Hmm.  This has me thinking in a different direction.  We could generalize the 
expectation-maximization algorithm to work with any mixture model supporting a 
set of necessary likelihood compute/update methods... then we could ask for, 
e.g., new ExpectationMaximization[GaussianMixtureModel].  This would 
de-couple the model and the algorithm, and could open the door for the 
implementation to be applied to (for instance) tomographic image reconstruction 
(which seems like a great fit for Spark given the volume of data involved).
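Sketched very roughly (all names below are hypothetical, not an existing MLlib API):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// A mixture component exposes just what EM needs: a likelihood and an update.
trait MixtureComponent[C <: MixtureComponent[C]] {
  def likelihood(point: Vector): Double                   // p(x | component)
  def update(points: RDD[Vector], gamma: RDD[Double]): C  // M-step re-estimate
}

// The EM driver is then independent of the component family; a Gaussian
// mixture would plug in via a MultivariateGaussian-backed component.
class ExpectationMaximization[C <: MixtureComponent[C]](init: Seq[C], maxIter: Int)
{code}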


 Rename GaussianMixtureEM to GaussianMixture
 ---

 Key: SPARK-5400
 URL: https://issues.apache.org/jira/browse/SPARK-5400
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 GaussianMixtureEM is following the old naming convention of including the 
 optimization algorithm name in the class title.  We should probably rename it 
 to GaussianMixture so that it can use other optimization algorithms in the 
 future.






[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-17 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281543#comment-14281543
 ] 

Travis Galoppo commented on SPARK-5019:
---

This ticket is currently stalling SPARK-5012.

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Blocker

 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.






[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278947#comment-14278947
 ] 

Travis Galoppo commented on SPARK-5012:
---

This will probably be affected by SPARK-5019


 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Meethu Mathew

 Add Python API for the Scala implementation of GMM.






[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-14 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276943#comment-14276943
 ] 

Travis Galoppo commented on SPARK-5019:
---

I have a patch prepared for this; it is generally the same as [~lewuathe]'s 
patch, but takes into account recent changes with MultivariateGaussian and 
completely removes the mu/sigma parameters from GaussianMixtureModel (with code 
updates reflecting such in GaussianMixtureModelEM and the test suite).


 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Blocker

 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.






[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-12 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273565#comment-14273565
 ] 

Travis Galoppo commented on SPARK-5019:
---

[~lewuathe] Are you still interested in working on this ticket? SPARK-5018 is 
now complete.

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Blocker

 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.






[jira] [Commented] (SPARK-5018) Make MultivariateGaussian public

2015-01-06 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266907#comment-14266907
 ] 

Travis Galoppo commented on SPARK-5018:
---

Please assign this ticket to me.


 Make MultivariateGaussian public
 

 Key: SPARK-5018
 URL: https://issues.apache.org/jira/browse/SPARK-5018
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Critical

 MultivariateGaussian is currently private[ml], but it would be a useful 
 public class.  This JIRA will require defining a good public API for 
 distributions.
 This JIRA will be needed for finalizing the GaussianMixtureModel API, which 
 should expose MultivariateGaussian instances instead of the means and 
 covariances.






[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-06 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266909#comment-14266909
 ] 

Travis Galoppo commented on SPARK-5019:
---

This really can't be completed until MultivariateGaussian is made public 
(SPARK-5018).

 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Blocker

 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.






[jira] [Comment Edited] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-06 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267061#comment-14267061
 ] 

Travis Galoppo edited comment on SPARK-5019 at 1/7/15 12:24 AM:


No problem, [~lewuathe]... I have just started work on SPARK-5018.  If you 
would like to re-visit this ticket once that one is complete, that would be 
great!



was (Author: tgaloppo):
No problem,@lewuathe ... I have just started work on SPARK-5018.  If you would 
like to re-visit this ticket once that one is complete, that would be great!


 Update GMM API to use MultivariateGaussian
 --

 Key: SPARK-5019
 URL: https://issues.apache.org/jira/browse/SPARK-5019
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Blocker

 The GaussianMixtureModel API should expose MultivariateGaussian instances 
 instead of the means and covariances.  This should be fixed as soon as 
 possible to stabilize the API.






[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2014-12-30 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261413#comment-14261413
 ] 

Travis Galoppo commented on SPARK-5012:
---

[~mengxr] I'd be happy to.

 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng

 Add Python API for the Scala implementation of GMM.






[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2014-12-30 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261931#comment-14261931
 ] 

Travis Galoppo commented on SPARK-5012:
---

[~mengxr] Can this be reassigned to [~MeethuMathew]?

I will focus efforts on other improvements to the implementation.

 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Travis Galoppo

 Add Python API for the Scala implementation of GMM.






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-11 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242494#comment-14242494
 ] 

Travis Galoppo commented on SPARK-4156:
---

[~MeethuMathew] This would be great! If possible, please issue a pull request 
against my repo and I will merge it in as soon as possible.


 Add expectation maximization for Gaussian mixture models to MLLib clustering
 

 Key: SPARK-4156
 URL: https://issues.apache.org/jira/browse/SPARK-4156
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Travis Galoppo
Assignee: Travis Galoppo

 As an additional clustering algorithm, implement expectation maximization for 
 Gaussian mixture models






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-03 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233107#comment-14233107
 ] 

Travis Galoppo commented on SPARK-4156:
---

I have modified the cluster initialization strategy to derive an initial 
covariance matrix from the sample points used to initialize the clusters; this 
initial covariance matrix has the element-wise variance of the sample points on 
the diagonal.  The final computed covariance matrix is not constrained to be 
diagonal.
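The initialization strategy described above can be sketched roughly as follows 
(an illustrative NumPy version, not the actual MLlib code; all names here are 
ours): each cluster mean is the average of a few sampled points, and the 
initial covariance is diagonal, holding the element-wise variance of those 
samples.

```python
import numpy as np

def init_gaussians(points, k, samples_per_cluster=5, seed=0):
    """Initialize k components: each mean is the average of a few randomly
    sampled points, and each initial covariance is diagonal, holding the
    element-wise variance of those samples. (Illustrative sketch only.)"""
    rng = np.random.default_rng(seed)
    weights = np.full(k, 1.0 / k)  # uniform initial mixture weights
    means, covs = [], []
    for _ in range(k):
        idx = rng.choice(len(points), size=samples_per_cluster, replace=False)
        sample = points[idx]
        means.append(sample.mean(axis=0))
        covs.append(np.diag(sample.var(axis=0)))  # diagonal only at init time
    return weights, np.array(means), np.array(covs)
```

The covariance is constrained to be diagonal only at initialization; later EM 
iterations are free to produce dense covariance matrices.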

I tested this with the S1 dataset [~MeethuMathew] referenced above; while it 
does fix the problem of effectively finding no clusters, I find that the 
results are still better when the input is scaled as I mentioned above.  It 
might be worthwhile to allow the user to provide a pre-initialized model to 
accommodate various initialization strategies, and provide the current 
functionality as a default. Thoughts?

Also, I have fixed the defect in DenseGmmEM whereby it was ignoring the delta 
parameter.


 Add expectation maximization for Gaussian mixture models to MLLib clustering
 

 Key: SPARK-4156
 URL: https://issues.apache.org/jira/browse/SPARK-4156
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Travis Galoppo
Assignee: Travis Galoppo

 As an additional clustering algorithm, implement expectation maximization for 
 Gaussian mixture models






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-02 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231474#comment-14231474
 ] 

Travis Galoppo commented on SPARK-4156:
---

Ok, I looked into this.  This is the result of using unit covariance matrices 
for initialization; specifically, the numbers in the input files are quite 
large, and [more importantly, I reckon] vary by relatively large amounts, thus 
the initial unit covariance matrices are poor choices, driving the 
probabilities to ~zero.
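A quick sketch of why this happens (plain NumPy, for illustration only): with 
a unit covariance, a point even a few hundred units from the mean has a 
log-density so negative that the density itself underflows to zero in double 
precision.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of a multivariate normal, computed directly from the formula."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# A point 500 units from the mean of a unit-covariance Gaussian:
logp = mvn_logpdf(np.array([500.0, 500.0]), np.zeros(2), np.eye(2))
print(logp)          # about -250002
print(np.exp(logp))  # 0.0 -- the density underflows
```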

I tested the S1 dataset after scaling the inputs down by a factor of 10^5, and 
the algorithm yielded:

w=0.018651 mu=[1.4005351951422986,5.560161272092209] sigma=
0.0047916181666818325  1.8492627979416199E-4
1.8492627979416199E-4  0.011135224999325288

w=0.070139 mu=[3.9826648305512444,4.048416241679408] sigma=
0.08975122201635877   0.011161215961635662
0.011161215961635662  0.07281211382882091

w=0.203390 mu=[4.50966114011736,8.335671907946685] sigma=
3.343575502968182    0.16780915524083184
0.16780915524083184  0.1983579752119624

w=0.061357 mu=[8.243819479262187,7.299054596484072] sigma=
0.059502423358168244  -0.01288330287962225
-0.01288330287962225  0.08306975793088611

w=0.068116 mu=[3.2082470765623987,1.6153321811600052] sigma=
0.13661341675065408    -0.004671801905049122
-0.004671801905049122  0.1184668732856653

w=0.015480 mu=[6.032605151728542,5.76477595221249] sigma=
0.006257088363533114  -0.01541684245322017
-0.01541684245322017  0.11177862390275095

w=0.069246 mu=[8.599898790732793,5.47222558625928] sigma=
0.08334577559917022    0.0025980740480378017
0.0025980740480378017  0.10560039597455884

w=0.066601 mu=[1.675642401646793,3.4768887461230293] sigma=
0.06718419616465754    -0.001992742042064677
-0.001992742042064677  0.08394612669156842

w=0.050884 mu=[1.4034421425114039,5.586799889184816] sigma=
0.18839808914440148    -0.017016991559440697
-0.017016991559440697  0.09967868623594711

w=0.067257 mu=[6.180341749904763,3.9855165348399026] sigma=
0.11162501735542207    0.0023201319648720187
0.0023201319648720187  0.09177325542363057

w=0.070096 mu=[5.078726203553804,1.756463619639961] sigma=
0.07852242299631484  0.03291628699789406
0.03291628699789406  0.08050080528055803

w=0.015951 mu=[5.989248184898113,5.729903049835485] sigma=
0.06204977226748554   0.008716828781302866
0.008716828781302866  0.003116768910125245

w=0.128860 mu=[8.274797410035061,2.390551639925522] sigma=
0.10976751308928101  -0.186908554330941
-0.186908554330941   0.7759289399492513

w=0.065259 mu=[3.3783618332560876,5.622632293334024] sigma=
0.10109765051996433  0.0320694359617697
0.0320694359617697   0.03873645329222697

w=0.028714 mu=[6.146091367146795,5.732902319554125] sigma=
0.2389354399409953    0.023579597914199724
0.023579597914199724  0.1377941370353355

Multiplying the MU values back by 10^5, they show pretty good fidelity to the 
truth values in s1-cb.txt provided on the source website for the dataset; 
unfortunately, I do not see the original weight and covariance values used to 
generate the data.

Of course it would be easier to use if the scaling step was not necessary; I 
can modify the cluster initialization to use a covariance estimated from a 
sample and see how it works out.  What strategy did you use for initializing 
clusters in your implementation?

cc: [~MeethuMathew]

 Add expectation maximization for Gaussian mixture models to MLLib clustering
 

 Key: SPARK-4156
 URL: https://issues.apache.org/jira/browse/SPARK-4156
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Travis Galoppo
Assignee: Travis Galoppo

 As an additional clustering algorithm, implement expectation maximization for 
 Gaussian mixture models






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-02 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231682#comment-14231682
 ] 

Travis Galoppo commented on SPARK-4156:
---

I do have a bug in the DenseGmmEM example code... the delta value is ignored, 
so all runs are using the default value of 0.01.  I will fix ASAP.


 Add expectation maximization for Gaussian mixture models to MLLib clustering
 

 Key: SPARK-4156
 URL: https://issues.apache.org/jira/browse/SPARK-4156
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Travis Galoppo
Assignee: Travis Galoppo

 As an additional clustering algorithm, implement expectation maximization for 
 Gaussian mixture models






[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering

2014-11-25 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224287#comment-14224287
 ] 

Travis Galoppo commented on SPARK-3588:
---

Sorry about the duplicate effort; I did a search prior to my PR, but somehow 
missed this ticket.  I will gladly coordinate to improve my submission.

cc: [~mengxr] [~MeethuM] 

 Gaussian Mixture Model clustering
 -

 Key: SPARK-3588
 URL: https://issues.apache.org/jira/browse/SPARK-3588
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Meethu Mathew
Assignee: Meethu Mathew
 Attachments: GMMSpark.py


 Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM 
 models the entire data set as a finite mixture of Gaussian distributions, each 
 parameterized by a mean vector µ, a covariance matrix ∑, and a mixture weight 
 π. In this technique, the probability of each point belonging to each cluster 
 is computed along with the cluster statistics.
 We have come up with an initial distributed implementation of GMM in PySpark 
 where the parameters are estimated using the Expectation-Maximization 
 algorithm. Our current implementation considers a diagonal covariance matrix 
 for each component.
 We did an initial benchmark study on a 2-node Spark standalone cluster, where 
 each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. 
 We also evaluated the Python version of k-means available in Spark on the 
 same datasets.
 Below are the results from this benchmark study. The reported stats are 
 averages from 10 runs. Tests were done on multiple datasets with varying 
 numbers of features and instances.
 ||Instances||Dimensions||GMM: avg time per iteration||GMM: time for 100 iterations||K-means (Python): avg time per iteration||K-means (Python): time for 100 iterations||
 |0.7 million|13|7s|12min|13s|26min|
 |1.8 million|11|17s|29min|33s|53min|
 |10 million|16|1.6min|2.7hr|1.2min|2hr|
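The diagonal-covariance E-step described above can be sketched as follows (an 
illustrative NumPy version under our own naming, not the attached GMMSpark.py); 
responsibilities are computed in log space for numerical stability.

```python
import numpy as np

def e_step(points, weights, means, variances):
    """One E-step of EM for a GMM with diagonal covariances (illustrative
    sketch; variances[j] holds component j's per-feature variances).
    Returns the n-by-k matrix of cluster membership probabilities."""
    diff2 = (points[:, None, :] - means[None, :, :]) ** 2   # shape (n, k, d)
    # log N(x | mu_j, diag(var_j)): sum of independent per-feature terms
    log_pdf = -0.5 * np.sum(np.log(2.0 * np.pi * variances)[None, :, :]
                            + diff2 / variances[None, :, :], axis=2)
    log_r = np.log(weights)[None, :] + log_pdf
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to 1 and gives the soft assignment of one 
point across the k components.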






[jira] [Created] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-10-30 Thread Travis Galoppo (JIRA)
Travis Galoppo created SPARK-4156:
-

 Summary: Add expectation maximization for Gaussian mixture models 
to MLLib clustering
 Key: SPARK-4156
 URL: https://issues.apache.org/jira/browse/SPARK-4156
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Travis Galoppo
Priority: Minor


As an additional clustering algorithm, implement expectation maximization for 
Gaussian mixture models






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-10-30 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190624#comment-14190624
 ] 

Travis Galoppo commented on SPARK-4156:
---

Pull request #3022 issued with changes implementing GMM EM.  As this is my 
first contribution, I look forward to discussion of how to be a better 
contributor.


 Add expectation maximization for Gaussian mixture models to MLLib clustering
 

 Key: SPARK-4156
 URL: https://issues.apache.org/jira/browse/SPARK-4156
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Travis Galoppo
Priority: Minor

 As an additional clustering algorithm, implement expectation maximization for 
 Gaussian mixture models


