[jira] [Commented] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data
[ https://issues.apache.org/jira/browse/SPARK-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367028#comment-14367028 ]

Travis Galoppo commented on SPARK-6398:
---
Please assign to me.

> Improve utility of GaussianMixture for higher dimensional data
> -
>
> Key: SPARK-6398
> URL: https://issues.apache.org/jira/browse/SPARK-6398
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Travis Galoppo
>
> The current EM implementation for GaussianMixture protects itself from
> numerical instability at the expense of utility in high dimensions. A few
> options exist for extending utility into higher dimensions.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data
Travis Galoppo created SPARK-6398:
---
Summary: Improve utility of GaussianMixture for higher dimensional data
Key: SPARK-6398
URL: https://issues.apache.org/jira/browse/SPARK-6398
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Travis Galoppo

The current EM implementation for GaussianMixture protects itself from numerical instability at the expense of utility in high dimensions. A few options exist for extending utility into higher dimensions.
[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328868#comment-14328868 ]

Travis Galoppo edited comment on SPARK-5016 at 2/20/15 12:33 PM:
---
[~MechCoder] This falls directly from the (2*pi)^(-k/2) term in the pdf...

(2*pi)^(-k/2) < eps
log_{2*pi} (2*pi)^(-k/2) < log_{2*pi} eps
-k/2 < log(eps) / log(2*pi)
k > -2 * log(eps) / log(2*pi)

was (Author: tgaloppo):
[~MechCoder] This falls directly from the (2*pi)^(-k/2) term in the pdf...

(2*pi)^(-k/2) < eps
log_{2*pi} (2*pi)^(-k/2) < log_{2*pi} eps
-k/2 < log(eps) / log(2*pi)
k > -2 * log(eps) * log(2*pi)

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> -
>
> Key: SPARK-5016
> URL: https://issues.apache.org/jira/browse/SPARK-5016
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse
> computation for Gaussian initialization.
[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328868#comment-14328868 ]

Travis Galoppo edited comment on SPARK-5016 at 2/20/15 12:32 PM:
---
[~MechCoder] This falls directly from the (2*pi)^(-k/2) term in the pdf...

(2*pi)^(-k/2) < eps
log_{2*pi} (2*pi)^(-k/2) < log_{2*pi} eps
-k/2 < log(eps) / log(2*pi)
k > -2 * log(eps) * log(2*pi)

was (Author: tgaloppo):
@mechcoder This falls directly from the (2*pi)^(-k/2) term in the pdf...

(2*pi)^(-k/2) < eps
log_{2*pi} (2*pi)^(-k/2) < log_{2*pi} eps
-k/2 < log(eps) / log(2*pi)
k > -2 * log(eps) * log(2*pi)
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328868#comment-14328868 ]

Travis Galoppo commented on SPARK-5016:
---
@mechcoder This falls directly from the (2*pi)^(-k/2) term in the pdf...

(2*pi)^(-k/2) < eps
log_{2*pi} (2*pi)^(-k/2) < log_{2*pi} eps
-k/2 < log(eps) / log(2*pi)
k > -2 * log(eps) * log(2*pi)
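The derivation in the comments above can be sanity-checked numerically. A minimal Python sketch (the eps value is assumed to be IEEE double machine epsilon, which these comments appear to use; this is an editor's illustration, not Spark code):

```python
import math

eps = 2.220446049250313e-16  # IEEE double machine epsilon (assumed)

# Threshold from the last line of the derivation: k > -2*log(eps)/log(2*pi)
k_threshold = -2 * math.log(eps) / math.log(2 * math.pi)  # roughly 39.2

# Just below the threshold, the (2*pi)^(-k/2) prefactor still exceeds eps;
# just above it, the prefactor underflows eps.
assert (2 * math.pi) ** (-math.floor(k_threshold) / 2) >= eps
assert (2 * math.pi) ** (-math.ceil(k_threshold) / 2) < eps
```

Note that the corrected final line of the edited comment (division by log(2*pi), not multiplication) is what this check verifies.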
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326609#comment-14326609 ]

Travis Galoppo commented on SPARK-5016:
---
[~josephkb] Let me see what I can find; I have seen a lot of papers around the issue of high-dimensional clustering when the number of samples is relatively small (so the solutions revolve around regularization, dimensionality reduction, etc.)... I think here we can assume the user has a copious amount of data (why else use Spark!?)... I'll see what I can find.
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326229#comment-14326229 ]

Travis Galoppo commented on SPARK-5016:
---
[~josephkb] My previous comment got me thinking about how to make the algorithm usable in higher dimensions... the underflow problem is caused by the addition of EPSILON to every likelihood value computed; this is done to avoid some numerical gotchas... but EPSILON is determined such that 1.0 + (EPSILON / 2) == 1.0, which dominates the densities as dimension increases. We could derive a smaller epsilon value based on the maximum density that we expect to see, e.g., such that x + (EPSILON / 2) == x, where x = (2*pi)^(-k/2) ... this, of course, is somewhat simplified because it "assumes" the covariance matrix has determinant of 1, but it would lead to a lower epsilon value and likely extend the utility of the algorithm into higher dimensions ... and likely make this ticket more relevant.
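A minimal Python sketch of the scaled-epsilon idea above (the helper name `scaled_eps` is hypothetical, not Spark's API; like the comment, it assumes a unit-determinant covariance):

```python
import math

MACHINE_EPS = 2.220446049250313e-16  # assumed: 1.0 + MACHINE_EPS/2 == 1.0

def scaled_eps(num_features: int) -> float:
    # Hypothetical helper: epsilon chosen so that x + scaled_eps(k)/2 == x
    # at the peak density x = (2*pi)^(-k/2) of a unit-determinant Gaussian.
    return MACHINE_EPS * (2 * math.pi) ** (-num_features / 2)

# The scaled value shrinks with dimension, so it no longer swamps the
# (tiny) true density values in high dimensions.
assert scaled_eps(100) < scaled_eps(10) < MACHINE_EPS
```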
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323412#comment-14323412 ]

Travis Galoppo commented on SPARK-5016:
---
Realistically, I think it will be very difficult to realize any performance increase from this modification. In particular, the algorithm simply will not work well in high enough dimension to make it worthwhile (from the numFeatures perspective, anyway) ... consider that the density of a Multivariate Gaussian will underflow EPSILON *at the mean* when numFeatures > -2 * log(EPSILON) / log(2*pi) ... this means 40 features will underflow 2.2204e-16 (eps in Octave on my laptop), and 131 features would underflow 1e-52; as the pdf approaches EPS, it will assign points uniformly to all clusters... so it breaks. These are not particularly large matrices ... I'm guessing the SVD time is too small to make the extra communication worthwhile. At a minimum, I would suggest some solid benchmarking to make sure this is a real improvement.
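The two figures quoted in the comment above (40 features for 2.2204e-16, 131 features for 1e-52) can be reproduced with a quick Python check (the function name is illustrative):

```python
import math

def underflow_dim(eps: float) -> float:
    # Dimension at which (2*pi)^(-k/2) drops below eps at the mean
    # (unit-determinant covariance), from k > -2*log(eps)/log(2*pi).
    return -2 * math.log(eps) / math.log(2 * math.pi)

assert 39 < underflow_dim(2.2204e-16) < 40  # so 40 features underflow
assert 130 < underflow_dim(1e-52) < 131     # so 131 features underflow
```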
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321554#comment-14321554 ]

Travis Galoppo commented on SPARK-5016:
---
Right, unless each reducer is computing the likelihood for all clusters (just to update a single cluster)... essentially doing k times as much work as is currently done... which brings me back to my feeling of awkwardness.
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321538#comment-14321538 ]

Travis Galoppo commented on SPARK-5016:
---
@mechcoder I may well be missing something simple here... but the sums for each cluster are not independent... you need the sums of the likelihoods from each to compute the partial assignments (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L217)... so it seems to me there would be an additional communication step involved in this. Again, I may be missing something simple.
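The coupling described above — each point's soft assignment needs the likelihood sum across all k clusters — can be sketched in a few lines of Python (an editor's illustration, not Spark code):

```python
def responsibilities(weighted_likelihoods):
    # weighted_likelihoods[j] = w_j * pdf_j(x) for one data point.
    # The denominator needs every cluster's value, which is why the
    # per-cluster sums cannot be computed fully independently.
    total = sum(weighted_likelihoods)
    return [p / total for p in weighted_likelihoods]

r = responsibilities([0.2, 0.6, 0.2])
assert abs(sum(r) - 1.0) < 1e-12  # soft assignments sum to one
```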
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321099#comment-14321099 ]

Travis Galoppo commented on SPARK-5016:
---
Hmm. I'm having trouble conceptualizing how to use aggregateByKey here; the breezeData RDD is not keyed. We could have a keyed RDD of expectation sums (with a little rework), but each entry in the breezeData RDD would need to be operated on by each reducer (which seems awkward?)... or I'm way off?
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312269#comment-14312269 ]

Travis Galoppo commented on SPARK-5016:
---
The k Gaussians are updated with code that right now looks like

{code}
var i = 0
while (i < k) {
  val mu = sums.means(i) / sums.weights(i)
  BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
    Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
  weights(i) = sums.weights(i) / sumWeights
  gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
  i = i + 1
}
{code}

... the matrix inversion (or, in reality, partial inversion... the inverse is not explicitly calculated) occurs during the creation of the MultivariateGaussian objects... this code could be parallelized something like:

{code}
val (ws, gs) = sc.parallelize(0 until k).map { i =>
  val mu = sums.means(i) / sums.weights(i)
  BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
    Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
  val weight = sums.weights(i) / sumWeights
  val gaussian = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
  (weight, gaussian)
}.collect.unzip

(0 until k).foreach { i =>
  weights(i) = ws(i)
  gaussians(i) = gs(i)
}
{code}

... effectively distributing the computation of the k MultivariateGaussians (and their weights). As for the threshold values for k / numFeatures... this is probably a function of cluster size and interconnect speed. These thresholds should probably be optional parameters to GaussianMixture. Personally, I would vote for the default behavior to not perform this parallelization, and let the user decide when the time is right to allow it.
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307196#comment-14307196 ]

Travis Galoppo commented on SPARK-5021:
---
[~MechCoder] It is probably better to get something working, submit a PR (perhaps mark it [WIP]) and work out the kinks in the review process.

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors. It would
> be nice if it were faster for SparseVectors (running in time linear in the
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done
> in high dimensions.
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305775#comment-14305775 ]

Travis Galoppo edited comment on SPARK-5021 at 2/4/15 7:23 PM:
---
For the vectorMean function, the resulting vector may well be considerably more dense than the input vectors (it is called only once, with a set of random vectors); however, the computed means may become more sparse with each iteration if the clusters are represented through density in different regions of the input vector. Although this does have me thinking... since the assignments are soft, it is likely that very few vector entries will become zero... I'm not sure what the tolerance is for zero entries, but the soft nature of the assignments may undermine the performance benefit of working with sparse vectors.

was (Author: tgaloppo):
For the vectorMean function, the resulting vector may well be considerably more dense than the input vectors; however, the computed means may become more sparse with each iteration if the clusters are represented through density in different regions of the input vector. Although this does have me thinking... since the assignments are soft, it is likely that very few vector entries will become zero... I'm not sure what the tolerance is for zero entries, but the soft nature of the assignments may undermine the performance benefit of working with sparse vectors.
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305775#comment-14305775 ]

Travis Galoppo commented on SPARK-5021:
---
For the vectorMean function, the resulting vector may well be considerably more dense than the input vectors; however, the computed means may become more sparse with each iteration if the clusters are represented through density in different regions of the input vector. Although this does have me thinking... since the assignments are soft, it is likely that very few vector entries will become zero... I'm not sure what the tolerance is for zero entries, but the soft nature of the assignments may undermine the performance benefit of working with sparse vectors.
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305658#comment-14305658 ]

Travis Galoppo commented on SPARK-5021:
---
Why not something like:

{code}
private def vectorMean(x: IndexedSeq[BV[Double]]): BV[Double] = {
  val v = x(0) match {
    case _: BSV[Double] => BSV.zeros[Double](x(0).length)
    case _: BDV[Double] => BDV.zeros[Double](x(0).length)
  }
  x.foreach(xi => v += xi)
  v / x.length.toDouble
}
{code}

...where BV, BSV, BDV are breeze vector, sparse vector, and dense vector, respectively...
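The densification concern raised in the surrounding comments — the mean of sparse vectors can have much larger support than any input — can be illustrated with plain Python dicts standing in for sparse vectors (a sketch by the editor; not the Breeze implementation):

```python
def sparse_mean(vectors):
    # Each vector is {index: value}; the mean's support is the union of
    # the inputs' supports, so it can be much denser than any one input.
    acc = {}
    for v in vectors:
        for i, val in v.items():
            acc[i] = acc.get(i, 0.0) + val
    return {i: s / len(vectors) for i, s in acc.items()}

# Two vectors with one nonzero each, on disjoint supports:
m = sparse_mean([{0: 1.0}, {1: 1.0}])
assert len(m) == 2  # the mean has twice as many nonzeros as either input
```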
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305171#comment-14305171 ]

Travis Galoppo commented on SPARK-5021:
---
[~MechCoder] You may be making things harder on yourself than necessary. The current code maps the incoming vectors to dense breeze vectors, but you can simply map them to generic breeze vectors... ie (GaussianMixture.scala: line 126)

{code}
val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
// becomes
val breezeData = data.map(_.toBreeze).cache()
{code}

then genericize everything expecting a dense breeze vector/matrix to expect just a generic vector/matrix... when the time finally arrives where the cases must be separated, you can match on the variable, ie:

{code}
def foo(x: BreezeVector) = {
  x match {
    case dx: DenseBreezeVector => // do dense vector calculation
    case sx: SparseBreezeVector => // do sparse vector calculation
  }
}
{code}

... I know this is kind of high level... but it could avoid a lot of dual-path code.
[jira] [Commented] (SPARK-5013) User guide for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303522#comment-14303522 ]

Travis Galoppo commented on SPARK-5013:
---
Great! I will submit a PR soon.

> User guide for Gaussian Mixture Model
> -
>
> Key: SPARK-5013
> URL: https://issues.apache.org/jira/browse/SPARK-5013
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, MLlib
> Reporter: Xiangrui Meng
> Assignee: Travis Galoppo
>
> Add GMM user guide with code examples in Scala/Java/Python.
[jira] [Commented] (SPARK-5013) User guide for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299907#comment-14299907 ]

Travis Galoppo commented on SPARK-5013:
---
Does this amount to adding a description and code examples to docs/mllib-clustering.md ? Please assign to me and I will get started on this. I can finalize when the python API is merged.
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297634#comment-14297634 ]

Travis Galoppo commented on SPARK-5021:
---
[~josephkb] This ticket is marked as affecting version 1.2.0 ... should this be 1.3.0?
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297622#comment-14297622 ]

Travis Galoppo commented on SPARK-5021:
---
[~MechCoder] The documentation for GMM is not yet completed (see SPARK-5013) ... the python interface is still being completed (SPARK-5012) and then the documentation can be completed. In the meantime, I might be able to answer your questions around the GMM code...
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297613#comment-14297613 ]

Travis Galoppo commented on SPARK-5400:
---
Please assign to me and I will make the name change.

> Rename GaussianMixtureEM to GaussianMixture
> -
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the
> optimization algorithm name in the class title. We should probably rename it
> to GaussianMixture so that it can use other optimization algorithms in the
> future.
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290971#comment-14290971 ]

Travis Galoppo commented on SPARK-5400:
---
Hmm. This has me thinking in a different direction. We could generalize the expectation-maximization algorithm to work with any mixture model supporting a set of necessary likelihood compute/update methods... then we could ask for, e.g., "new ExpectationMaximization[GaussianMixtureModel]". This would de-couple the model and the algorithm, and could open the door for the implementation to be applied to (for instance) tomographic image reconstruction (which seems like a great fit for Spark given the volume of data involved).
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284704#comment-14284704 ]

Travis Galoppo commented on SPARK-5012:
---
[~MeethuMathew] SPARK-5019 has been completed.

> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Xiangrui Meng
> Assignee: Meethu Mathew
>
> Add Python API for the Scala implementation of GMM.
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281543#comment-14281543 ] Travis Galoppo commented on SPARK-5019: --- This ticket is currently stalling SPARK-5012. > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278947#comment-14278947 ] Travis Galoppo commented on SPARK-5012: --- This will probably be affected by SPARK-5019 > Python API for Gaussian Mixture Model > - > > Key: SPARK-5012 > URL: https://issues.apache.org/jira/browse/SPARK-5012 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Meethu Mathew > > Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276943#comment-14276943 ] Travis Galoppo commented on SPARK-5019: --- I have a patch prepared for this; it is generally the same as [~lewuathe]'s patch, but takes into account recent changes with MultivariateGaussian and completely removes the mu/sigma parameters from GaussianMixtureModel (with code updates reflecting such in GaussianMixtureModelEM and the test suite). > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273565#comment-14273565 ] Travis Galoppo commented on SPARK-5019: --- [~lewuathe] Are you still interested in working on this ticket? SPARK-5018 is now complete. > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267061#comment-14267061 ] Travis Galoppo edited comment on SPARK-5019 at 1/7/15 12:24 AM: No problem, [~lewuathe] ... I have just started work on SPARK-5018. If you would like to re-visit this ticket once that one is complete, that would be great! was (Author: tgaloppo): No problem,@lewuathe ... I have just started work on SPARK-5018. If you would like to re-visit this ticket once that one is complete, that would be great! > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267061#comment-14267061 ] Travis Galoppo commented on SPARK-5019: --- No problem, @lewuathe ... I have just started work on SPARK-5018. If you would like to re-visit this ticket once that one is complete, that would be great! > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266909#comment-14266909 ] Travis Galoppo commented on SPARK-5019: --- This really can't be completed until MultivariateGaussian is made public (SPARK-5018). > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266907#comment-14266907 ] Travis Galoppo commented on SPARK-5018: --- Please assign this ticket to me. > Make MultivariateGaussian public > > > Key: SPARK-5018 > URL: https://issues.apache.org/jira/browse/SPARK-5018 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Critical > > MultivariateGaussian is currently private[ml], but it would be a useful > public class. This JIRA will require defining a good public API for > distributions. > This JIRA will be needed for finalizing the GaussianMixtureModel API, which > should expose MultivariateGaussian instances instead of the means and > covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261931#comment-14261931 ] Travis Galoppo commented on SPARK-5012: --- [~mengxr] Can this be reassigned to [~MeethuMathew]? I will focus efforts on other improvements to the implementation. > Python API for Gaussian Mixture Model > - > > Key: SPARK-5012 > URL: https://issues.apache.org/jira/browse/SPARK-5012 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Travis Galoppo > > Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261413#comment-14261413 ] Travis Galoppo commented on SPARK-5012: --- [~mengxr] I'd be happy to. > Python API for Gaussian Mixture Model > - > > Key: SPARK-5012 > URL: https://issues.apache.org/jira/browse/SPARK-5012 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng > > Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242494#comment-14242494 ] Travis Galoppo commented on SPARK-4156: --- [~MeethuMathew] This would be great! If possible, please issue a pull request against my repo and I will merge it in as soon as possible. > Add expectation maximization for Gaussian mixture models to MLLib clustering > > > Key: SPARK-4156 > URL: https://issues.apache.org/jira/browse/SPARK-4156 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Travis Galoppo >Assignee: Travis Galoppo > > As an additional clustering algorithm, implement expectation maximization for > Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233107#comment-14233107 ] Travis Galoppo commented on SPARK-4156: --- I have modified the cluster initialization strategy to derive an initial covariance matrix from the sample points used to initialize the clusters; this initial covariance matrix has the element-wise variance of the sample points on the diagonal. The final computed covariance matrix is not constrained to be diagonal. I tested this with the S1 dataset [~MeethuMathew] referenced above; while it does "fix" the problem of effectively finding no clusters, I find that the results are still better when the input is scaled as I mentioned above. It might be worthwhile to allow the user to provide a pre-initialized model to accommodate various initialization strategies, and provide the current functionality as a default. Thoughts? Also, I have fixed the defect in DenseGmmEM whereby it was ignoring the delta parameter. > Add expectation maximization for Gaussian mixture models to MLLib clustering > > > Key: SPARK-4156 > URL: https://issues.apache.org/jira/browse/SPARK-4156 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Travis Galoppo >Assignee: Travis Galoppo > > As an additional clustering algorithm, implement expectation maximization for > Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
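The initialization change described in this comment — seeding each cluster's covariance with the element-wise variance of its sample points on the diagonal — could look roughly like the following sketch (Python/NumPy with a hypothetical `init_clusters` helper; the actual MLlib code differs in detail):

```python
import numpy as np

def init_clusters(data, k, samples_per_cluster=5, seed=0):
    """Seed k clusters from random samples: mean = sample mean, covariance =
    diagonal matrix of the element-wise sample variance. Hypothetical helper,
    not the MLlib implementation."""
    rng = np.random.default_rng(seed)
    weights, means, covs = [], [], []
    for _ in range(k):
        idx = rng.choice(len(data), samples_per_cluster, replace=False)
        sample = data[idx]
        weights.append(1.0 / k)                   # uniform initial weights
        means.append(sample.mean(axis=0))
        covs.append(np.diag(sample.var(axis=0)))  # diagonal, at the data's scale
    return np.array(weights), np.array(means), np.array(covs)
```

Because the seed covariance reflects the data's scale, the initial densities stay representable even when features vary by large amounts; as the comment notes, the EM iterations remain free to produce full (non-diagonal) covariances afterward.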
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231682#comment-14231682 ] Travis Galoppo commented on SPARK-4156: --- I do have a bug in the DenseGmmEM example code... the delta value is ignored, so all runs are using the default value of 0.01. I will fix ASAP. > Add expectation maximization for Gaussian mixture models to MLLib clustering > > > Key: SPARK-4156 > URL: https://issues.apache.org/jira/browse/SPARK-4156 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Travis Galoppo >Assignee: Travis Galoppo > > As an additional clustering algorithm, implement expectation maximization for > Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231474#comment-14231474 ] Travis Galoppo commented on SPARK-4156: --- Ok, I looked into this. This is the result of using unit covariance matrices for initialization; specifically, the numbers in the input files are quite large, and [more importantly, I reckon] vary by relatively large amounts, thus the initial unit covariance matrices are poor choices, driving the probabilities to ~zero. I tested the S1 dataset after scaling the inputs by 10, and the algorithm yielded:
w=0.018651 mu=[1.4005351951422986,5.560161272092209] sigma=
  0.0047916181666818325   1.8492627979416199E-4
  1.8492627979416199E-4   0.011135224999325288
w=0.070139 mu=[3.9826648305512444,4.048416241679408] sigma=
  0.08975122201635877     0.011161215961635662
  0.011161215961635662    0.07281211382882091
w=0.203390 mu=[4.50966114011736,8.335671907946685] sigma=
  3.343575502968182       0.16780915524083184
  0.16780915524083184     0.1983579752119624
w=0.061357 mu=[8.243819479262187,7.299054596484072] sigma=
  0.059502423358168244    -0.01288330287962225
  -0.01288330287962225    0.08306975793088611
w=0.068116 mu=[3.2082470765623987,1.6153321811600052] sigma=
  0.13661341675065408     -0.004671801905049122
  -0.004671801905049122   0.1184668732856653
w=0.015480 mu=[6.032605151728542,5.76477595221249] sigma=
  0.006257088363533114    -0.01541684245322017
  -0.01541684245322017    0.11177862390275095
w=0.069246 mu=[8.599898790732793,5.47222558625928] sigma=
  0.08334577559917022     0.0025980740480378017
  0.0025980740480378017   0.10560039597455884
w=0.066601 mu=[1.675642401646793,3.4768887461230293] sigma=
  0.06718419616465754     -0.001992742042064677
  -0.001992742042064677   0.08394612669156842
w=0.050884 mu=[1.4034421425114039,5.586799889184816] sigma=
  0.18839808914440148     -0.017016991559440697
  -0.017016991559440697   0.09967868623594711
w=0.067257 mu=[6.180341749904763,3.9855165348399026] sigma=
  0.11162501735542207     0.0023201319648720187
  0.0023201319648720187   0.09177325542363057
w=0.070096 mu=[5.078726203553804,1.756463619639961] sigma=
  0.07852242299631484     0.03291628699789406
  0.03291628699789406     0.08050080528055803
w=0.015951 mu=[5.989248184898113,5.729903049835485] sigma=
  0.06204977226748554     0.008716828781302866
  0.008716828781302866    0.003116768910125245
w=0.128860 mu=[8.274797410035061,2.390551639925522] sigma=
  0.10976751308928101     -0.186908554330941
  -0.186908554330941      0.7759289399492513
w=0.065259 mu=[3.3783618332560876,5.622632293334024] sigma=
  0.10109765051996433     0.0320694359617697
  0.0320694359617697      0.03873645329222697
w=0.028714 mu=[6.146091367146795,5.732902319554125] sigma=
  0.2389354399409953      0.023579597914199724
  0.023579597914199724    0.1377941370353355
Multiplying the MU values back by 10, they show pretty good fidelity to the truth values in s1-cb.txt provided on the source website for the dataset; unfortunately, I do not see the original weight and covariance values used to generate the data. Of course it would be easier to use if the scaling step was not necessary; I can modify the cluster initialization to use a covariance estimated from a sample and see how it works out. What strategy did you use for initializing clusters in your implementation? cc: [~MeethuMathew] > Add expectation maximization for Gaussian mixture models to MLLib clustering > > > Key: SPARK-4156 > URL: https://issues.apache.org/jira/browse/SPARK-4156 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Travis Galoppo >Assignee: Travis Galoppo > > As an additional clustering algorithm, implement expectation maximization for > Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
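The failure mode diagnosed above — unit initial covariances combined with large-magnitude, widely varying inputs driving every density to effectively zero — is easy to reproduce, since an identity-covariance Gaussian density decays like exp(-r²/2) in the distance r from the mean (a NumPy sketch, not Spark code):

```python
import numpy as np

def unit_gaussian_pdf(x, mu):
    """Multivariate normal density with identity covariance."""
    d = len(mu)
    r2 = np.sum((x - mu) ** 2)
    return np.exp(-0.5 * r2) / (2.0 * np.pi) ** (d / 2.0)

mu = np.zeros(2)
print(unit_gaussian_pdf(np.array([3.0, 4.0]), mu))      # ~5.9e-07: still representable
print(unit_gaussian_pdf(np.array([300.0, 400.0]), mu))  # 0.0: underflows in float64
# When raw coordinates put every point this far (or farther) from every
# unit-covariance initial cluster, all cluster probabilities collapse to ~zero;
# rescaling the inputs, or seeding the covariance from the data's variance,
# keeps the densities representable.
```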
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224287#comment-14224287 ] Travis Galoppo commented on SPARK-3588: --- Sorry about the duplicate effort; I did a search prior to my PR, but somehow missed this ticket. I will gladly coordinate to improve my submission. cc: [~mengxr] [~MeethuM] > Gaussian Mixture Model clustering > - > > Key: SPARK-3588 > URL: https://issues.apache.org/jira/browse/SPARK-3588 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Meethu Mathew >Assignee: Meethu Mathew > Attachments: GMMSpark.py > > > Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM > models the entire data set as a finite mixture of Gaussian distributions, each > parameterized by a mean vector µ, a covariance matrix ∑, and a mixture weight > π. In this technique, the probability of each point belonging to each cluster is > computed along with the cluster statistics. > We have come up with an initial distributed implementation of GMM in pyspark > where the parameters are estimated using the Expectation-Maximization > algorithm. Our current implementation considers a diagonal covariance matrix for > each component. > We did an initial benchmark study on a 2 node Spark standalone cluster setup > where each node config is 8 cores, 8 GB RAM; the Spark version used is 1.0.0. > We also evaluated the Python version of k-means available in Spark on the same > datasets. > Below are the results from this benchmark study. The reported stats are > averages from 10 runs. Tests were done on multiple datasets with varying number > of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
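The diagonal-covariance restriction in the description above makes each component's density factorize across features, so the soft cluster memberships ("probability of each point to belong to each cluster") need no matrix inverse or determinant; computing them in log space also sidesteps the underflow discussed in SPARK-4156. A sketch of that computation (illustrative NumPy, not the attached GMMSpark.py):

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """Soft cluster memberships for a diagonal-covariance GMM.
    Log densities plus log-sum-exp normalization avoid 0/0 for distant points.
    x: (n, d); weights: (k,); means: (k, d); variances: (k, d) diagonals."""
    diff = x[:, None, :] - means[None, :, :]                  # (n, k, d)
    log_p = (np.log(weights)
             - 0.5 * (np.log(2.0 * np.pi * variances).sum(-1)
                      + (diff ** 2 / variances).sum(-1)))     # (n, k) joint log prob
    # normalize per point: exp(log p - logsumexp(log p)) sums to 1 across k
    return np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
```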
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190624#comment-14190624 ] Travis Galoppo commented on SPARK-4156: --- Pull request #3022 issued with changes implementing GMM EM. As this is my first contribution, I look forward to discussion of how to be a better contributor. > Add expectation maximization for Gaussian mixture models to MLLib clustering > > > Key: SPARK-4156 > URL: https://issues.apache.org/jira/browse/SPARK-4156 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Travis Galoppo >Priority: Minor > > As an additional clustering algorithm, implement expectation maximization for > Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
Travis Galoppo created SPARK-4156: - Summary: Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Priority: Minor As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org