[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224287#comment-14224287 ]
Travis Galoppo commented on SPARK-3588:
---------------------------------------

Sorry about the duplicate effort; I did a search prior to my PR, but somehow missed this ticket. I will gladly coordinate to improve my submission.

cc: [~mengxr] [~MeethuM]

> Gaussian Mixture Model clustering
> ---------------------------------
>
>                 Key: SPARK-3588
>                 URL: https://issues.apache.org/jira/browse/SPARK-3588
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib, PySpark
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>         Attachments: GMMSpark.py
>
>
> A Gaussian Mixture Model (GMM) is a popular technique for soft clustering. A GMM
> models the entire data set as a finite mixture of Gaussian distributions, each
> parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight
> π. In this technique, the probability that each point belongs to each cluster is
> computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark
> where the parameters are estimated using the Expectation-Maximization
> algorithm. Our current implementation assumes a diagonal covariance matrix for
> each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster
> where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0.
> We also evaluated the Python version of k-means available in Spark on the same
> datasets.
> Below are the results from this benchmark study. The reported stats are
> averages over 10 runs. Tests were done on multiple datasets with varying numbers
> of features and instances.
> || Dataset || Gaussian mixture model || Kmeans(Python) ||
> |Instances|Dimensions|Avg time per iteration|Time for 100 iterations|Avg time per iteration|Time for 100 iterations|
> |0.7 million|13|7 s|12 min|13 s|26 min|
> |1.8 million|11|17 s|29 min|33 s|53 min|
> |10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|
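
For readers unfamiliar with the diagonal-covariance case described in the issue, one EM iteration can be sketched in plain NumPy as below. This is a single-machine illustration only, not the attached GMMSpark.py and not the MLlib API; the function names `log_gaussian_diag` and `em_step` and the `var_floor` parameter are hypothetical and chosen for this example.

```python
import numpy as np


def log_gaussian_diag(X, means, variances):
    """Log density of every row of X under every diagonal Gaussian.

    X: (n, d) data, means: (k, d), variances: (k, d). Returns an (n, k) array.
    """
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                  # (n, k, d)
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (n, k)
    log_det = np.sum(np.log(variances), axis=1)               # (k,)
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det[None, :] + quad)


def em_step(X, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a GMM with diagonal covariances.

    Returns updated (weights, means, variances) and the log-likelihood of X
    under the parameters that were passed in. var_floor is an assumption of
    this sketch, used to keep variances strictly positive.
    """
    # E-step: responsibilities, computed in log space for numerical stability.
    log_p = np.log(weights)[None, :] + log_gaussian_diag(X, means, variances)
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # (n, 1)
    resp = np.exp(log_p - log_norm)                               # (n, k)

    # M-step: weighted sufficient statistics. This sketch assumes every
    # component keeps some responsibility mass, so nk is never zero.
    nk = resp.sum(axis=0)                                         # (k,)
    new_weights = nk / X.shape[0]
    new_means = (resp.T @ X) / nk[:, None]
    second_moment = (resp.T @ (X ** 2)) / nk[:, None]
    new_variances = np.maximum(second_moment - new_means ** 2, var_floor)
    return new_weights, new_means, new_variances, float(log_norm.sum())


if __name__ == "__main__":
    # Synthetic two-cluster data, purely for illustration.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-3.0, 1.0, (200, 2)),
                   rng.normal(3.0, 1.0, (200, 2))])
    w = np.array([0.5, 0.5])
    m = np.array([[-1.0, -1.0], [1.0, 1.0]])
    v = np.ones((2, 2))
    for _ in range(10):
        w, m, v, ll = em_step(X, w, m, v)
    print(w, m, v, ll)
```

The M-step depends on the data only through three sums over points (`nk`, `resp.T @ X`, and `resp.T @ (X ** 2)`), so a distributed version could presumably compute them per partition and add the partial results before updating the parameters.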