[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224091#comment-14224091 ]
Meethu Mathew commented on SPARK-3588:
--------------------------------------

[~mengxr] We have completed the PySpark implementation, which is available at https://github.com/FlytxtRnD/GMM. We are now porting the code to Scala and plan to create a PR once the code and test cases are complete.

By "merging", do you mean merging the tickets or the implementations? Could you explain how the merge would be done? Would our work duplicate existing effort if we continue with our Scala implementation? Please suggest the next course of action.

> Gaussian Mixture Model clustering
> ---------------------------------
>
>                 Key: SPARK-3588
>                 URL: https://issues.apache.org/jira/browse/SPARK-3588
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib, PySpark
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>         Attachments: GMMSpark.py
>
>
> The Gaussian mixture model (GMM) is a popular technique for soft clustering.
> A GMM models the entire data set as a finite mixture of Gaussian
> distributions, each parameterized by a mean vector µ, a covariance matrix Σ
> and a mixture weight π. In this technique, the probability that each point
> belongs to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark
> in which the parameters are estimated using the Expectation-Maximization
> algorithm. Our current implementation assumes a diagonal covariance matrix
> for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster in
> which each node has 8 cores and 8 GB of RAM, running Spark 1.0.0.
> We also evaluated the Python version of k-means available in Spark on the
> same datasets.
> Below are the results from this benchmark study. The reported figures are
> averages over 10 runs. Tests were done on multiple datasets with varying
> numbers of features and instances.
>
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7 s | 12 min | 13 s | 26 min |
> | 1.8 million | 11 | 17 s | 29 min | 33 s | 53 min |
> | 10 million | 16 | 1.6 min | 2.7 hr | 1.2 min | 2 hr |

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
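
For readers unfamiliar with the approach described in the issue, the following is a minimal single-process sketch of Expectation-Maximization for a GMM with a diagonal covariance matrix per component. It is not the attached GMMSpark.py and makes no claim about how that code is organized. The function names (log_gauss_diag, e_step, combine, m_step, fit_gmm), the unit-variance initialization and the variance floor min_var are all assumptions made for illustration. The E-step is written as a per-point function whose results are summed with a reduce, which is one plausible way such a computation could be spread over the partitions of a distributed dataset.

```python
import math
import random
from functools import reduce


def log_gauss_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def e_step(x, weights, means, variances):
    """Per-point 'map': responsibilities turned into sufficient statistics."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    # log-sum-exp keeps the normalizer stable for points far from every mean
    norm = top + math.log(sum(math.exp(lg - top) for lg in logs))
    resp = [math.exp(lg - norm) for lg in logs]
    stats = [(r, [r * xi for xi in x], [r * xi * xi for xi in x]) for r in resp]
    return stats, norm  # norm is this point's log-likelihood


def combine(a, b):
    """'Reduce': add two partial results component by component."""
    stats = [(ra + rb,
              [p + q for p, q in zip(sa, sb)],
              [p + q for p, q in zip(qa, qb)])
             for (ra, sa, qa), (rb, sb, qb) in zip(a[0], b[0])]
    return stats, a[1] + b[1]


def m_step(stats, n, min_var=1e-6):
    """Re-estimate weights, means and diagonal variances from the sums."""
    weights, means, variances = [], [], []
    for r, s, q in stats:
        mean = [si / r for si in s]
        # E[x^2] - E[x]^2 per dimension; min_var is an assumed safety floor
        var = [max(qi / r - mi * mi, min_var) for qi, mi in zip(q, mean)]
        weights.append(r / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def fit_gmm(points, k, iterations=50, seed=0):
    """Run a fixed number of EM iterations; returns parameters and the
    log-likelihood measured at the start of each iteration."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]  # assumed initialization
    dim = len(points[0])
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    log_likelihoods = []
    for _ in range(iterations):
        stats, ll = reduce(
            combine, (e_step(p, weights, means, variances) for p in points))
        log_likelihoods.append(ll)
        weights, means, variances = m_step(stats, len(points))
    return weights, means, variances, log_likelihoods
```

Calling fit_gmm(points, k) on a list of equal-length numeric lists returns the mixture weights, the component means, the per-dimension variances and the log-likelihood trace. The per-point responsibilities computed in e_step are the soft cluster memberships that the issue description refers to.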