[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Meethu Mathew updated SPARK-3588:
---------------------------------
    Description: 
Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.

We have come up with an initial distributed implementation of GMM in PySpark, where the parameters are estimated using the Expectation-Maximization (EM) algorithm. Our current implementation assumes a diagonal covariance matrix for each component.

We did an initial benchmark study on a 2-node Spark standalone cluster, where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets.

Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.

|| Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
| 0.7 million | 13 | 7s | 12min | 13s | 26min |
| 1.8 million | 11 | 17s | 29min | 33s | 53min |
| 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |

  was:
Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.

We have come up with an initial distributed implementation of GMM in PySpark, where the parameters are estimated using the Expectation-Maximization (EM) algorithm. Our current implementation assumes a diagonal covariance matrix for each component.

We did an initial benchmark study on a 2-node Spark standalone cluster, where each node's config is (8 cores, 8 GB RAM) and the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets.

Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.

|| Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
| 0.7 million | 13 | 7s | 12min | 13s | 26min |
| 1.8 million | 11 | 17s | 29min | 33s | 53min |
| 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
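For readers unfamiliar with the technique: below is a minimal, single-machine NumPy sketch of one EM iteration for a GMM with diagonal covariance matrices, matching the setup described above. It is illustrative only; the function and variable names are assumptions and it is not taken from the patch referenced in this issue.

{code:python}
# Minimal single-machine sketch of one EM iteration for a k-component GMM
# with diagonal covariance matrices (NumPy only). Names and structure are
# illustrative assumptions, not the implementation attached to this issue.
import numpy as np

def em_step(X, weights, means, variances):
    """X: (n, d) data; weights: (k,) pi; means: (k, d) mu;
    variances: (k, d) diagonal of each covariance matrix."""
    n, d = X.shape
    k = weights.shape[0]

    # E-step: log(pi_j) + log N(x_i | mu_j, diag(var_j)) for every point/component.
    log_prob = np.empty((n, k))
    for j in range(k):
        diff = X - means[j]
        log_prob[:, j] = (np.log(weights[j])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[j]))
                          - 0.5 * np.sum(diff ** 2 / variances[j], axis=1))
    log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
    resp = np.exp(log_prob - log_norm)   # soft memberships (responsibilities)

    # M-step: re-estimate weights, means, and per-dimension variances.
    nk = resp.sum(axis=0)                          # effective count per component
    new_weights = nk / n
    new_means = (resp.T @ X) / nk[:, None]
    new_vars = (resp.T @ (X ** 2)) / nk[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, 1e-6)          # variance floor

    log_likelihood = log_norm.sum()                # for convergence checks
    return new_weights, new_means, new_vars, log_likelihood
{code}

In a distributed setting, the per-point responsibilities and the weighted sums needed by the M-step can be computed partition by partition and combined with a reduce/aggregate step, which is presumably how the PySpark implementation parallelizes each iteration.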
> Gaussian Mixture Model clustering
> ---------------------------------
>
>                 Key: SPARK-3588
>                 URL: https://issues.apache.org/jira/browse/SPARK-3588
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib, PySpark
>            Reporter: Meethu Mathew
>
> Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight π. In this technique, the probability of each point belonging to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark, where the parameters are estimated using the Expectation-Maximization (EM) algorithm. Our current implementation assumes a diagonal covariance matrix for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster, where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets.
> Below are the results from this benchmark study. The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org