[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231474#comment-14231474
 ] 

Travis Galoppo commented on SPARK-4156:
---------------------------------------

Ok, I looked into this.  This is the result of using unit covariance matrices 
for initialization.  The numbers in the input files are quite large and (more 
importantly, I reckon) vary by relatively large amounts, so the initial unit 
covariance matrices are poor choices: they drive the probabilities to ~zero.
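
To illustrate the effect, here is a minimal Python sketch (not the MLlib 
code).  The point and mean below are hypothetical values, chosen only to be 
on the order of hundreds of thousands; the helper name unit_cov_density is 
also made up for this example.  It evaluates a Gaussian density with unit 
covariance before and after dividing the coordinates by 100000:

```python
import math

def unit_cov_density(x, mu):
    """Density of N(mu, I) at x, i.e. a Gaussian with unit covariance."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-0.5 * sq_dist) / (2.0 * math.pi) ** (d / 2.0)

# Hypothetical point and component mean on the raw scale (assumed values).
raw_x = [604328.0, 574379.0]
raw_mu = [417799.0, 787001.0]
scale = 100000.0

# On the raw scale the squared distance is ~8e10, so exp() underflows to 0.0.
raw_density = unit_cov_density(raw_x, raw_mu)

# After dividing by the scale the squared distance is ~8, so the density
# is small but representable.
scaled_density = unit_cov_density([v / scale for v in raw_x],
                                  [v / scale for v in raw_mu])

print(raw_density, scaled_density)
```

When every component assigns a density of exactly zero to a point, the 
responsibilities in the E-step cannot be normalized, which is consistent 
with the behavior described above.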

I tested the S1 dataset after scaling the inputs down by a factor of 100000 
(i.e. dividing each coordinate by 100000), and the algorithm yielded:

w=0.018651 mu=[1.4005351951422986,5.560161272092209] sigma=
0.0047916181666818325  1.8492627979416199E-4  
1.8492627979416199E-4  0.011135224999325288   

w=0.070139 mu=[3.9826648305512444,4.048416241679408] sigma=
0.08975122201635877   0.011161215961635662  
0.011161215961635662  0.07281211382882091   

w=0.203390 mu=[4.50966114011736,8.335671907946685] sigma=
3.343575502968182    0.16780915524083184  
0.16780915524083184  0.1983579752119624   

w=0.061357 mu=[8.243819479262187,7.299054596484072] sigma=
0.059502423358168244  -0.01288330287962225  
-0.01288330287962225  0.08306975793088611   

w=0.068116 mu=[3.2082470765623987,1.6153321811600052] sigma=
0.13661341675065408    -0.004671801905049122  
-0.004671801905049122  0.1184668732856653     

w=0.015480 mu=[6.032605151728542,5.76477595221249] sigma=
0.006257088363533114  -0.01541684245322017  
-0.01541684245322017  0.11177862390275095   

w=0.069246 mu=[8.599898790732793,5.47222558625928] sigma=
0.08334577559917022    0.0025980740480378017  
0.0025980740480378017  0.10560039597455884    

w=0.066601 mu=[1.675642401646793,3.4768887461230293] sigma=
0.06718419616465754    -0.001992742042064677  
-0.001992742042064677  0.08394612669156842    

w=0.050884 mu=[1.4034421425114039,5.586799889184816] sigma=
0.18839808914440148    -0.017016991559440697  
-0.017016991559440697  0.09967868623594711    

w=0.067257 mu=[6.180341749904763,3.9855165348399026] sigma=
0.11162501735542207    0.0023201319648720187  
0.0023201319648720187  0.09177325542363057    

w=0.070096 mu=[5.078726203553804,1.756463619639961] sigma=
0.07852242299631484  0.03291628699789406  
0.03291628699789406  0.08050080528055803  

w=0.015951 mu=[5.989248184898113,5.729903049835485] sigma=
0.06204977226748554   0.008716828781302866  
0.008716828781302866  0.003116768910125245  

w=0.128860 mu=[8.274797410035061,2.390551639925522] sigma=
0.10976751308928101  -0.186908554330941  
-0.186908554330941   0.7759289399492513  

w=0.065259 mu=[3.3783618332560876,5.622632293334024] sigma=
0.10109765051996433  0.0320694359617697   
0.0320694359617697   0.03873645329222697  

w=0.028714 mu=[6.146091367146795,5.732902319554125] sigma=
0.2389354399409953    0.023579597914199724  
0.023579597914199724  0.1377941370353355    

Multiplied back by 100000, the mu values show pretty good fidelity to the 
truth values in s1-cb.txt provided on the source website for the dataset. 
Unfortunately, I do not see the original weight and covariance values used to 
generate the data, so those cannot be checked the same way.

Of course it would be easier to use if the scaling step were not necessary. I 
can modify the cluster initialization to use a covariance estimated from a 
sample of the data and see how it works out.  What strategy did you use for 
initializing clusters in your implementation?

cc: [~MeethuMathew]

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-4156
>                 URL: https://issues.apache.org/jira/browse/SPARK-4156
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Travis Galoppo
>            Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for 
> Gaussian mixture models


