[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231474#comment-14231474 ]
Travis Galoppo commented on SPARK-4156:
---------------------------------------

Ok, I looked into this. The behavior is the result of using unit covariance matrices for initialization; specifically, the numbers in the input files are quite large and (more importantly, I reckon) vary by relatively large amounts, so the initial unit covariance matrices are poor choices, driving the probabilities to ~zero.

I tested the S1 dataset after scaling the inputs down by a factor of 100000, and the algorithm yielded:

w=0.018651 mu=[1.4005351951422986,5.560161272092209] sigma=
    0.0047916181666818325    1.8492627979416199E-4
    1.8492627979416199E-4    0.011135224999325288
w=0.070139 mu=[3.9826648305512444,4.048416241679408] sigma=
    0.08975122201635877      0.011161215961635662
    0.011161215961635662     0.07281211382882091
w=0.203390 mu=[4.50966114011736,8.335671907946685] sigma=
    3.343575502968182        0.16780915524083184
    0.16780915524083184      0.1983579752119624
w=0.061357 mu=[8.243819479262187,7.299054596484072] sigma=
    0.059502423358168244    -0.01288330287962225
   -0.01288330287962225      0.08306975793088611
w=0.068116 mu=[3.2082470765623987,1.6153321811600052] sigma=
    0.13661341675065408     -0.004671801905049122
   -0.004671801905049122     0.1184668732856653
w=0.015480 mu=[6.032605151728542,5.76477595221249] sigma=
    0.006257088363533114    -0.01541684245322017
   -0.01541684245322017      0.11177862390275095
w=0.069246 mu=[8.599898790732793,5.47222558625928] sigma=
    0.08334577559917022      0.0025980740480378017
    0.0025980740480378017    0.10560039597455884
w=0.066601 mu=[1.675642401646793,3.4768887461230293] sigma=
    0.06718419616465754     -0.001992742042064677
   -0.001992742042064677     0.08394612669156842
w=0.050884 mu=[1.4034421425114039,5.586799889184816] sigma=
    0.18839808914440148     -0.017016991559440697
   -0.017016991559440697     0.09967868623594711
w=0.067257 mu=[6.180341749904763,3.9855165348399026] sigma=
    0.11162501735542207      0.0023201319648720187
    0.0023201319648720187    0.09177325542363057
w=0.070096 mu=[5.078726203553804,1.756463619639961] sigma=
    0.07852242299631484      0.03291628699789406
    0.03291628699789406      0.08050080528055803
w=0.015951 mu=[5.989248184898113,5.729903049835485] sigma=
    0.06204977226748554      0.008716828781302866
    0.008716828781302866     0.003116768910125245
w=0.128860 mu=[8.274797410035061,2.390551639925522] sigma=
    0.10976751308928101     -0.186908554330941
   -0.186908554330941        0.7759289399492513
w=0.065259 mu=[3.3783618332560876,5.622632293334024] sigma=
    0.10109765051996433      0.0320694359617697
    0.0320694359617697       0.03873645329222697
w=0.028714 mu=[6.146091367146795,5.732902319554125] sigma=
    0.2389354399409953       0.023579597914199724
    0.023579597914199724     0.1377941370353355

Multiplying the mu values back by 100000, they show pretty good fidelity to the truth values in s1-cb.txt provided on the source website for the dataset; unfortunately, I do not see the original weight and covariance values used to generate the data. Of course, it would be easier to use if the scaling step were not necessary; I can modify the cluster initialization to use a covariance estimated from a sample and see how it works out (rough sketches of both the scaling workaround and the sample-based initialization follow below). What strategy did you use for initializing clusters in your implementation?
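For concreteness, a minimal sketch of the scaling workaround, assuming the public GaussianMixture API as it later shipped in MLlib (the class name was still in flux at the time of this comment) and assuming path points at a local, whitespace-separated copy of the S1 file; fitScaledS1 is a hypothetical helper name:

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Sketch: divide each coordinate by 1e5 before fitting, so the unit
// initial covariances are not hopelessly mismatched to the data scale.
def fitScaledS1(sc: SparkContext, path: String): Array[Array[Double]] = {
  val scale = 1e5 // S1 coordinates are on the order of 1e5..1e6 (assumption)
  val data = sc.textFile(path)
    .map(_.trim.split("\\s+").map(_.toDouble / scale))
    .map(arr => Vectors.dense(arr))
    .cache()
  val model = new GaussianMixture().setK(15).run(data)
  // Multiply the estimated means back up for comparison against the
  // ground-truth centers in s1-cb.txt.
  model.gaussians.map(g => g.mu.toArray.map(_ * scale))
}
{code}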
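And a rough sketch of the sample-based covariance initialization I have in mind, using Breeze (which MLlib already depends on); initCovFromSample is a hypothetical helper, and sharing a single sample covariance across all initial clusters is an assumption on my part, not the committed design:

{code:scala}
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: estimate one covariance matrix from a small random sample of
// the data and use it (rather than the identity) for every initial cluster.
def initCovFromSample(data: RDD[Vector],
                      sampleSize: Int = 100,
                      seed: Long = 42L): BDM[Double] = {
  val sample = data.takeSample(withReplacement = false, sampleSize, seed)
    .map(v => BDV(v.toArray))
  val n = sample.length.toDouble
  val mean = sample.reduce(_ + _) / n
  // sigma = (1/n) * sum_i (x_i - mean) * (x_i - mean)^T
  sample.map { x =>
    val c = x - mean
    c * c.t // outer product, yields a DenseMatrix
  }.reduce(_ + _) / n
}
{code}

Even just a diagonal matrix of per-dimension sample variances would probably be enough to keep the initial densities from underflowing.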
cc: [~MeethuMathew]

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-4156
>                 URL: https://issues.apache.org/jira/browse/SPARK-4156
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Travis Galoppo
>            Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for
> Gaussian mixture models