[ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586287#action_12586287 ]
Isabel Drost commented on MAHOUT-4: ----------------------------------- Adding the comments sent to the list here as well for further reference. > ---------------------------- > So here is a short write-up in my words, please feel free to > fill any gaps/errors found I will try to do so from my perspective, maybe others can add their views. > Expectation Maximization for clustering > ----------------------------------------------------- > Let > z = unobserved data, clusters in our case. > y = observed data, points in our case. > > p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1 > p(z1) + p(z2) = 1 Looks correct to me. > E-Step. > ------ > M-Step > ------ I could not find an error in neither of the two steps so far. > Questions > ========= > 1. When and how do we re-compute the cluster centers ? EM does not work with explicit cluster centers. In kmeans you iterate two steps: Assigning points to centers and recomputing the centers. In EM you again iterate two steps: Computing the probabilities for each point belonging to the clusters (so you do not assign them hard to one cluster, you only say with probability P it belongs to clusters i to k), in the second step you recompute the parameters of each cluster - the cluster center is influenced by each point but only weighted by its probability of belonging to this cluster. > 2. As per my understanding points and clusters are simply labels with some > conditional probability assigned to them. A distance metric like one > used in K-means is nowhere involved, is that correct ? Yes and no: Technically no, conceptually, your computation for the probability of assigning a point to a cluster should be based on the point's distance to the cluster. I hope I did not cause more confusion than helping you. Maybe others can correct me or clarify what I left unclear... Isabel > Simple prototype for Expectation Maximization (EM) > -------------------------------------------------- > > Key: MAHOUT-4 > URL: https://issues.apache.org/jira/browse/MAHOUT-4 > Project: Mahout > Issue Type: New Feature > Reporter: Ankur > Attachments: Mahout_EM.patch > > > Create a simple prototype implementing Expectation Maximization - EM that > demonstrates the algorithm functionality given a set of (user, click-url) > data. > The prototype should be functionally complete and should serve as a basis for > the Map-Reduce version of the EM algorithm. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.