[ https://issues.apache.org/jira/browse/MAHOUT-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Eastman resolved MAHOUT-276. --------------------------------- Resolution: Fixed Fix Version/s: 0.2 r907842 made the above changes so I'm closing this issue > Alpha_0 mixture parameter is not implemented correctly in Dirichlet > ------------------------------------------------------------------- > > Key: MAHOUT-276 > URL: https://issues.apache.org/jira/browse/MAHOUT-276 > Project: Mahout > Issue Type: Bug > Components: Clustering > Affects Versions: 0.2 > Reporter: Jeff Eastman > Assignee: Jeff Eastman > Fix For: 0.2 > > > I looked over the R reference code and alpha_0 is used in two places, not one > as in the current implementation: > - in state initialization "beta = rbeta(K, 1, alpha_0)" [K is the number of > models] > - during state update "beta[k] = rbeta(1, 1 + counts[k], alpha_0 + > N-counts[k])" [N is the cardinality of the sample vector and counts > corresponds to totalCounts in the implementation] > The value of beta[k] is then used in the Dirichlet distribution calculation > which results in the mixture probabilities pi[i], for the iteration: > other = 1 # product accumulator > for (k in 1:K) { > pi[k] = beta[k] * other; # beta_k * prod_{n<k} beta_n > other = other * (1-beta[k]) > } > Alpha_0 determines the probability a point will go into an empty cluster, > mostly during the first iteration. During the first iteration, the total > counts of all prior clusters are zero. Thus the Beta calculation that drives > the Dirichlet distribution that determines the mixture probabilities > degenerates to beta = rBeta(1, alpha_0). Clusters that end up with points for > the next iteration will overwhelm the small constants (alpha_0, 1) and > subsequent new mixture probabilities will derive from beta ~= rBeta(count, > total) which is the current implementation. All empty clusters will > subsequently be driven by beta ~= rBeta(1, total) as alpha_0 is insignificant > and count is 0. > The current implementation ends up using beta = rBeta(alpha_0/k, alpha_0) as > initial values during all iterations because the counts are all initialized > to alpha_0/k. Close but no cigar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.