[ https://issues.apache.org/jira/browse/MAHOUT-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830740#action_12830740 ]
Jeff Eastman commented on MAHOUT-276:
-------------------------------------

The fix involves adding alpha_0 as an argument to rDirichlet and using it in the rBeta arguments:

{code}
  public static Vector rDirichlet(Vector totalCounts, double alpha_0) {
    Vector pi = totalCounts.like();
    double total = totalCounts.zSum();
    double remainder = 1.0; // portion of the unit stick not yet assigned
    for (int k = 0; k < pi.size(); k++) {
      double countK = totalCounts.get(k);
      total -= countK; // total is now the sum of counts for models after k
      double betaK = rBeta(1 + countK, Math.max(0, alpha_0 + total));
      double piK = betaK * remainder;
      pi.set(k, piK);
      remainder -= piK; // equivalent to remainder *= (1 - betaK)
    }
    return pi;
  }
{code}

The fix also makes some small changes to DirichletState and DirichletMapper so that they pass in the specified alpha_0 value. In addition, it changes DirichletState's initialization of the prior model totalCounts to 0.0 instead of alpha_0/k.

> Alpha_0 mixture parameter is not implemented correctly in Dirichlet
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-276
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-276
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.2
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>
> I looked over the R reference code and alpha_0 is used in two places, not one
> as in the current implementation:
> - in state initialization "beta = rbeta(K, 1, alpha_0)" [K is the number of
> models]
> - during state update "beta[k] = rbeta(1, 1 + counts[k], alpha_0 +
> N-counts[k])" [N is the cardinality of the sample vector and counts
> corresponds to totalCounts in the implementation]
> The value of beta[k] is then used in the Dirichlet distribution calculation
> which results in the mixture probabilities pi[i], for the iteration:
>      other = 1                             # product accumulator
>      for (k in 1:K) {
>        pi[k] = beta[k] * other;            # beta_k * prod_{n<k} (1 - beta_n)
>        other = other * (1-beta[k])
>      }
> Alpha_0 determines the probability a point will go into an empty cluster,
> mostly during the first iteration.
> During the first iteration, the total
> counts of all prior clusters are zero. Thus the Beta calculation that drives
> the Dirichlet distribution that determines the mixture probabilities
> degenerates to beta = rBeta(1, alpha_0). Clusters that end up with points for
> the next iteration will overwhelm the small constants (alpha_0, 1) and
> subsequent new mixture probabilities will derive from beta ~= rBeta(count,
> total) which is the current implementation. All empty clusters will
> subsequently be driven by beta ~= rBeta(1, total) as alpha_0 is insignificant
> and count is 0.
> The current implementation ends up using beta = rBeta(alpha_0/k, alpha_0) as
> initial values during all iterations because the counts are all initialized
> to alpha_0/k. Close but no cigar.
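
For anyone who wants to check the arithmetic without a Mahout build, here is a minimal, self-contained sketch in plain Java. It is an illustration only, not the committed patch: the class and method names (StickBreakingSketch, betaParams, stickBreak) are hypothetical, and no sampling is done. betaParams computes the two arguments that the patched rDirichlet passes to rBeta for each model, and stickBreak applies the same remainder update to beta values supplied by the caller.

```java
// Hypothetical helper class for illustration; not part of Mahout.
final class StickBreakingSketch {

    // Returns {a, b} per model, mirroring rBeta(1 + countK, max(0, alpha_0 + total))
    // where total is the sum of the counts of the models after k.
    static double[][] betaParams(double[] totalCounts, double alpha0) {
        double total = 0.0;
        for (double c : totalCounts) {
            total += c;
        }
        double[][] params = new double[totalCounts.length][2];
        for (int k = 0; k < totalCounts.length; k++) {
            total -= totalCounts[k];
            params[k][0] = 1 + totalCounts[k];
            params[k][1] = Math.max(0, alpha0 + total);
        }
        return params;
    }

    // Stick-breaking step: pi[k] = beta[k] * prod_{n<k} (1 - beta[n]).
    // The beta values are given here; in the real code they are Beta samples.
    static double[] stickBreak(double[] beta) {
        double[] pi = new double[beta.length];
        double remainder = 1.0;
        for (int k = 0; k < beta.length; k++) {
            pi[k] = beta[k] * remainder;
            remainder -= pi[k]; // same as remainder *= (1 - beta[k])
        }
        return pi;
    }

    public static void main(String[] args) {
        // First iteration: all counts are zero, alpha_0 assumed to be 1.0.
        double[][] params = betaParams(new double[] {0, 0, 0}, 1.0);
        for (double[] p : params) {
            System.out.println("rBeta(" + p[0] + ", " + p[1] + ")");
        }
    }
}
```

With all counts at zero, as in the first iteration, betaParams yields (1, alpha_0) for every model, which is the degenerate case described in the issue. Once some models have counts, the first argument grows with the model's own count and the second with the counts of the models that follow it.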