[ 
https://issues.apache.org/jira/browse/MAHOUT-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830740#action_12830740
 ] 

Jeff Eastman commented on MAHOUT-276:
-------------------------------------

The fix involves adding alpha_0 as an argument to rDirichlet and using it in 
the rBeta arguments:
{code}
  public static Vector rDirichlet(Vector totalCounts, double alpha_0) {
    Vector pi = totalCounts.like();
    double total = totalCounts.zSum();
    // portion of the unit "stick" not yet assigned to a cluster
    double remainder = 1.0;
    for (int k = 0; k < pi.size(); k++) {
      double countK = totalCounts.get(k);
      // total now holds the summed counts of the clusters after k
      total -= countK;
      double betaK = rBeta(1 + countK, Math.max(0, alpha_0 + total));
      // cluster k takes the fraction betaK of what remains
      double piK = betaK * remainder;
      pi.set(k, piK);
      remainder -= piK;
    }
    return pi;
  }
{code}
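
To see what the loop does when every count is zero, as in the first 
iteration, here is a minimal sketch of the same loop on plain arrays. It is 
not the Mahout code: the class name, the array-based signature and the 
pluggable rBeta argument are assumptions made for illustration, and the 
stand-in "sampler" returns the Beta mean a / (a + b) instead of a random draw 
so that the result is repeatable.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

// Hypothetical stand-alone class; not part of Mahout.
class StickBreakingSketch {

  // Same loop as rDirichlet above, on plain arrays. rBeta is passed in so
  // the sampler can be swapped (an assumption made for this sketch only).
  static double[] rDirichlet(double[] totalCounts, double alpha0,
                             DoubleBinaryOperator rBeta) {
    double[] pi = new double[totalCounts.length];
    double total = 0.0;
    for (double c : totalCounts) {
      total += c;
    }
    double remainder = 1.0;
    for (int k = 0; k < pi.length; k++) {
      double countK = totalCounts[k];
      total -= countK; // counts of the clusters after k
      double betaK = rBeta.applyAsDouble(1 + countK, Math.max(0, alpha0 + total));
      double piK = betaK * remainder;
      pi[k] = piK;
      remainder -= piK;
    }
    return pi;
  }

  public static void main(String[] args) {
    // Stand-in for rBeta: the mean of Beta(a, b), not a random sample.
    DoubleBinaryOperator betaMean = (a, b) -> a / (a + b);
    // All counts zero and alpha_0 = 1.0: every call is rBeta(1, alpha_0).
    double[] pi = rDirichlet(new double[] {0.0, 0.0, 0.0}, 1.0, betaMean);
    System.out.println(Arrays.toString(pi)); // prints [0.5, 0.25, 0.125]
  }
}
```

With a real Beta sampler in place of the mean, each run would give different 
values, but the arguments passed to rBeta would be the same.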

The fix also makes small changes to DirichletState and DirichletMapper to 
pass in the specified alpha_0 value, and it changes DirichletState's 
initialization of the prior model totalCounts from alpha_0/k to 0.0.
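
As a rough illustration of why the initialization matters, the sketch below 
compares the mean of the two Beta distributions named in the issue 
description: Beta(alpha_0/k, alpha_0) for counts initialized to alpha_0/k, 
and Beta(1, alpha_0) for counts initialized to 0.0. The class name and the 
value k = 10 are hypothetical; the only outside fact used is that the mean of 
Beta(a, b) is a / (a + b).

```java
// Hypothetical stand-alone class; not part of Mahout.
class BetaMeanComparison {

  // Mean of a Beta(a, b) distribution.
  static double betaMean(double a, double b) {
    return a / (a + b);
  }

  public static void main(String[] args) {
    int k = 10; // assumed number of models, for illustration only
    for (double alpha0 : new double[] {0.1, 1.0, 10.0}) {
      // counts initialized to alpha_0/k, per the issue description
      double before = betaMean(alpha0 / k, alpha0);
      // counts initialized to 0.0, per the fix
      double after = betaMean(1.0, alpha0);
      System.out.printf("alpha_0=%.1f before=%.4f after=%.4f%n",
          alpha0, before, after);
    }
  }
}
```

Under these assumptions the "before" mean simplifies to 1 / (1 + k) whatever 
alpha_0 is, while the "after" mean is 1 / (1 + alpha_0), so only the second 
responds to the alpha_0 setting.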


> Alpha_0 mixture parameter is not implemented correctly in Dirichlet
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-276
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-276
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.2
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>
> I looked over the R reference code and alpha_0 is used in two places, not one 
> as in the current implementation:
> - in state initialization "beta = rbeta(K, 1, alpha_0)" [K is the number of 
> models]
> - during state update "beta[k] = rbeta(1, 1 + counts[k], alpha_0 + 
> N-counts[k])" [N is the cardinality of the sample vector and counts 
> corresponds to totalCounts in the implementation]
> The value of beta[k] is then used in the Dirichlet distribution calculation 
> which results in the mixture probabilities pi[i], for the iteration:
>    other = 1                                     # product accumulator
>    for (k in 1:K) {
>      pi[k] = beta[k] * other;                    # beta_k * prod_{n<k} beta_n
>      other = other * (1-beta[k])
>    }
> Alpha_0 determines the probability that a point will go into an empty 
> cluster, mostly during the first iteration. During the first iteration, the 
> total counts of all prior clusters are zero, so the Beta calculation that 
> drives the Dirichlet distribution for the mixture probabilities degenerates 
> to beta = rBeta(1, alpha_0). For clusters that end up with points, the counts 
> overwhelm the small constants (alpha_0, 1) in the next iteration, and the new 
> mixture probabilities derive from beta ~= rBeta(count, total), which is what 
> the current implementation computes. All empty clusters are subsequently 
> driven by beta ~= rBeta(1, total), since alpha_0 is insignificant and count 
> is 0.
> The current implementation ends up using beta = rBeta(alpha_0/k, alpha_0) as 
> initial values during all iterations because the counts are all initialized 
> to alpha_0/k. Close but no cigar.

