[jira] [Commented] (MAHOUT-684) Topics regularization for LDA

Jake Mannix (JIRA) Sat, 30 Apr 2011 22:53:47 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027431#comment-13027431
 ]


Jake Mannix commented on MAHOUT-684:
------------------------------------

Hi Vasil,

  I've been trying to combine this patch with the one I have on 
MAHOUT-682 (similar to what you've got in MAHOUT-683).  Apart from getting 
tripped up on all the static methods (which are not so great for unit 
testing and break encapsulation pretty badly), LDADriver#writeNewAlpha() 
seems to do very strange things.  It first loads the entire LDAState with 
createState(), then iterates over the entire HDFS-serialized intermediate 
state (which should be the same data that createState() iterates over, 
right?) to find the digammaGamma vector, then does some cool estimation of 
the new alpha, and finally creates a SequenceFile.Writer to write the entire 
state back out again (now with the newly estimated alpha).  The IO behavior 
of this seems pretty atrocious.

  I'd really like to get this new alpha estimation in; it looks great, 
but we've got to clean up the way we're reading and writing state to HDFS.  
At the bare minimum, we should read the intermediate state once after every 
iteration and write it back out (with the new alpha) once.  Better still: use 
multiple Paths and multiple outputs (although this is yet another thing that 
the Hadoop 0.20 API doesn't support: you have to go back to the deprecated 
o.a.h.mapred codebase to do this, just like for map-side joins, ARG!).
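
  A minimal sketch of the read-once, write-once pattern described above.  
Plain Java collections stand in for the SequenceFile reader and writer so the 
example stays self-contained.  The record kinds, the Entry class, and the 
estimator callback are hypothetical names made up for illustration; they are 
not Mahout's actual state layout or API.

```java
import java.util.List;
import java.util.function.BinaryOperator;

/** Illustrative only: one pass over the state, one pass writing it back. */
final class SinglePassStateRewrite {
  // Hypothetical record kinds; NOT Mahout's real key layout.
  static final int TOPIC_WORD = 0;
  static final int DIGAMMA_GAMMA = 1;
  static final int ALPHA = 2;

  /** Hypothetical state record: (kind, topic, word, value). */
  static final class Entry {
    final int kind;
    final int topic;
    final int word;
    final double value;

    Entry(int kind, int topic, int word, double value) {
      this.kind = kind;
      this.topic = topic;
      this.word = word;
      this.value = value;
    }
  }

  /**
   * Streams every record from reader to writer exactly once. Old alpha
   * records are held back, the digammaGamma values are collected on the
   * way through, and the re-estimated alpha is appended at the end.
   * Returns the number of records read.
   */
  static int rewrite(Iterable<Entry> reader, List<Entry> writer, int numTopics,
                     BinaryOperator<double[]> estimator) {
    double[] oldAlpha = new double[numTopics];
    double[] digammaGamma = new double[numTopics];
    int recordsRead = 0;
    for (Entry e : reader) {
      recordsRead++;
      if (e.kind == ALPHA) {
        oldAlpha[e.topic] = e.value; // not copied; replaced below
        continue;
      }
      if (e.kind == DIGAMMA_GAMMA) {
        digammaGamma[e.topic] = e.value;
      }
      writer.add(e); // everything else passes straight through
    }
    // estimator: (oldAlpha, digammaGamma) -> newAlpha, supplied by the caller
    double[] newAlpha = estimator.apply(oldAlpha, digammaGamma);
    for (int t = 0; t < numTopics; t++) {
      writer.add(new Entry(ALPHA, t, -1, newAlpha[t]));
    }
    return recordsRead;
  }
}
```

  Writing the alpha records last is what lets the copy happen in the same 
pass as the scan for digammaGamma, so nothing has to be loaded with 
createState() first.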

  Do you think you could help me incorporate this algorithm improvement into a 
patch once I've got MAHOUT-682 merged into trunk?

> Topics regularization for LDA
> -----------------------------
>
>                 Key: MAHOUT-684
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-684
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA.
>         Attachments: MAHOUT-684.patch
>
>
> Implementation provided for the alpha parameters estimation as described in 
> the paper of Blei, Ng and Jordan 
> (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
> Remark: there is a mistake in the last formula in A.4.2 (the signs are 
> wrong). The correct version is described here: 
> http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
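
For readers following the sign discussion above, here is a minimal sketch of a 
Newton-Raphson update for the Dirichlet parameter alpha.  It is not the code 
from MAHOUT-684.patch.  The gradient and Hessian are derived directly from the 
Dirichlet log-likelihood: g_i = M (Psi(sum_j alpha_j) - Psi(alpha_i)) + s_i and 
H_ij = M (Psi'(sum_j alpha_j) - delta(i,j) Psi'(alpha_i)), where M is the number 
of documents and s_i is the sum over documents of 
Psi(gamma_di) - Psi(sum_j gamma_dj).  The class name, method names, the series 
approximations for digamma and trigamma, and the step-halving safeguard that 
keeps alpha positive are all assumptions made for this example.

```java
/** Illustrative Newton-Raphson sketch for the Dirichlet parameter alpha. */
final class DirichletAlphaNewton {

  /** Digamma for x > 0: upward recurrence, then an asymptotic series. */
  static double digamma(double x) {
    double result = 0.0;
    while (x < 6.0) {
      result -= 1.0 / x;
      x += 1.0;
    }
    double inv2 = 1.0 / (x * x);
    return result + Math.log(x) - 0.5 / x
        - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0));
  }

  /** Trigamma for x > 0: upward recurrence, then an asymptotic series. */
  static double trigamma(double x) {
    double result = 0.0;
    while (x < 6.0) {
      result += 1.0 / (x * x);
      x += 1.0;
    }
    double inv = 1.0 / x;
    double inv2 = inv * inv;
    return result + inv + 0.5 * inv2
        + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0));
  }

  /**
   * One Newton step alpha - H^{-1} g. The Hessian has the form
   * diag(h) + z * 1 1^T, so the step is computed in linear time as
   * (g_i - c) / h_i with c = (sum_j g_j / h_j) / (1 / z + sum_j 1 / h_j).
   */
  static double[] newtonStep(double[] alpha, double[] suffStats, int numDocs) {
    int k = alpha.length;
    double alphaSum = 0.0;
    for (double a : alpha) {
      alphaSum += a;
    }
    double digammaSum = digamma(alphaSum);
    double z = numDocs * trigamma(alphaSum);
    double[] g = new double[k];
    double[] h = new double[k];
    double gOverH = 0.0;
    double invH = 0.0;
    for (int i = 0; i < k; i++) {
      g[i] = numDocs * (digammaSum - digamma(alpha[i])) + suffStats[i];
      h[i] = -numDocs * trigamma(alpha[i]);
      gOverH += g[i] / h[i];
      invH += 1.0 / h[i];
    }
    double c = gOverH / (1.0 / z + invH);
    // Assumed safeguard: halve the step until every component stays positive.
    double[] next = new double[k];
    double scale = 1.0;
    boolean positive;
    do {
      positive = true;
      for (int i = 0; i < k; i++) {
        next[i] = alpha[i] - scale * (g[i] - c) / h[i];
        if (next[i] <= 0.0) {
          positive = false;
        }
      }
      scale *= 0.5;
    } while (!positive);
    return next;
  }

  /** Iterates newtonStep until the largest change drops below tol. */
  static double[] estimate(double[] init, double[] suffStats, int numDocs,
                           int maxIter, double tol) {
    double[] alpha = init.clone();
    for (int iter = 0; iter < maxIter; iter++) {
      double[] next = newtonStep(alpha, suffStats, numDocs);
      double change = 0.0;
      for (int i = 0; i < alpha.length; i++) {
        change = Math.max(change, Math.abs(next[i] - alpha[i]));
      }
      alpha = next;
      if (change < tol) {
        break;
      }
    }
    return alpha;
  }
}
```

With these signs every h_i is negative and z is positive, which is what keeps 
the Newton direction an ascent direction for the log-likelihood.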

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
