[ https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027431#comment-13027431 ]
Jake Mannix commented on MAHOUT-684:
------------------------------------
Hi Vasil,
I've been trying to combine this patch with the one I have on MAHOUT-682
(similar to what you've got in MAHOUT-683). Apart from getting tripped up on
all the static methods (which are not so great for unit testing and break
encapsulation pretty badly), LDADriver#writeNewAlpha() seems to do very
strange things. It first loads the entire LDAState with createState(), then
iterates over the entire HDFS-serialized intermediate state (which should be
the same data that createState() iterates over, right?), finds the
digammaGamma vector, does some cool estimation of the new alpha, and then
creates a SequenceFileWriter to write the entire state back out again (now
with the newly estimated alpha). So the full state is read twice and written
once just to update alpha. The I/O behavior of this seems pretty atrocious.
I'd really like to get this new alpha estimation in, it looks great, but
we've got to clean up the way we're reading/writing state to HDFS. At the
bare minimum, we should read the intermediate state once after every
iteration and write it back out (with the new alpha) once. Better than that:
use multiple Paths and multiple outputs (although this is yet again something
the Hadoop 0.20 API does not support; you have to go back to the deprecated
o.a.h.mapred codebase to do this, just like for map-side joins, ARG!).
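
A rough sketch of the "read once, write once" idea, to make the intended I/O
pattern concrete. It is not the LDADriver code: the HDFS reader and writer are
replaced by a plain Iterable and Consumer, and the Entry type, the key names
"alpha" and "digammaGamma", and the estimator callback are all hypothetical
stand-ins. Every record is copied straight through as it is read, the old
alpha is held back, and the new alpha is appended at the end once the
statistics it depends on have been seen.

```java
import java.util.function.BinaryOperator;
import java.util.function.Consumer;

public final class SinglePassStateRewrite {

  /** Hypothetical stand-in for one key/value record of the serialized state. */
  public static final class Entry {
    public final String key;
    public final double[] value;

    public Entry(String key, double[] value) {
      this.key = key;
      this.value = value;
    }
  }

  // Hypothetical key names; the real state has its own key encoding.
  static final String ALPHA_KEY = "alpha";
  static final String DIGAMMA_GAMMA_KEY = "digammaGamma";

  /**
   * Reads each record exactly once and writes each record exactly once.
   * The estimator receives (oldAlpha, digammaGamma) and returns the new alpha.
   */
  public static void rewrite(Iterable<Entry> reader, Consumer<Entry> writer,
                             BinaryOperator<double[]> estimator) {
    double[] oldAlpha = null;
    double[] digammaGamma = null;
    for (Entry e : reader) {
      if (ALPHA_KEY.equals(e.key)) {
        oldAlpha = e.value;      // held back; the new alpha is written below
        continue;
      }
      if (DIGAMMA_GAMMA_KEY.equals(e.key)) {
        digammaGamma = e.value;  // remembered for the estimator, still copied
      }
      writer.accept(e);          // pass-through copy, no second read needed
    }
    if (oldAlpha == null || digammaGamma == null) {
      throw new IllegalStateException("state is missing alpha or digammaGamma");
    }
    writer.accept(new Entry(ALPHA_KEY, estimator.apply(oldAlpha, digammaGamma)));
  }
}
```

The sketch assumes nothing downstream depends on where the alpha record sits
in the output; if the order matters, the records would have to be buffered or
alpha written to a separate Path.
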
Do you think you could help me incorporate this algorithm improvement into a
patch once I've got MAHOUT-682 merged into trunk?
> Topics regularization for LDA
> -----------------------------
>
> Key: MAHOUT-684
> URL: https://issues.apache.org/jira/browse/MAHOUT-684
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Vasil Vasilev
> Priority: Minor
> Labels: LDA.
> Attachments: MAHOUT-684.patch
>
>
> An implementation of the alpha parameter estimation described in the paper
> by Blei, Ng and Jordan
> (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
> Remark: there is a mistake in the last formula in A.4.2 (the signs are
> wrong). The correct version is described here:
> http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
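
For reference, a minimal sketch of one Newton-Raphson step for alpha, using
the linear-time Hessian inversion from appendix A.2 of the paper with the
corrected signs, i.e. H_ij = M * (trigamma(sum alpha) - delta_ij *
trigamma(alpha_i)). This is not the code in MAHOUT-684.patch: the class name
AlphaEstimator, the suffStats parameter (assumed to hold, per topic i, the sum
over documents of digamma(gamma_di) - digamma(sum_j gamma_dj)) and the series
approximations of digamma and trigamma are assumptions made for illustration.

```java
public final class AlphaEstimator {

  private final int numDocs; // M in the paper

  public AlphaEstimator(int numDocs) {
    if (numDocs <= 0) {
      throw new IllegalArgumentException("numDocs must be positive");
    }
    this.numDocs = numDocs;
  }

  /** One Newton-Raphson update: alpha - H^{-1} g, computed in O(k). */
  public double[] newtonStep(double[] alpha, double[] suffStats) {
    int k = alpha.length;
    double alphaSum = 0.0;
    for (double a : alpha) {
      alphaSum += a;
    }
    double digammaSum = digamma(alphaSum);
    double z = numDocs * trigamma(alphaSum);  // constant part of the Hessian

    double[] g = new double[k];               // gradient
    double[] h = new double[k];               // diagonal part of the Hessian
    double sumGOverH = 0.0;
    double sumInvH = 0.0;
    for (int i = 0; i < k; i++) {
      g[i] = numDocs * (digammaSum - digamma(alpha[i])) + suffStats[i];
      h[i] = -numDocs * trigamma(alpha[i]);
      sumGOverH += g[i] / h[i];
      sumInvH += 1.0 / h[i];
    }
    double c = sumGOverH / (1.0 / z + sumInvH);

    double[] next = new double[k];
    for (int i = 0; i < k; i++) {
      next[i] = alpha[i] - (g[i] - c) / h[i]; // no positivity safeguard here
    }
    return next;
  }

  /** Digamma for x > 0: upward recurrence, then an asymptotic series. */
  static double digamma(double x) {
    double result = 0.0;
    while (x < 6.0) {
      result -= 1.0 / x;
      x += 1.0;
    }
    double inv = 1.0 / x;
    double inv2 = inv * inv;
    // ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + Math.log(x) - 0.5 * inv
        - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0));
  }

  /** Trigamma for x > 0: upward recurrence, then an asymptotic series. */
  static double trigamma(double x) {
    double result = 0.0;
    while (x < 6.0) {
      result += 1.0 / (x * x);
      x += 1.0;
    }
    double inv = 1.0 / x;
    double inv2 = inv * inv;
    // 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    return result + inv + 0.5 * inv2
        + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0));
  }
}
```

A caller would repeat newtonStep until alpha stops changing. A plain Newton
step can overshoot to non-positive values, so a real implementation would
likely need a step-size check or an update in log space.
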