The topics in LDA are not the same as topics in normal parlance.  They are
abstract, internal probabilistic distributions.

That said, your suggestion is a reasonable one.  If you use the LDA topic
distribution for each document as a feature vector for a supervised model
then it is pretty easy to argue that LDA distributions that give better
model performance are better at capturing content.  The supervised step is
necessary, however, since there is no guarantee that the LDA topics will
have a simple relationship to human-assigned categories.
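A minimal sketch of that supervised step, using scikit-learn as a stand-in (the toy corpus, labels, and library choice are illustrative assumptions, not anything from this thread): fit LDA, use each document's topic distribution as the feature vector, and train a classifier on top.

```python
# Hypothetical sketch: use LDA per-document topic proportions as features
# for a supervised classifier, then judge the topic model by classifier
# performance. Toy data; assumes scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = [
    "the cat sat on the mat", "dogs and cats are pets",
    "stocks fell on wall street", "the market rallied today",
]
labels = [0, 0, 1, 1]  # 0 = animals, 1 = finance

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
features = lda.fit_transform(counts)              # topic distribution per doc
clf = LogisticRegression().fit(features, labels)  # the supervised step
print(clf.score(features, labels))                # higher = topics capture content
```

Comparing this score across different LDA runs (or hashed vs. unhashed input) gives the quality criterion the thread is asking for.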

On Wed, Jan 5, 2011 at 11:57 PM, Neal Richter <[email protected]> wrote:

> What about gauging its ability to predict the topics of labeled data?
>
> 1) Grab RSS feeds of blog posts and use the tags as labels
> 2) Delicious bookmarks & their content versus user tags
> 3) other examples abound...
>
> On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix <[email protected]>
> wrote:
>
> > Saying we have hashing is different than saying we know what will happen
> > to an algorithm once it's running over hashed features (as the continuing
> > work on our Stochastic SVD demonstrates).
> >
> > I can certainly try to run LDA over a hashed vector set, but I'm not sure
> > what criteria for correctness / quality of the topic model I should use
> > if I do.
> >
> >  -jake
> >
> > On Jan 4, 2011 7:21 AM, "Robin Anil" <[email protected]> wrote:
> >
> > We already have the second part - the hashing trick. Thanks to Ted, and
> > he has a mechanism to partially reverse engineer the feature as well. You
> > might be able to drop it directly in the job itself or even vectorize and
> > then run LDA.
> >
> > Robin
> >
> > On Tue, Jan 4, 2011 at 8:44 PM, Jake Mannix <[email protected]>
> > wrote:
> > > Hey Robin,
> > >
> > > Vowp...
>
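The hashing trick the quoted thread mentions (map tokens into a fixed-width vector by hashing, then run LDA over the hashed counts) can be sketched as follows. This is a hypothetical illustration using scikit-learn's HashingVectorizer, not Mahout's implementation; `alternate_sign=False` and `norm=None` keep the hashed values as non-negative counts, which LDA requires.

```python
# Hypothetical sketch of hashing-then-LDA: tokens are hashed into a
# fixed-width count vector (collisions are possible by design), and
# LDA is fit over the hashed counts. Toy data; assumes scikit-learn.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "the market rallied today"]

hv = HashingVectorizer(n_features=64, alternate_sign=False, norm=None)
hashed = hv.transform(docs)              # non-negative hashed counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(hashed)   # topic distribution per document
print(doc_topics.shape)
```

Whether the topics learned over hashed features are any good is exactly the open question above; the supervised-evaluation route is one way to answer it.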
