On Thu, Jan 6, 2011 at 11:08 AM, Neal Richter <[email protected]> wrote:
> > That said, your suggestion is a reasonable one. If you use the LDA
> > topic distribution for each document as a feature vector for a
> > supervised model, then it is pretty easy to argue that LDA
> > distributions that give better model performance are better at
> > capturing content. The supervised step is necessary, however, since
> > there is no guarantee that the LDA topics will have a simple
> > relationship to human-assigned categories.
>
> If one thinks of the LDA outputting a distribution of topics for a given
> document... then at some point a real decision is made to output N topic
> labels... it looks like a classifier now.

It is true that anything that takes features in and produces scores can be
called a classifier. The key distinction is whether the model is derived
in a supervised or unsupervised fashion. Models derived by supervised
training can be evaluated by holding out training data. It is common (and
sloppy) to refer to models derived with supervised learning as classifiers
and models derived with unsupervised learning as clustering. In this
nomenclature, LDA is a clustering algorithm.

Models derived by unsupervised training cannot be evaluated against held
out data that has assigned labels because there is no reason that the
unsupervised results should correlate in an obvious way to the desired
labels. There are various figures of merit for unsupervised models, but
the one that I prefer is "how useful is the output of the unsupervised
model?". The simplest measure of utility lying around in this case is
whether the unsupervised model produces features that can be used to build
a supervised model. An unsupervised model that cannot be so used is not
providing us with usable information. Hopefully, the supervised model is
very simple, possibly even just a one-to-one rearrangement of unsupervised
scores.

> I'm suggesting that one can do a classification accuracy test of the LDA
> predicted label set with a set of human generated labels from tagging
> data.
>
> 1) Document.DataVector
> 2) Document.LabelsVector
> 3) Run LDA on Document.DataVector to generate
>    Document.ExtractedTopicsVector
>
> Compute accuracy by comparing Document.LabelsVector
> to Document.ExtractedTopicsVector

My point is exactly that this evaluation will lead to nonsense. The size
of the extracted topics vector isn't even necessarily the same as the size
of the labels vector. There is also no guarantee that it would be in the
same order.

> There will be misses if the human labeled/tagged term or phrase does not
> exist within the document's text or metadata. LDA can't see these unless
> some augmentation/inference step is run on the document vector prior to
> LDA input.

Actually, you will likely get worse than random results. What you need is
one extra step where you build a supervised classifier that uses the
extracted topics vector to predict the label. The accuracy of this
supervised classifier is a measure of how well the extracted topics encode
the information in the labels.
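
To make the two-step evaluation concrete, here is a rough sketch in Python
with scikit-learn. It is purely illustrative and has nothing to do with the
Mahout code path; the 20 newsgroups data, the topic count, and the use of
logistic regression as the "very simple" supervised model are arbitrary
choices on my part:

    # Sketch: judge LDA topic features by how well a simple supervised
    # classifier trained on them recovers the human-assigned labels.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    data = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes'))
    counts = CountVectorizer(max_features=10000,
                             stop_words='english').fit_transform(data.data)

    # Unsupervised step: per-document topic distributions (the LDA output).
    topics = LatentDirichletAllocation(n_components=50,
                                       random_state=0).fit_transform(counts)

    # Supervised step: a simple classifier maps topic distributions to
    # the human labels, and is evaluated on held-out documents.
    X_train, X_test, y_train, y_test = train_test_split(
        topics, data.target, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))

The held-out accuracy of that last classifier, not a direct comparison of
topic ids to label ids, is the number that tells you whether the topic
distributions are capturing the content.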
