On Thu, Jan 6, 2011 at 11:08 AM, Neal Richter <[email protected]> wrote:
> > That said, your suggestion is a reasonable one. If you use the LDA
> > topic distribution for each document as a feature vector for a
> > supervised model, then it is pretty easy to argue that LDA
> > distributions that give better model performance are better at
> > capturing content. The supervised step is necessary, however, since
> > there is no guarantee that the LDA topics will have a simple
> > relationship to human-assigned categories.
>
> If one thinks of the LDA outputting a distribution of topics for a given
> document... then at some point a real decision is made to output N topic
> labels... it looks like a classifier now.

It is true that anything that takes features in and produces scores can be
called a classifier. The key distinction is whether the model is derived
in a supervised or unsupervised fashion. Models derived by supervised
training can be evaluated by holding out training data. It is common (and
sloppy) to refer to models derived with supervised learning as classifiers
and models derived with unsupervised learning as clustering. In this
nomenclature, LDA is a clustering algorithm.

Models derived by unsupervised training cannot be evaluated against held
out data that has assigned labels because there is no reason that the
unsupervised results should correlate in an obvious way to the desired
labels. There are various figures of merit for unsupervised models, but
the one that I prefer is "how useful is the output of the unsupervised
model?". The simplest measure of utility lying around in this case is
whether the unsupervised model produces features that can be used to build
a supervised model. An unsupervised model that cannot be so used is not
providing us with usable information. Hopefully, the supervised model is
very simple, possibly even just a one-to-one rearrangement of unsupervised
scores.

> I'm suggesting that one can do a classification accuracy test of the LDA
> predicted label set with a set of human generated labels from tagging
> data.
>
> 1) Document.DataVector
> 2) Document.LabelsVector
> 3) Run LDA on Document.DataVector to generate
>    Document.ExtractedTopicsVector
>
> Compute accuracy by comparing Document.LabelsVector
> to Document.ExtractedTopicsVector

My point is exactly that this evaluation will lead to nonsense. The size
of the extracted topics vector isn't even necessarily the same as the size
of the labels vector. There is also no guarantee that it would be in the
same order.

> There will be misses if the human labeled/tagged term or phrase does not
> exist within the document's text or metadata. LDA can't see these unless
> some augmentation/inference step is run on the document vector prior to
> LDA input.

Actually, you will likely get worse than random results. What you need is
one extra step where you build a supervised classifier that uses the
extracted topics vector to predict the label. The accuracy of this
supervised classifier is a measure of how well the extracted topics encode
the information in the labels.
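
To make the two-step evaluation concrete, here is a rough sketch in Python
with scikit-learn. It is purely illustrative and has nothing to do with the
Mahout code path; the 20 newsgroups data, the topic count, and the use of
logistic regression as the "very simple" supervised model are arbitrary
choices on my part:

    # Sketch: judge LDA topic features by how well a simple supervised
    # classifier trained on them recovers the human-assigned labels.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    data = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes'))
    counts = CountVectorizer(max_features=10000,
                             stop_words='english').fit_transform(data.data)

    # Unsupervised step: per-document topic distributions (the LDA output).
    topics = LatentDirichletAllocation(n_components=50,
                                       random_state=0).fit_transform(counts)

    # Supervised step: a simple classifier maps topic distributions to
    # the human labels, and is evaluated on held-out documents.
    X_train, X_test, y_train, y_test = train_test_split(
        topics, data.target, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))

The held-out accuracy of that last classifier, not a direct comparison of
topic ids to label ids, is the number that tells you whether the topic
distributions are capturing the content.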
