Err, I screwed up the post of this. I was trying to use nabble, which I've already decided to swear off.
On Mon, Jan 11, 2010 at 2:23 AM, Grant Ingersoll <[email protected]> wrote:

> 1. The LDA implementation (and this seems to be reinforced by my reading
> on topic models in general) is that the topics themselves don't have
> "names". I can see why this is difficult (in some ways, you're summarizing
> a summary), but am curious whether anyone has done any work on such a
> thing, as without them it still requires a fair amount of work by the
> human to infer what the topics are. I suppose you could just pick the top
> few terms, but it seems like a common phrase or something would go
> further. Also, I believe someone in the past mentioned some more recent
> work by Blei and Lafferty (Blei and Lafferty. Visualizing Topics with
> Multi-Word Expressions. stat (2009) vol. 1050 pp. 6)

It's a big problem. David Blei's students Jonathan Chang and Jordan
Boyd-Graber have another paper out called "Reading Tea Leaves: How Humans
Interpret Topic Models" at NIPS this year that I haven't had a chance to
read yet that might shed some light. Usually the "top-k" words serve as a
pretty good summary of a topic, particularly if you've stop-worded out
useless words. (There's a rough sketch of the top-k idea at the end of this
message.)

In some sense, I've come to believe that assigning a label to a topic
reifies it more than it really deserves. Topics are in a lot of ways like
eigenvectors/eigenfaces; you don't really assign a name (or even a visual
word) to the fourth eigenface, even if it looks like it might be smiling a
little bit...

-- David

On Sun, Jan 10, 2010 at 8:32 PM, dlwh <[email protected]> wrote:

>
> Robin Anil wrote:
>>
>> http://www.lucidimagination.com/search/document/3ae15062f35420cf/lda_for_multi_label_classification_was_mahout_book
>>
>> David gave me a very nice paper which talks about tag-document
>> correlation. If you start with named labels, it does end up being a
>> naive Bayes classifier.
>>
>
> One caveat on this: it reduces to NB only when there is exactly one
> observed label per document. Otherwise you have to do some kind of
> inference to figure out which words belong to which label.
>
>
> Robin Anil wrote:
>>
>> On Mon, Jan 11, 2010 at 2:23 AM, Grant Ingersoll <[email protected]> wrote:
>>
>>> A couple of things strike me about LDA, and I wanted to hear others'
>>> thoughts:
>>>
>>> 1. The LDA implementation (and this seems to be reinforced by my reading
>>> on topic models in general) is that the topics themselves don't have
>>> "names". I can see why this is difficult (in some ways, you're
>>> summarizing a summary), but am curious whether anyone has done any work
>>> on such a thing, as without them it still requires a fair amount of work
>>> by the human to infer what the topics are. I suppose you could just pick
>>> the top few terms, but it seems like a common phrase or something would
>>> go further. Also, I believe someone in the past mentioned some more
>>> recent work by Blei and Lafferty (Blei and Lafferty. Visualizing Topics
>>> with Multi-Word Expressions. stat (2009) vol. 1050 pp. 6) to alleviate
>>> that.
>>
>> It's a big problem. David Blei's students Jonathan Chang and Jordan
>> Boyd-Graber have another paper out called "Reading Tea Leaves: How Humans
>> Interpret Topic Models" at NIPS this year that I haven't had a chance to
>> read yet that might shed some light. Usually the "top-k" words serve as a
>> pretty good summary of a topic, particularly if you've stop-worded out
>> useless words.
>>
>>>
>>> 2. We get the words in the topic, but how do we know which documents
>>> have those topics? I think, based on reading the paper, that the answer
>>> is "You don't get to know", but I'm not sure.
>>>
>> If I am correct, you do get to know, based on the words in the document,
>> which of those un-labelled topics are in the document, with an affinity
>> score for each. You can sort them or do some form of testing to keep only
>> the significant ones.
>>
>
> So, the output of what we have implemented at the moment doesn't give you
> p(topic|document), but this is actually really easy, and could be done in
> about 20 minutes to an hour. LDAInference (called in the Mapper, which is
> basically the E-step) does all of the necessary work to learn
> p(topic|document), but it then just outputs sufficient statistics for
> p(word|topic). If instead we had a different Mapper that output
> <DOC-ID, p(topic|document) for all topics>, you'd have that.
>
> That much is probably about 20 lines of logical code, along with the usual
> mess of Hadoop boilerplate. If you want it, I'll code it up.
>
> -- David
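
P.S. For anyone who wants to play with the "top-k words" summary mentioned
above, here is a small, untested toy sketch of the sort-and-take-k step
once you have per-topic word scores in hand. None of the names here
(TopWordsPerTopic, topicWordScores, etc.) are Mahout's actual classes or
output format; it just shows the idea.

import java.util.ArrayList;
import java.util.List;

/** Toy example: summarize each topic by its k highest-scoring words. */
public class TopWordsPerTopic {

  /** wordScores[w] ~ p(word w | topic); dictionary[w] is the term for id w. */
  public static List<String> topK(double[] wordScores, String[] dictionary, int k) {
    boolean[] taken = new boolean[wordScores.length];
    List<String> top = new ArrayList<String>();
    for (int i = 0; i < k && i < wordScores.length; i++) {
      // Repeatedly pick the highest-scoring word not chosen yet.
      int best = -1;
      for (int w = 0; w < wordScores.length; w++) {
        if (!taken[w] && (best < 0 || wordScores[w] > wordScores[best])) {
          best = w;
        }
      }
      taken[best] = true;
      top.add(dictionary[best]);
    }
    return top;
  }

  public static void main(String[] args) {
    String[] dictionary = {"hadoop", "cluster", "topic", "word", "model"};
    double[][] topicWordScores = {
        {0.40, 0.35, 0.05, 0.05, 0.15},  // topic 0 leans towards "hadoop"/"cluster"
        {0.05, 0.05, 0.45, 0.30, 0.15},  // topic 1 leans towards "topic"/"word"
    };
    for (int t = 0; t < topicWordScores.length; t++) {
      System.out.println("topic " + t + ": " + topK(topicWordScores[t], dictionary, 3));
    }
  }
}

Stop-wording (or just dropping very frequent terms from the dictionary)
before this step usually makes the resulting word lists far more readable.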
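And since the <DOC-ID, p(topic|document)> mapper came up above: below is a
very rough sketch of the shape that mapper could take. Only the Hadoop
Mapper plumbing is meant literally; the inference call
(inferTopicDistribution), the input format, and the configuration key are
placeholders, not the real Mahout API. In Mahout the actual inference would
be the work LDAInference already does in the E-step.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of a mapper that emits <doc-id, p(topic|doc)> instead of the
 * sufficient statistics for p(word|topic). Everything other than the
 * Hadoop plumbing is a stand-in for the existing LDA inference code.
 */
public class TopicDistributionMapper extends Mapper<Text, Text, Text, Text> {

  private int numTopics;

  @Override
  protected void setup(Context context) {
    // Hypothetical configuration key; real code would also load the
    // trained LDA model (the p(word|topic) state) here.
    numTopics = context.getConfiguration().getInt("example.lda.numTopics", 20);
  }

  @Override
  protected void map(Text docId, Text termCounts, Context context)
      throws IOException, InterruptedException {
    // Placeholder: run the E-step for this document against the trained
    // model and get back a normalized distribution over topics.
    double[] pTopicGivenDoc = inferTopicDistribution(termCounts.toString(), numTopics);

    // Emit <doc-id, p(topic|doc)> as a comma-separated list; a real
    // implementation would use a proper vector Writable instead.
    StringBuilder sb = new StringBuilder();
    for (int k = 0; k < pTopicGivenDoc.length; k++) {
      if (k > 0) {
        sb.append(',');
      }
      sb.append(pTopicGivenDoc[k]);
    }
    context.write(docId, new Text(sb.toString()));
  }

  /** Stand-in for the real variational inference; returns a uniform distribution. */
  private double[] inferTopicDistribution(String termCounts, int k) {
    double[] dist = new double[k];
    java.util.Arrays.fill(dist, 1.0 / k);
    return dist;
  }
}

That lines up with the "20 lines of logical code plus Hadoop boilerplate"
estimate above; the real work is swapping the stub for the inference that
the existing Mapper already performs.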
