Err, I screwed up the post of this. I was trying to use nabble, which I've already decided to swear off.
On Mon, Jan 11, 2010 at 2:23 AM, Grant Ingersoll <[email protected]> wrote:

> 1. The LDA implementation (and this seems to be reinforced by my reading
> on topic models in general) is that the topics themselves don't have
> "names". I can see why this is difficult (in some ways, you're summarizing
> a summary), but am curious whether anyone has done any work on such a
> thing, as without them it still requires a fair amount of work by the
> human to infer what the topics are. I suppose you could just pick the top
> few terms, but it seems like a common phrase or something would go
> further. Also, I believe someone in the past mentioned some more recent
> work by Blei and Lafferty (Blei and Lafferty. Visualizing Topics with
> Multi-Word Expressions. stat (2009) vol. 1050 pp. 6)

It's a big problem. David Blei's students Jonathan Chang and Jordan
Boyd-Graber have another paper out called "Reading Tea Leaves: How Humans
Interpret Topic Models" at NIPS this year that I haven't had a chance to
read yet that might shed some light. Usually the "top-k" words serve as a
pretty good summary of a topic, particularly if you've stop-worded out
useless words. (There's a rough sketch of the top-k idea at the end of this
message.)

In some sense, I've come to believe that assigning a label to a topic
reifies it more than it really deserves. Topics are in a lot of ways like
eigenvectors/eigenfaces; you don't really assign a name (or even a visual
word) to the fourth eigenface, even if it looks like it might be smiling a
little bit...

-- David

On Sun, Jan 10, 2010 at 8:32 PM, dlwh <[email protected]> wrote:

>
> Robin Anil wrote:
>>
>> http://www.lucidimagination.com/search/document/3ae15062f35420cf/lda_for_multi_label_classification_was_mahout_book
>>
>> David gave me a very nice paper which talks about tag-document
>> correlation. If you start with named labels, it does end up being a
>> naive Bayes classifier.
>>
>
> One caveat on this: it reduces to NB only when there is exactly one
> observed label per document. Otherwise you have to do some kind of
> inference to figure out which words belong to which label.
>
>
> Robin Anil wrote:
>>
>> On Mon, Jan 11, 2010 at 2:23 AM, Grant Ingersoll <[email protected]> wrote:
>>
>>> A couple of things strike me about LDA, and I wanted to hear others'
>>> thoughts:
>>>
>>> 1. The LDA implementation (and this seems to be reinforced by my reading
>>> on topic models in general) is that the topics themselves don't have
>>> "names". I can see why this is difficult (in some ways, you're
>>> summarizing a summary), but am curious whether anyone has done any work
>>> on such a thing, as without them it still requires a fair amount of work
>>> by the human to infer what the topics are. I suppose you could just pick
>>> the top few terms, but it seems like a common phrase or something would
>>> go further. Also, I believe someone in the past mentioned some more
>>> recent work by Blei and Lafferty (Blei and Lafferty. Visualizing Topics
>>> with Multi-Word Expressions. stat (2009) vol. 1050 pp. 6) to alleviate
>>> that.
>>
>> It's a big problem. David Blei's students Jonathan Chang and Jordan
>> Boyd-Graber have another paper out called "Reading Tea Leaves: How Humans
>> Interpret Topic Models" at NIPS this year that I haven't had a chance to
>> read yet that might shed some light. Usually the "top-k" words serve as a
>> pretty good summary of a topic, particularly if you've stop-worded out
>> useless words.
>>
>>>
>>> 2. We get the words in the topic, but how do we know which documents
>>> have those topics? I think, based on reading the paper, that the answer
>>> is "You don't get to know", but I'm not sure.
>>>
>> If I am correct, you do get to know, based on the words in the document,
>> which of those un-labelled topics are in the document, with an affinity
>> score for each. You can sort them or do some form of testing to keep only
>> the significant ones.
>>
>
> So, the output of what we have implemented at the moment doesn't give you
> p(topic|document), but this is actually really easy, and could be done in
> about 20 minutes to an hour. LDAInference (called in the Mapper, which is
> basically the E-step) does all of the necessary work to learn
> p(topic|document), but it then just outputs sufficient statistics for
> p(word|topic). If instead we had a different Mapper that output
> <DOC-ID, p(topic|document) for all topics>, you'd have that.
>
> That much is probably about 20 lines of logical code, along with the usual
> mess of Hadoop boilerplate. If you want it, I'll code it up.
>
> -- David
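
P.S. For anyone who wants to play with the "top-k words" summary mentioned
above, here is a small, untested toy sketch of the sort-and-take-k step
once you have per-topic word scores in hand. None of the names here
(TopWordsPerTopic, topicWordScores, etc.) are Mahout's actual classes or
output format; it just shows the idea.

import java.util.ArrayList;
import java.util.List;

/** Toy example: summarize each topic by its k highest-scoring words. */
public class TopWordsPerTopic {

  /** wordScores[w] ~ p(word w | topic); dictionary[w] is the term for id w. */
  public static List<String> topK(double[] wordScores, String[] dictionary, int k) {
    boolean[] taken = new boolean[wordScores.length];
    List<String> top = new ArrayList<String>();
    for (int i = 0; i < k && i < wordScores.length; i++) {
      // Repeatedly pick the highest-scoring word not chosen yet.
      int best = -1;
      for (int w = 0; w < wordScores.length; w++) {
        if (!taken[w] && (best < 0 || wordScores[w] > wordScores[best])) {
          best = w;
        }
      }
      taken[best] = true;
      top.add(dictionary[best]);
    }
    return top;
  }

  public static void main(String[] args) {
    String[] dictionary = {"hadoop", "cluster", "topic", "word", "model"};
    double[][] topicWordScores = {
        {0.40, 0.35, 0.05, 0.05, 0.15},  // topic 0 leans towards "hadoop"/"cluster"
        {0.05, 0.05, 0.45, 0.30, 0.15},  // topic 1 leans towards "topic"/"word"
    };
    for (int t = 0; t < topicWordScores.length; t++) {
      System.out.println("topic " + t + ": " + topK(topicWordScores[t], dictionary, 3));
    }
  }
}

Stop-wording (or just dropping very frequent terms from the dictionary)
before this step usually makes the resulting word lists far more readable.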
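And since the <DOC-ID, p(topic|document)> mapper came up above: below is a
very rough sketch of the shape that mapper could take. Only the Hadoop
Mapper plumbing is meant literally; the inference call
(inferTopicDistribution), the input format, and the configuration key are
placeholders, not the real Mahout API. In Mahout the actual inference would
be the work LDAInference already does in the E-step.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of a mapper that emits <doc-id, p(topic|doc)> instead of the
 * sufficient statistics for p(word|topic). Everything other than the
 * Hadoop plumbing is a stand-in for the existing LDA inference code.
 */
public class TopicDistributionMapper extends Mapper<Text, Text, Text, Text> {

  private int numTopics;

  @Override
  protected void setup(Context context) {
    // Hypothetical configuration key; real code would also load the
    // trained LDA model (the p(word|topic) state) here.
    numTopics = context.getConfiguration().getInt("example.lda.numTopics", 20);
  }

  @Override
  protected void map(Text docId, Text termCounts, Context context)
      throws IOException, InterruptedException {
    // Placeholder: run the E-step for this document against the trained
    // model and get back a normalized distribution over topics.
    double[] pTopicGivenDoc = inferTopicDistribution(termCounts.toString(), numTopics);

    // Emit <doc-id, p(topic|doc)> as a comma-separated list; a real
    // implementation would use a proper vector Writable instead.
    StringBuilder sb = new StringBuilder();
    for (int k = 0; k < pTopicGivenDoc.length; k++) {
      if (k > 0) {
        sb.append(',');
      }
      sb.append(pTopicGivenDoc[k]);
    }
    context.write(docId, new Text(sb.toString()));
  }

  /** Stand-in for the real variational inference; returns a uniform distribution. */
  private double[] inferTopicDistribution(String termCounts, int k) {
    double[] dist = new double[k];
    java.util.Arrays.fill(dist, 1.0 / k);
    return dist;
  }
}

That lines up with the "20 lines of logical code plus Hadoop boilerplate"
estimate above; the real work is swapping the stub for the inference that
the existing Mapper already performs.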
