On Mon, Jun 7, 2010 at 5:06 AM, Avishay Livne1 <[email protected]> wrote:
> I modified
> $MAHOUT_HOME/utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
> so the score is printed along with each word, but the interpretation of the
> scores is somewhat obscure.
> I see values in the range of -8 to +6. I assumed the values should
> represent P(word | topic) or log(P(word | topic)), but these values are in
> a different range.
> How should I interpret these values? Is there a simple way to retrieve
> P(word | topic)?

Sorry about that. The scores are log p(word|topic) + constant: they're
normalized online during the E-step, so the serialized values don't need to
be normalized. You can normalize them yourself: for each topic, compute the
log-sum of all of its values (the log of the sum of exp(score) over every
word) and subtract it from each score. The result is log p(word|topic).

> Thanks,
> Avishay.
>
> From: Avishay Livne1/Haifa/i...@ibmil
> To: [email protected]
> Date: 06/06/2010 03:16 PM
> Subject: extract p(doc|topic) from LDA
>
> Hi,
> I'm trying to use LDA for a collaborative filtering task, where I need to
> predict the rating a user (document) will give to a movie (word).
> I ran LDA and constructed T topics, but I can only print the most frequent
> words (movies) per topic.
> Is it possible to extract p(document|topic) or p(word|topic) from LDA's
> output? (document = new user, word = movie).
>
> Best regards,
> Avishay
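
For reference, the normalization described in the reply can be sketched roughly as follows. This is only an illustration in Python, not Mahout code: the function name normalize_log_scores and the sample values are made up, and it assumes you have already read the scores for every word in one topic (not just the top words that get printed) into a list.

```python
import math


def normalize_log_scores(scores):
    """Convert unnormalized log scores (log p + constant) into log-probabilities.

    `scores` must hold the values for ALL words of a single topic; otherwise
    the result is only normalized over the subset that was passed in.
    """
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


# Hypothetical scores for one topic over a tiny five-word vocabulary.
topic_scores = [-8.0, -2.5, 0.0, 3.1, 6.0]

log_probs = normalize_log_scores(topic_scores)  # log p(word|topic)
probs = [math.exp(lp) for lp in log_probs]      # p(word|topic)
```

After this step the values in probs sum to 1 for the topic, and the unknown constant drops out because it is the same for every word in that topic.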
