Re: Scoring a document using LDA topics

2011-11-29 Thread Sujit Pal
Hi Stephen, We precompute a variant of P(z,d) during indexing, and do the first 3 steps. The resulting documents are ordered by payload score, which is basically z in our case. We don't currently care about P(t,z) but it seems like a good thing to have for disambiguation purposes. So anyway, I ha

Re: Scoring a document using LDA topics

2011-11-29 Thread Stephen Thomas
Sujit, Thanks for your reply, and the link to your blog post, which was helpful and got me thinking about Payloads. I still have one more question. I need to be able to compute the Sim(query q, doc d) similarity function, which is defined below: Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z
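[Editor's sketch] The formula in the preview above is cut off after "P(t, z". The following minimal Python sketch assumes, purely for illustration, that each term of the double sum is the product P(t, z) * P(z, d), using the two quantities named elsewhere in this thread. That completion is an assumption, not something the truncated message confirms. The function name `sim` and the dictionary layouts are hypothetical and are not part of Lucene.

```python
from typing import Dict, Iterable, Tuple


def sim(
    query_terms: Iterable[str],
    p_term_topic: Dict[Tuple[str, str], float],
    p_topic_doc: Dict[str, float],
) -> float:
    """Score one document against a query.

    ASSUMPTION: Sim(q, d) = sum_{t in q} sum_{z} P(t, z) * P(z, d).
    p_term_topic maps (term, topic) -> P(t, z).
    p_topic_doc maps topic -> P(z, d) for the document being scored.
    """
    total = 0.0
    for t in query_terms:
        for z, p_zd in p_topic_doc.items():
            # Missing (term, topic) pairs contribute zero to the sum.
            total += p_term_topic.get((t, z), 0.0) * p_zd
    return total


# Made-up probabilities, for illustration only.
example_term_topic = {("apple", "z1"): 0.5, ("apple", "z2"): 0.1, ("pie", "z1"): 0.2}
example_topic_doc = {"z1": 0.5, "z2": 0.5}
example_score = sim(["apple", "pie"], example_term_topic, example_topic_doc)
```

The inner loop runs over the topics of one document, so the cost per document is (number of query terms) x (number of topics with non-zero weight in that document).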

Re: Scoring a document using LDA topics

2011-11-28 Thread Sujit Pal
Hi Stephen, We are doing something similar, and we store as a multifield with each document as (d,z) pairs where we store the z's (scores) as payloads for each d (topic). We have had to build a custom similarity which implements the scorePayload function. So to find docs for a given d (topic), we
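[Editor's sketch] The lookup described in the preview above (find the documents for a given topic, ordered by the score stored with that topic) can be illustrated in plain Python. This is only a stand-in for the idea and does not use Lucene payloads or a custom Similarity. The names `docs_for_topic` and `doc_topics`, and the sample weights, are hypothetical.

```python
from typing import Dict, List, Tuple


def docs_for_topic(
    topic: str,
    doc_topics: Dict[str, Dict[str, float]],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs for documents tagged with `topic`,
    highest score first.

    doc_topics maps doc_id -> {topic: score}; the score plays the role
    that the payload plays in the indexed version.
    """
    scored = [
        (doc_id, topics[topic])
        for doc_id, topics in doc_topics.items()
        if topic in topics
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up topic weights, for illustration only.
example_index = {
    "doc1": {"z1": 0.7, "z2": 0.3},
    "doc2": {"z1": 0.2},
    "doc3": {"z2": 0.9},
}
example_hits = docs_for_topic("z1", example_index)
```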

Scoring a document using LDA topics

2011-11-28 Thread Stephen Thomas
List, I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic model into Lucene. Briefly, the LDA model extracts topics (distributions over words) from a set of documents, and then represents each document as a topic vector. For example, documents could be represented as: d1 = (0,