Hi Stephen,

We precompute a variant of P(z,d) during indexing, and do the first 3 steps. The resulting documents are ordered by payload score, which is basically z in our case. We don't currently care about P(t,z), but it seems like a good thing to have for disambiguation purposes.
So anyway, I have never done what you are looking to do, but I guess the approach you have outlined is the one you would use, although there may be performance issues when you have a large number of topic matches. An alternative: since you need P(t,z) (the probability of each query term belonging to a particular topic), and each PayloadTermQuery in the BooleanQuery corresponds to a z (topic), perhaps you could boost each clause by P(t,z)?

-sujit

On Tue, 2011-11-29 at 10:50 -0500, Stephen Thomas wrote:
> Sujit,
>
> Thanks for your reply, and the link to your blog post, which was
> helpful and got me thinking about payloads.
>
> I still have one more question. I need to be able to compute the
> Sim(query q, doc d) similarity function, which is defined below:
>
> Sim(query q, doc d) = sum_{t in q} sum_{z} P(t,z) * P(z,d)
>
> So, I'm guessing that the only way to do this is the following:
>
> - At index time, store the (flattened) topics as a payload for each
>   document, as you suggest in your blog.
> - At query time, find out which topics are in the query.
> - Construct a BooleanQuery consisting of one PayloadTermQuery per
>   topic in the query.
> - Search on the BooleanQuery. This essentially tells me which
>   documents contain the topics in the query.
> - Iterate over the TopDocs returned by the search. For each doc, get
>   the full payload, unflatten it, and use it to compute
>   Sim(query q, doc d).
> - Reorder the results based on the Sim(query q, doc d) scores.
>
> Is this the best way? I can't see a way to compute the Sim() metric at
> any other time, because in scorePayload() we don't have access to the
> full payload, nor to the query.
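[The boost-each-clause idea can be sanity-checked with plain arithmetic: if each PayloadTermQuery clause scores P(z,d) via its payload, and is boosted by P(t,z), then a summing BooleanQuery reproduces Sim(q,d) for a one-term query. A minimal sketch, with made-up probabilities (all numbers here are hypothetical):]

```java
public class BoostedClauseDemo {

    // Sum of boosted clause scores: each topic clause contributes
    // boost * payloadScore = P(t,z) * P(z,d).
    static double boostedSum(double[] pTZ, double[] pZD) {
        double sum = 0.0;
        for (int z = 0; z < pZD.length; z++) {
            sum += pTZ[z] * pZD[z];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical P(t,z) for a single query term t over 4 topics,
        // and P(z,d) for the example document d1 = (0, 0.5, 0, 0.5).
        double[] pTZ = {0.0, 0.02, 0.0, 0.01};
        double[] pZD = {0.0, 0.5, 0.0, 0.5};
        System.out.println(boostedSum(pTZ, pZD)); // 0.02*0.5 + 0.01*0.5
    }
}
```

[With multiple query terms, each term contributes its own set of boosted clauses and the sums add, matching the double sum in Sim(q,d). In practice the extra factors (coord, queryNorm, norms) would have to be disabled, as Sujit describes, for the scores to stay exact.]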
> Thanks again,
> Steve
>
> On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal <sujit....@comcast.net> wrote:
> > Hi Stephen,
> >
> > We are doing something similar: we store each document's topics in a
> > multifield as (d,z) pairs, where the z's (scores) are stored as
> > payloads for each d (topic). We had to build a custom Similarity that
> > implements the scorePayload function. So to find docs for a given d
> > (topic), we do a simple PayloadTermQuery and the docs come back in
> > descending order of z. Simple boolean term queries also work. We turn
> > off norms (in the ctor for the PayloadTermQuery) to get scores that
> > are identical to the z values.
> >
> > I wrote about this some time back... maybe this will help you:
> > http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html
> >
> > -sujit
> >
> > On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
> >> List,
> >>
> >> I am trying to incorporate the Latent Dirichlet Allocation (LDA)
> >> topic model into Lucene. Briefly, the LDA model extracts topics
> >> (distributions over words) from a set of documents, and then
> >> represents each document with topic vectors. For example, documents
> >> could be represented as:
> >>
> >> d1 = (0, 0.5, 0, 0.5)
> >> d2 = (1, 0, 0, 0)
> >>
> >> This means that document d1 contains topics 2 and 4, and document d2
> >> contains topic 1. I.e.,
> >>
> >> P(z1, d1) = 0
> >> P(z2, d1) = 0.5
> >> P(z3, d1) = 0
> >> P(z4, d1) = 0.5
> >> P(z1, d2) = 1
> >> P(z2, d2) = 0
> >> ...
> >>
> >> Also, topics are represented by the probability that a term appears
> >> in that topic, so we also have a set of vectors:
> >>
> >> z1 = (0, 0, .02, ...)
> >>
> >> meaning that topic z1 does not contain terms 1 or 2, but does contain
> >> term 3. I.e.,
> >>
> >> P(t1, z1) = 0
> >> P(t2, z1) = 0
> >> P(t3, z1) = .02
> >> ...
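[The flatten/unflatten step that both schemes rely on can be as simple as packing the per-topic scores into a byte array for the Lucene payload. A minimal sketch; the method names and 4-bytes-per-float layout are my assumptions, not the scheme from the blog post:]

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TopicPayload {

    // Flatten a per-document topic vector P(z,d) into bytes suitable
    // for storing as a payload: 4 bytes per topic score.
    static byte[] flatten(float[] topicScores) {
        ByteBuffer buf = ByteBuffer.allocate(4 * topicScores.length);
        for (float score : topicScores) {
            buf.putFloat(score);
        }
        return buf.array();
    }

    // Recover the topic vector from the raw payload bytes at query time.
    static float[] unflatten(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        float[] scores = new float[payload.length / 4];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = buf.getFloat();
        }
        return scores;
    }

    public static void main(String[] args) {
        float[] d1 = {0f, 0.5f, 0f, 0.5f}; // the d1 example from the thread
        float[] roundTrip = unflatten(flatten(d1));
        System.out.println(Arrays.equals(d1, roundTrip)); // true
    }
}
```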
> >>
> >> Then, the similarity between a query and a document is computed as:
> >>
> >> Sim(query q, doc d) = sum_{t in q} sum_{z} P(t,z) * P(z,d)
> >>
> >> Basically, for each term in the query and each topic in existence,
> >> see how relevant that term is to that topic, and how relevant that
> >> topic is to the document.
> >>
> >> I've been thinking about how to do this in Lucene. Assume I already
> >> have the topics and the topic vectors for each document. I know that
> >> I need to write my own Similarity class that extends
> >> DefaultSimilarity. I need to override tf(), queryNorm(), coord(), and
> >> computeNorm() to all return a constant 1, so that they have no
> >> effect. Then, I can override idf() to compute the Sim equation above.
> >> Seems simple enough. However, I have a few practical issues:
> >>
> >> - Storing the topic vectors for each document. Can I store these in
> >>   the index somehow? If so, how do I retrieve them later in my
> >>   CustomSimilarity class?
> >> - Changing the Boolean model. Instead of computing the similarity
> >>   only for documents that contain at least one of the query terms
> >>   (the default behavior), I need to compute the similarity for all of
> >>   the documents. (This is the whole idea behind LDA: you don't need
> >>   an exact term match for there to be a similarity.) I understand
> >>   that this will result in a performance hit, but I do not see a way
> >>   around it.
> >> - Turning off fieldNorm(). How can I set the field norm for each doc
> >>   to a constant 1?
> >>
> >> Any help is greatly appreciated.
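[The Sim() formula itself is straightforward to evaluate once P(t,z) and P(z,d) are in hand, whether in idf() or in a post-search re-ranking pass. A self-contained sketch of the computation; the names are mine, and the matrices would come from the trained LDA model:]

```java
public class LdaSim {

    // Sim(q,d) = sum_{t in q} sum_{z} P(t,z) * P(z,d).
    // pTZ[t][z] holds P(t,z); pZD[z] holds P(z,d) for one document.
    static double sim(int[] queryTerms, double[][] pTZ, double[] pZD) {
        double score = 0.0;
        for (int t : queryTerms) {
            for (int z = 0; z < pZD.length; z++) {
                score += pTZ[t][z] * pZD[z];
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Example from the thread: topic z1 = (0, 0, .02, ...) gives
        // P(t3,z1) = 0.02, and document d2 = (1, 0, 0, 0) gives
        // P(z1,d2) = 1; the remaining entries are zero-filled.
        double[][] pTZ = {
            {0.0, 0.0, 0.0, 0.0},  // t1
            {0.0, 0.0, 0.0, 0.0},  // t2
            {0.02, 0.0, 0.0, 0.0}, // t3
        };
        double[] pZD2 = {1.0, 0.0, 0.0, 0.0}; // d2
        // Query containing only t3: Sim = P(t3,z1) * P(z1,d2).
        System.out.println(sim(new int[]{2}, pTZ, pZD2));
    }
}
```

[Note that d2 has no term overlap requirement: any document whose topic vector overlaps a query term's topics gets a nonzero score, which is the "no exact term match needed" property described above.]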
> >>
> >> Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org