Hi Stephen,

We precompute a variant of P(z,d) during indexing and do the first three
steps you describe. The resulting documents are ordered by payload score,
which is basically z in our case. We don't currently use P(t,z), but it
seems like a good thing to have for disambiguation purposes.
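
For what it's worth, here is a minimal sketch of how the index-time side
could be wired up with Lucene's DelimitedPayloadTokenFilter (not our
actual code, just an illustration; the "topics" field name and the
flattened "topic|score" text are made up):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;

// Analyzer for a "topics" field whose text is the flattened P(z,d)
// vector, e.g. "z2|0.5 z4|0.5": each token becomes a topic term and the
// float after the '|' is stored as that term's payload.
public class TopicPayloadAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream tokens = new WhitespaceTokenizer(reader);
    return new DelimitedPayloadTokenFilter(tokens, '|', new FloatEncoder());
  }
}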

Anyway, I have never done exactly what you are looking to do, but I think
the approach you have outlined is the one I would use, although there may
be performance issues when a query matches a large number of topics.
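
Roughly, the rerank step in your outline might look something like the
sketch below. It assumes the flattened topic vector is also kept in a
stored field, here called "topicVector", that unflatten() parses it into
a map of P(z,d) values, and that queryTopicWeights maps each topic z to
P(t,z) aggregated over the query terms -- all of these names are made up:

import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

// Rerank the BooleanQuery hits by Sim(q,d); queryTopicWeights maps each
// topic z to sum_{t in q} P(t,z).
public ScoreDoc[] rerank(IndexSearcher searcher, Query booleanQuery,
    Map<String,Float> queryTopicWeights) throws Exception {
  ScoreDoc[] hits = searcher.search(booleanQuery, 100).scoreDocs;
  for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    // P(z,d) for this document, parsed from the stored topic vector
    // (unflatten() is a hypothetical helper)
    Map<String,Float> pzd = unflatten(doc.get("topicVector"));
    float sim = 0f;
    for (Map.Entry<String,Float> tz : queryTopicWeights.entrySet()) {
      Float zd = pzd.get(tz.getKey());
      if (zd != null) sim += tz.getValue() * zd;  // P(t,z) * P(z,d)
    }
    hit.score = sim;  // overwrite the Lucene score with Sim(q,d)
  }
  // re-sort the hits by descending Sim(q,d)
  Arrays.sort(hits, new Comparator<ScoreDoc>() {
    public int compare(ScoreDoc a, ScoreDoc b) {
      return Float.compare(b.score, a.score);
    }
  });
  return hits;
}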

An alternative: since you need to know P(t,z) (the probability of the
query terms being in a particular topic), and each PayloadTermQuery in
the BooleanQuery corresponds to a z (topic), perhaps you could boost
each clause by P(t,z)?
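
Something along these lines, perhaps (again just a sketch; queryTopics,
mapping each topic z found in the query to its aggregated P(t,z), is a
made-up name):

import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

// Build one PayloadTermQuery clause per topic z in the query, boosted by
// P(t,z); with payload-only scoring each clause then contributes roughly
// P(t,z) * P(z,d) (modulo queryNorm and coord).
public Query buildTopicQuery(Map<String,Float> queryTopics) {
  BooleanQuery query = new BooleanQuery();
  for (Map.Entry<String,Float> entry : queryTopics.entrySet()) {
    PayloadTermQuery clause = new PayloadTermQuery(
        new Term("topics", entry.getKey()),
        new AveragePayloadFunction(),
        false);                        // includeSpanScore=false: payload only
    clause.setBoost(entry.getValue()); // boost the clause by P(t,z)
    query.add(clause, Occur.SHOULD);
  }
  return query;
}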

-sujit

On Tue, 2011-11-29 at 10:50 -0500, Stephen Thomas wrote:
> Sujit,
> 
> Thanks for your reply, and the link to your blog post, which was
> helpful and got me thinking about Payloads.
> 
> I still have one more question. I need to be able to compute the
> Sim(query q, doc d) similarity function, which is defined below:
> 
> Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
> 
> So, I'm guessing that the only way to do this is the following:
> 
> - At index time, store the (flattened) topics as a payload for each
> document, as you suggest in your blog
> 
> - At query time, find out which topics are in the query
> - Construct a BooleanQuery, consisting of one PayloadTermQuery per
> topic in the query
> - Search on the BooleanQuery. This essentially tells me which
> documents have the topics in the query
> - Iterate over the TopDocs returned by the search. For each doc, get
> the full payload, unflatten it, and use it to compute Sim(query q,
> doc d).
> - Reorder the results based on Sim(query q, doc d).
> 
> Is this the best way? I can't see a way to compute the Sim() metric at
> any other time, because in scorePayload() we have access neither to the
> full payload nor to the query.
> 
> Thanks again,
> Steve
> 
> 
> On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal <sujit....@comcast.net> wrote:
> > Hi Stephen,
> >
> > We are doing something similar: we store each document's topics in a
> > multivalued field of (d,z) pairs, where the z's (scores) are stored as
> > payloads for each d (topic). We had to build a custom Similarity which
> > implements the scorePayload function. So to find docs for a given d
> > (topic), we do a simple PayloadTermQuery and the docs come back in
> > descending order of z. Simple boolean term queries also work. We turn
> > off norms (in the ctor for the PayloadTermQuery) to get scores that are
> > identical to the z values.
> >
> > I wrote about this some time back; maybe it will help you.
> > http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html
> >
> > -sujit
> >
> > On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
> >> List,
> >>
> >> I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
> >> model into Lucene. Briefly, the LDA model extracts topics
> >> (distributions over words) from a set of documents, and then represents
> >> each document as a vector of topic proportions. For example, documents
> >> could be represented as:
> >>
> >> d1 = (0,  0.5, 0, 0.5)
> >>
> >> d2 = (1, 0, 0, 0)
> >>
> >> This means that document d1 contains topics 2 and 4, and document d2
> >> contains topic 1. I.e.,
> >>
> >> P(z1, d1) = 0
> >> P(z2, d1) = 0.5
> >> P(z3, d1) = 0
> >> P(z4, d1) = 0.5
> >> P(z1, d2) = 1
> >> P(z2, d2) = 0
> >> ...
> >>
> >> Also, topics are represented by the probability that a term appears in
> >> that topic, so we also have a set of vectors:
> >>
> >> z1 = (0, 0, .02, ...)
> >>
> >> meaning that topic z1 does not contain terms 1 or 2, but does contain
> >> term 3. I.e.,
> >>
> >> P(t1, z1) = 0
> >> P(t2, z1) = 0
> >> P(t3, z1) = .02
> >> ...
> >>
> >> Then, the similarity between a query and a document is computed as:
> >>
> >> Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
> >>
> >> Basically, for each term in the query, and each topic in existence,
> >> see how relevant that term is in that topic, and how relevant that
> >> topic is in the document.
> >>
> >>
> >> I've been thinking about how to do this in Lucene. Assume I already
> >> have the topics and the topic vectors for each document. I know that I
> >> need to write my own Similarity class that extends DefaultSimilarity.
> >> I need to override tf(), queryNorm(), coord(), and computeNorm() to
> >> all return a constant 1, so that they have no effect. Then, I can
> >> override idf() to compute the Sim equation above. Seems simple enough.
> >> However, I have a few practical issues:
> >>
> >>
> >> - Storing the topic vectors for each document. Can I store these in
> >> the index somehow? If so, how do I retrieve them later in my
> >> CustomSimilarity class?
> >>
> >> - Changing the Boolean model. Instead of only computing the similarity
> >> on documents that contain at least one of the terms in the query (the
> >> default behavior), I need to compute the similarity on all of the
> >> documents. (This is the whole idea behind LDA: you don't need an exact
> >> term match for there to be a similarity.) I understand that this will
> >> result in a performance hit, but I do not see a way around it.
> >>
> >> - Turning off fieldNorm(). How can I set the field norm for each doc
> >> to a constant 1?
> >>
> >>
> >> Any help is greatly appreciated.
> >>
> >> Steve
> >>
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
