jian chen <[EMAIL PROTECTED]> writes:

> Just to continue this discussion. I think right now Lucene's retrieval
> algorithm is based purely on Vector Space Model, which is simple and
> efficient.

As I understand it, it's indeed a tf-idf vector space approach, except
that the queries are structured.  The tf-idf weights are totaled as a
straight cosine among the siblings of a BooleanQuery, but other query
nodes may do things differently.  For example, I haven't read the code,
but I assume a PhraseQuery requires all terms to be present and
adjacent before they contribute to the score.

There is also a document-specific boost factor in the equation, which
is essentially a hook for document-level things like recency,
PageRank, etc.

You can tweak this by defining custom Similarity classes which can say
what the tf, idf, norm, and boost mean.  You can also affect the
term normalization at the query end in BooleanScorer (I think? through
the sumOfSquares method?).
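
To make the tf / idf / norm / boost hooks concrete, here is a minimal
sketch of how such pluggable pieces might combine into one score.
Everything in it is an assumption for illustration: the class
PluggableScoring, the Sim interface, and the particular formulas are
hypothetical and are not Lucene's actual Similarity API.

```java
import java.util.Map;

// Hypothetical sketch: names and formulas are illustrative only and do
// NOT reproduce Lucene's Similarity class.
public class PluggableScoring {

    /** The tunable pieces of the score. */
    public interface Sim {
        double tf(int freq);                   // within-document term frequency weight
        double idf(int docFreq, int numDocs);  // collection-level term rarity weight
        double lengthNorm(int numTerms);       // document length normalization
    }

    /** One possible choice: sqrt tf, log idf, 1/sqrt(length). */
    public static final Sim DEFAULT = new Sim() {
        public double tf(int freq) { return Math.sqrt(freq); }
        public double idf(int docFreq, int numDocs) {
            return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        }
        public double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }
    };

    /**
     * Scores one document against a flat list of query terms.
     * The query-side normalization divides by the square root of the
     * sum of squared term weights, so scores are comparable across queries.
     */
    public static double score(Sim sim, String[] query,
                               Map<String, Integer> docTermFreqs, int docLength,
                               Map<String, Integer> docFreqs, int numDocs,
                               double docBoost) {
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (String term : query) {
            double idf = sim.idf(docFreqs.getOrDefault(term, 0), numDocs);
            sumOfSquares += idf * idf;
            Integer freq = docTermFreqs.get(term);
            if (freq != null) {
                // idf appears once for the query weight and once for the document weight
                sum += sim.tf(freq) * idf * idf;
            }
        }
        double queryNorm = sumOfSquares > 0.0 ? 1.0 / Math.sqrt(sumOfSquares) : 1.0;
        return sum * queryNorm * sim.lengthNorm(docLength) * docBoost;
    }
}
```

Swapping in a different Sim changes what tf, idf, and norm mean
without touching the scoring loop, which is the kind of hook I mean.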

We've implemented something kind of like the Similarity class but
based on a model which describes a larger family of "similarity
functions".  (For the curious or similarly IR-geeky, it's from Justin
Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
need more general hooks than the Lucene Similarity provides.  I think
those hooks might exist, but I'm not sure I know which classes they're
in.

I'm also interested in things like relevance feedback which can affect
term weights as well as adding terms to the query... just how many
places in the code do I have to subclass or change?
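
For concreteness, the kind of feedback I have in mind looks roughly
like the sketch below: a positive-only, Rocchio-style update that
reweights the original terms and adds new ones from documents judged
relevant.  This is a standalone illustration, not Lucene code; the
class RocchioSketch, the expand method, and the alpha/beta parameters
are all assumptions of mine.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of positive-only Rocchio-style feedback.
// Queries and documents are plain term -> weight maps.
public class RocchioSketch {

    /**
     * Returns alpha * query + beta * (centroid of the relevant documents).
     * Terms that occur only in the relevant documents are added to the query.
     */
    public static Map<String, Double> expand(Map<String, Double> query,
                                             List<Map<String, Double>> relevantDocs,
                                             double alpha, double beta) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : query.entrySet()) {
            out.put(e.getKey(), alpha * e.getValue());
        }
        if (relevantDocs.isEmpty()) {
            return out; // nothing to learn from
        }
        double share = beta / relevantDocs.size();
        for (Map<String, Double> doc : relevantDocs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                // accumulate each document's contribution to the centroid
                out.merge(e.getKey(), share * e.getValue(), Double::sum);
            }
        }
        return out;
    }
}
```

The result is again a bag of weighted terms, so the open question is
where such per-term weights can be fed back into the query classes.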

It's clear that if I'm interested in a completely different model,
such as language modeling, working directly against the IndexReader is
the way to go.  In that case, what parts of the Lucene class structure
should I adapt to maintain the incremental-results-return, inverted
list skips, and other features which make the inverted search fast?

Ian


