jian chen <[EMAIL PROTECTED]> writes:
> Just to continue this discussion. I think right now Lucene's retrieval
> algorithm is based purely on Vector Space Model, which is simple and
> efficient.
As I understand it, it is indeed a tf-idf vector space approach, except that the queries are structured. The tf-idf weights are totaled as a straight cosine among the siblings of a BooleanQuery, but other query nodes may do things differently. For example, I haven't read it, but I assume a PhraseQuery requires all of its terms to be present and adjacent before it contributes to the score.

There is also a document-specific boost factor in the equation, which is essentially a hook for document-level signals such as recency, PageRank, and so on. You can tweak all of this by defining a custom Similarity class, which says what the tf, idf, norm, and boost mean. You can also affect the term normalization at the query end in BooleanScorer (I think, through the sumOfSquares method?).

We've implemented something like the Similarity class, but based on a model that describes a larger family of "similarity functions". (For the curious or similarly IR-geeky, it's from Justin Zobel's paper from a few years ago in SIGIR Forum.) Essentially, I need more general hooks than the Lucene Similarity provides. I think those hooks might exist, but I'm not sure which classes they're in.

I'm also interested in things like relevance feedback, which can change term weights as well as add terms to the query. Just how many places in the code do I have to subclass or change?

It's clear that if I'm interested in a completely different model, such as language modeling, the IndexReader is the way to go. In that case, which parts of the Lucene class structure should I adapt to keep the incremental results return, the inverted-list skips, and the other features that make inverted search fast?

Ian
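
For concreteness, the "tf-idf cosine with a document boost" scoring described above can be sketched roughly as follows. This is plain Python, not Lucene code, and it does not use or mirror any Lucene class. The function names (`idf`, `tfidf_vector`, `cosine`, `score`), the square-root tf, and the smoothed logarithmic idf are all illustrative assumptions; a different similarity function would swap in different definitions at exactly those points.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed inverse document frequency (an assumed formula, for illustration)."""
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return 1.0 + math.log(len(docs) / (df + 1))


def tfidf_vector(tokens, docs):
    """Map each distinct token to a tf-idf weight; tf is dampened with a square root."""
    counts = Counter(tokens)
    return {t: math.sqrt(c) * idf(t, docs) for t, c in counts.items()}


def cosine(q, d):
    """Cosine of the angle between two sparse weight vectors stored as dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    d_norm = math.sqrt(sum(w * w for w in d.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)


def score(query_tokens, doc_tokens, docs, boost=1.0):
    """Cosine similarity scaled by a per-document boost (e.g. recency or link rank)."""
    q_vec = tfidf_vector(query_tokens, docs)
    d_vec = tfidf_vector(doc_tokens, docs)
    return boost * cosine(q_vec, d_vec)


# A tiny hypothetical collection: each document is a list of tokens.
docs = [
    ["lucene", "search", "index"],
    ["vector", "space", "model"],
    ["lucene", "vector"],
]

if __name__ == "__main__":
    for i, doc in enumerate(docs):
        print(i, score(["lucene", "vector"], doc, docs))
```

The hooks asked about in the message correspond to the places where this sketch hard-codes a choice: the tf dampening, the idf formula, the length normalization inside `cosine`, and the boost multiplier.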