Doug Cutting wrote:
Chuck Williams wrote:
Another issue will likely be the tf() and idf() computations. I have a similar desired relevance ranking and was not getting what I wanted due to the idf() term dominating the score. [ ... ]
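To make the concern concrete, here is a minimal sketch of how idf() can swamp tf(). It assumes the square-root tf and log-based idf forms used by DefaultSimilarity, and that idf enters each term's score twice (once through the query weight, once through the field weight). Boosts, length norms, coord and query normalization are left out, and the document frequencies are invented purely for illustration.

```python
import math


def tf(freq):
    # Assumed DefaultSimilarity form: square root of the in-document frequency.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Assumed DefaultSimilarity form: log(numDocs / (docFreq + 1)) + 1.
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def term_score(freq, doc_freq, num_docs):
    # Simplified per-term contribution: tf * idf^2 (all other factors omitted).
    return tf(freq) * idf(doc_freq, num_docs) ** 2


# Hypothetical collection statistics, chosen only for illustration.
NUM_DOCS = 1_000_000

# A rare term that occurs once in the document.
rare_once = term_score(freq=1, doc_freq=10, num_docs=NUM_DOCS)

# A common term that occurs 25 times in the document.
common_many = term_score(freq=25, doc_freq=100_000, num_docs=NUM_DOCS)

print(f"rare term, 1 occurrence:     {rare_once:.1f}")
print(f"common term, 25 occurrences: {common_many:.1f}")
```

Under these assumptions, a single occurrence of the rare term outscores 25 occurrences of the common one, which is the kind of idf dominance described above.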
Chuck has made a series of criticisms of the DefaultSimilarity implementation. Unfortunately, these are difficult to evaluate quickly, since doing so requires relevance judgements. Still, we should consider modifying DefaultSimilarity for the 2.0 release if there are easy improvements to be had. But how do we decide what's better?
Perhaps we should perform a formal or semi-formal evaluation of various Similarity implementations on a reference collection. For example, for a formal evaluation we might use one of the TREC Web collections, which have associated queries and relevance judgements. Or, less formally, we could use a crawl of the ~5M pages in DMOZ (I would be glad to collect these using Nutch).
This could work as follows:
-- Different folks could download and index a reference collection, offering demonstration search systems. We would provide complete code. These would differ only in their Similarity implementation. All implementations would use the same Analyzer and search only a single field.
-- These folks could then announce their candidate implementations and let others run queries against them, via HTTP. Different Similarity implementations could thus be publicly and interactively compared.
-- Hopefully a consensus, or at least a healthy majority, would agree on which was the best implementation and we could make that the default for Lucene 2.0.
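If relevance judgements are available (as with the TREC collections), the candidates could also be compared mechanically rather than only by eye. The following is a rough sketch, not a proposal for the actual harness: it computes mean average precision for two hypothetical runs. The query ids, document ids and judgements are made up, and each ranking is assumed to contain no duplicate document ids.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document was found.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, judgements):
    """Mean of the per-query average precision over all judged queries."""
    scores = [
        average_precision(run.get(query_id, []), relevant)
        for query_id, relevant in judgements.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical judgements and the rankings produced by two Similarity candidates.
judgements = {"q1": {"d1", "d3"}, "q2": {"d7"}}
run_a = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d8"]}
run_b = {"q1": ["d2", "d1", "d3"], "q2": ["d8", "d7"]}

print("candidate A:", mean_average_precision(run_a, judgements))
print("candidate B:", mean_average_precision(run_b, judgements))
```

Each candidate would be run over the same queries with the same Analyzer, so that any difference in the numbers comes from the Similarity alone.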
Are there folks (e.g., Chuck) who would be willing to play this game?
I can prob play the game and offer resources, esp if disk space needed is not many GB...1GB is fine. I'm just not clear on how many people you need participating - one person per Similarity proposal? I do not have a Similarity proposal myself...
Should we make it more formal, using, e.g., TREC? Does anyone have other ideas how we should decide how to modify DefaultSimilarity?
Doug
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]