Hi,

With a well-mixed set of distributed indices, not having distributed/global IDF won't hurt much. But what if one has a not-so-well-mixed set of shards? One might want to apply rules when assigning documents to shards in order to group certain types of documents into only a subset of the shards instead of having them spread across all of them. Such careful sharding might allow the searcher to be smarter about which shards to search, based on the query, the client running it, etc.
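
To make the concern concrete, here is a minimal sketch of how per-shard idf can drift from a global idf when a term is concentrated in one shard. It assumes the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)); the class name, method names, and shard numbers are made up for illustration and are not Solr APIs.

```java
public class DistributedIdfSketch {

    // Classic Lucene-style idf (assumed here): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // "Global" idf: sum docFreq and numDocs over all shards, then apply
    // the same formula, as if everything lived in a single index.
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        long totalDocFreq = 0;
        long totalNumDocs = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            totalDocFreq += docFreqs[i];
            totalNumDocs += numDocs[i];
        }
        return idf(totalDocFreq, totalNumDocs);
    }

    public static void main(String[] args) {
        // Hypothetical, deliberately skewed shards: the term is very common
        // in shard 0 and very rare in shard 1.
        long[] docFreqs = {9000, 10};
        long[] numDocs = {10000, 10000};

        for (int i = 0; i < docFreqs.length; i++) {
            System.out.println("shard " + i + " local idf = " + idf(docFreqs[i], numDocs[i]));
        }
        System.out.println("global idf = " + globalIdf(docFreqs, numDocs));
    }
}
```

With numbers like these, the same term is weighted very differently on each shard, so scores merged from the two shards are not directly comparable unless the shards share the summed statistics.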

Thus, I've run through the comments on SOLR-303 to see what has been said about distributed IDF. Here is what I extracted:

"## I'm not quite sure about GlobalCollectionStat. Is its purpose just to normalize weights from the shards?"

"It's to make a distributed search score the same as it would if everything was in a single index. idf (inverse document frequency) is part of the scoring, so that component essentially does a distributed idf."

"...distributed idf... this has a performance cost, and should matter little in a well mixed index."

So, I'd like to see what it would take to add distributed IDF info to Solr's distributed search. Here are some questions to get the discussion going:

- Is anyone already working on it?
- Does anyone plan on working on it in the very near future?
- Does anyone already have thoughts on how and where dist. idf could be plugged in?
- There is a mention of dist. idf and its performance cost up there. Any idea how costly dist. idf would be?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch