When I enabled exact global IDF in Solr, the speed penalty was about 10X. Back in 1995, Infoseek figured out how to do that with no speed penalty. They patented it, but that patent expired several years ago. I’ll try to hunt it down.
Short version: from each shard, return the number of docs and the df for each term. When combining results, add all the DF, add all the NUMDOCS, divide, and you have the global IDF. This is constant for the whole result list. Each shard already needs that info for local score, so it shouldn’t be extra work.

When does this matter? When the relevant documents for a term are mostly on one shard, either intentionally or accidentally. Let’s say we have a news search and all the stories for August 2024 are on one shard. The term “kamala” will be much more common on that shard, giving a lower IDF, but…the relevant documents are probably on that shard. So the best documents have a lower score using local IDF.

This also shows up with lots of shards or small shards, because there will be uneven distribution of docs. When I retired from LexisNexis, we had a cluster with 320 shards. I’m sure that had some interesting IDF behavior.

I wrote up how we did this in a Java distributed search layer for Ultraseek:

https://observer.wunderwood.org/2007/04/04/progressive-reranking/

There is some earlier discussion here:

https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf

I don’t think there is a Jira issue for this.

I think that is all the unfinished business since putting Solr 1.3 into production at Netflix. Pretty darned good job everybody. Huge thanks to all the contributors and committers who have put in years of effort over that time.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
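
For anyone who wants to see the merge step from the short version above in concrete form, here is a minimal sketch in Python. It is not the Ultraseek or Solr code. The ShardStats class, the function names, and the shard numbers are all made up for illustration, and the plain log(N/df) formula is an assumption standing in for whatever IDF formula the similarity actually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Stats one shard returns with its results (hypothetical shape)."""
    num_docs: int
    doc_freq: dict[str, int]  # term -> number of docs on this shard containing it


def idf(num_docs: int, df: int) -> float:
    # Plain log(N / df): an assumption for illustration, not Solr's exact formula.
    return math.log(num_docs / df) if df > 0 else 0.0


def global_idf(shards: list[ShardStats], term: str) -> float:
    # Add all the NUMDOCS, add all the DF, then divide inside the IDF formula.
    total_docs = sum(s.num_docs for s in shards)
    total_df = sum(s.doc_freq.get(term, 0) for s in shards)
    return idf(total_docs, total_df)


# Made-up numbers: the first shard holds the August 2024 stories,
# so "kamala" is far more common there than on the other two.
shards = [
    ShardStats(num_docs=1000, doc_freq={"kamala": 200}),
    ShardStats(num_docs=1000, doc_freq={"kamala": 5}),
    ShardStats(num_docs=1000, doc_freq={"kamala": 5}),
]

for i, shard in enumerate(shards):
    local = idf(shard.num_docs, shard.doc_freq.get("kamala", 0))
    print(f"shard {i}: local idf = {local:.3f}")
print(f"global idf = {global_idf(shards, 'kamala'):.3f}")
```

With these numbers the local IDF on the August shard comes out lower than the global value, and the other shards come out higher, which is the skew described above. Using the one global value for every shard's results removes it.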