Unfinished Business: Fast Global IDF

Walter Underwood Tue, 27 Aug 2024 12:01:59 -0700

When I enabled global exact IDF in Solr, the speed penalty was about 10X. 
Back in 1995, Infoseek figured out how to do that with no speed penalty. They 
patented it, but that patent expired several years ago. I’ll try and hunt it 
down.


Short version: from each shard, return the number of docs and the DF for each 
query term. When combining results, add all the DF, add all the NUMDOCS, divide 
(and take the log, or whatever your IDF formula does with that ratio), and you 
have the global IDF. This is constant for the whole result list. Each shard 
already needs that info for its local score, so it shouldn’t be extra work.
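
Here is a minimal sketch of that merge step in Python. It is only an 
illustration, not Solr or Infoseek code: the ShardStats class, the 
global_idf function, and the plain log(N / df) formula are all assumptions 
made for the example. A real scorer would plug the summed counts into its 
own IDF formula.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Per-shard stats returned with the results (hypothetical shape)."""
    num_docs: int   # documents on this shard
    df: dict        # query term -> document frequency on this shard


def global_idf(shards, terms):
    """Sum NUMDOCS and DF across shards, then compute one IDF per term.

    Assumes the classic log(N / df) form. A term that no shard
    contains gets an IDF of 0.0 here, which is an arbitrary choice.
    """
    total_docs = sum(s.num_docs for s in shards)
    total_df = Counter()
    for s in shards:
        for t in terms:
            total_df[t] += s.df.get(t, 0)
    return {
        t: math.log(total_docs / total_df[t]) if total_df[t] else 0.0
        for t in terms
    }
```

The merged value depends only on the summed counts, so it is the same for 
every document in the combined result list.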

When does this matter? When the relevant documents for a term are mostly on one 
shard, either intentionally or accidentally. Let’s say we have a news search 
and all the stories for August 2024 are on one shard. The term “kamala” will be 
much more common on that shard, giving a lower IDF, but…the relevant documents 
are probably on that shard. So the best documents have a lower score using 
local IDF.
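
To put rough numbers on that, here is a small Python sketch. The shard 
names and counts are made up, and the log(N / df) formula is an assumed 
stand-in for whatever the scorer really uses.

```python
import math


def idf(num_docs, df):
    """Classic log(N / df); an assumed formula, not Lucene's exact one."""
    return math.log(num_docs / df)


# Made-up counts: "aug2024" holds the recent stories, "older" does not.
shards = {
    "aug2024": {"num_docs": 1000, "df": 400},
    "older": {"num_docs": 1000, "df": 5},
}

# Local IDF: each shard only sees its own counts.
local = {name: idf(s["num_docs"], s["df"]) for name, s in shards.items()}

# Global IDF: sum the counts first, then apply the same formula once.
total_docs = sum(s["num_docs"] for s in shards.values())
total_df = sum(s["df"] for s in shards.values())
shared = idf(total_docs, total_df)

for name in shards:
    print(f"{name}: local idf {local[name]:.2f}, global idf {shared:.2f}")
```

With these counts the local IDF on the "aug2024" shard is far lower than on 
the "older" shard, so a stray mention on the older shard can outrank the 
stories that are actually about the term. The global value sits between the 
two and is applied to both shards equally.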

This also shows up with lots of shards or small shards, because there will be 
uneven distribution of docs. When I retired from LexisNexis, we had a cluster 
with 320 shards. I’m sure that had some interesting IDF behavior.

I wrote up how we did this in a Java distributed search layer for Ultraseek: 
https://observer.wunderwood.org/2007/04/04/progressive-reranking/

There is some earlier discussion here: 
https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf

I don’t think there is a Jira issue for this.

I think that is all the unfinished business since putting Solr 1.3 into 
production at Netflix. Pretty darned good job everybody. Huge thanks to all the 
contributors and committers who have put in years of effort over that time.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
