Re: Unfinished Business: Fast Global IDF

Walter Underwood Wed, 28 Aug 2024 10:31:20 -0700

I’ve never been in that part of the code, but it feels like it could have a 
small blast radius. We already have an interface for global IDF, so calculating 
it differently shouldn’t be huge. It does need a change in the shard response 
format.
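
Here is a rough sketch of the merge step from my earlier message (quoted 
below), in Python just for illustration. The names ShardStats, global_idf, 
and local_idf are made up for this example and are not Solr APIs, and the 
plain log(N/df) formula is an assumption. The actual similarity in use (BM25, 
for instance) has its own IDF formula, but the summing works the same way.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShardStats:
    """Hypothetical per-shard payload: doc count plus DF per query term."""
    num_docs: int
    doc_freq: Dict[str, int]


def idf(num_docs: int, df: int) -> float:
    # Plain log(N/df); an assumption, not Solr's actual formula.
    # A term with df == 0 matches nothing, so its weight is irrelevant.
    if df == 0:
        return 0.0
    return math.log(num_docs / df)


def local_idf(shard: ShardStats, term: str) -> float:
    """IDF as a single shard would compute it from its own index."""
    return idf(shard.num_docs, shard.doc_freq.get(term, 0))


def global_idf(shards: List[ShardStats], term: str) -> float:
    """Add all the DF, add all the NUMDOCS, divide."""
    total_docs = sum(s.num_docs for s in shards)
    total_df = sum(s.doc_freq.get(term, 0) for s in shards)
    return idf(total_docs, total_df)


# Made-up numbers: one shard holds the August 2024 news stories.
august = ShardStats(num_docs=1000, doc_freq={"kamala": 200})
older_a = ShardStats(num_docs=1000, doc_freq={"kamala": 5})
older_b = ShardStats(num_docs=1000, doc_freq={"kamala": 5})
shards = [august, older_a, older_b]

print("local, august shard:", local_idf(august, "kamala"))
print("local, older shard: ", local_idf(older_a, "kamala"))
print("global:             ", global_idf(shards, "kamala"))
```

With these made-up numbers, the August shard's local IDF is the lowest of the 
three, so its documents are penalized exactly where the relevant ones live. 
The global value is a single number applied to the whole merged result list.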


It wouldn’t hurt to return DF in the response to regular clients. That would 
help with distributed search across collections, clusters, or even different 
kinds of engines. We did that ages ago at Verity with a SOAP interface (yuk). 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 27, 2024, at 8:10 PM, David Smiley <dsmi...@apache.org> wrote:
> 
> Thanks for sharing Walter!  I hope someone enterprising tackles it.
> It'd be nice to have global IDF by default without having to go enable
> something that adds a performance risk.
> 
> I'm sure you have many career stories to tell.  If you find yourself
> at Acadia National Park hiking & backpacking, as you like to do, shoot
> me a message. :-D
> 
> ~ David
> 
> On Tue, Aug 27, 2024 at 3:01 PM Walter Underwood <wun...@wunderwood.org> 
> wrote:
>> 
>> When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. 
>> Back in 1995, Infoseek figured out how to do that with no speed penalty. 
>> They patented it, but that patent expired several years ago. I’ll try and 
>> hunt it down.
>> 
>> Short version, from each shard return the number of docs and the df for each 
>> term. When combining results, add all the DF, add all the NUMDOCS, divide, 
>> and you have the global IDF. This is constant for the whole result list. 
>> Each shard already needs that info for local score, so it shouldn’t be extra 
>> work.
>> 
>> When does this matter? When the relevant documents for a term are mostly on 
>> one shard, either intentionally or accidentally. Let’s say we have a news 
>> search and all the stories for August 2024 are on one shard. The term 
>> “kamala” will be much more common on that shard, giving a lower IDF, but…the 
>> relevant documents are probably on that shard. So the best documents have a 
>> lower score using local IDF.
>> 
>> This also shows up with lots of shards or small shards, because there will 
>> be uneven distribution of docs. When I retired from LexisNexis, we had a 
>> cluster with 320 shards. I’m sure that had some interesting IDF behavior.
>> 
>> I wrote up how we did this in a Java distributed search layer for Ultraseek: 
>> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
>> 
>> There is some earlier discussion here: 
>> https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf
>> 
>> I don’t think there is a Jira issue for this.
>> 
>> I think that is all the unfinished business since putting Solr 1.3 into 
>> production at Netflix. Pretty darned good job everybody. Huge thanks to all 
>> the contributors and committers who have put in years of effort over that 
>> time.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 
