Actually, I believe that the Lucene scoring function is based on *Okapi BM25* (BM is an abbreviation of "best matching"), which builds on the probabilistic retrieval framework <https://en.m.wikipedia.org/wiki/Probabilistic_relevance_model> developed in the 1970s and 1980s by Stephen E. Robertson <https://en.m.wikipedia.org/wiki/Stephen_E._Robertson>, Karen Spärck Jones <https://en.m.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones>, and others.
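For reference, the usual textbook form (paraphrasing the Wikipedia article linked below; Lucene's BM25Similarity follows roughly the same shape, modulo implementation details that don't change the ranking) is:

  score(D, Q) = \sum_i \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

  \mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

where f(q_i, D) is the frequency of term q_i in document D, |D| is the document length, avgdl is the average document length in the index, N is the number of documents, n(q_i) is the number of documents containing q_i, and k_1 and b are free parameters (commonly around 1.2 and 0.75).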
There are several interpretations of IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model <https://en.m.wikipedia.org/wiki/Binary_Independence_Model>. Info from: https://en.m.wikipedia.org/wiki/Okapi_BM25

> You could calculate an ideal score, but that can change every time a
> document is added to or deleted from the index, because of idf. So the
> ideal score isn't a useful mental model.
>
> Essentially, you need to tell your users to worry about something that
> matters. The absolute value of the score does not matter.

While I understand the concern, quite often BM25 scores are used post-retrieval (in two-stage retrieval/ranking systems) to feed learning-to-rank models, which often transform the score into [0,1] using some normalization function that typically involves estimating a max score from the score distribution (see the rough sketch at the end of this message).

J

On Mon, Dec 19, 2022 at 11:31 AM Walter Underwood <wun...@wunderwood.org> wrote:

> That article is copied from the old wiki, so it is much earlier than 2019,
> more like 2009. Unfortunately, the links to the email discussion are all
> dead, but the issues in the article are still true.
>
> If you really want to go down that path, you might be able to do it with a
> similarity class that implements a probabilistic relevance model. I'd start
> the literature search with this Google query.
>
> probabilistic information retrieval
> <https://www.google.com/search?client=safari&rls=en&q=probablistic+information+retrieval&ie=UTF-8&oe=UTF-8>
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Dec 18, 2022, at 2:47 AM, Mikhail Khludnev <m...@apache.org> wrote:
>
> Thanks for the reply, Walter.
> Recently Robert commented on a PR with the link
> https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
> It gives arguments against my proposal. Honestly, I'm still in doubt.
>
> On Tue, Dec 6, 2022 at 8:15 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
>> As you point out, this is a probabilistic relevance model. Lucene uses a
>> vector space model.
>>
>> A probabilistic model gives an estimate of how relevant each document is
>> to the query. Unfortunately, their overall relevance isn't as good as a
>> vector space model.
>>
>> You could calculate an ideal score, but that can change every time a
>> document is added to or deleted from the index, because of idf. So the
>> ideal score isn't a useful mental model.
>>
>> Essentially, you need to tell your users to worry about something that
>> matters. The absolute value of the score does not matter.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev <m...@apache.org> wrote:
>>
>> Hello dev!
>> Users are interested in the meaning of the absolute value of the score, but
>> we always reply that it's just a relative value. The maximum score of the
>> matched docs is not an answer.
>> Ultimately we need to measure how much sense a query has in the index.
>> e.g. a [jet OR propulsion OR spider] query should be measured as nonsense,
>> because the best matching docs have much lower scores than a hypothetical
>> (and presumably absent) doc matching [jet AND propulsion AND spider].
>> Could it be a method that returns the maximum possible score if all query
>> terms would match?
>> Something like stubbing postings on a virtual all_matching doc with average
>> stats like tf and field length and kicking the scorers in? It reminds me of
>> something about probabilistic retrieval, but not much. Is there anything
>> like this already?
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>
> --
> Sincerely yours
> Mikhail Khludnev
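P.S. Here is a rough sketch of the kind of post-retrieval normalization I meant above. This is not anything in Lucene; the class and method names are made up for illustration. It just rescales one query's raw BM25 scores against the min and max observed in that result list before they go to a downstream learning-to-rank model:

    import java.util.Arrays;

    public class ScoreNormalization {

        // Min-max rescaling against the scores observed in one result list.
        // (Hypothetical helper for illustration; not a Lucene API.)
        static double[] normalizeByObservedRange(double[] rawScores) {
            double max = Arrays.stream(rawScores).max().orElse(1.0);
            double min = Arrays.stream(rawScores).min().orElse(0.0);
            double range = Math.max(max - min, 1e-9);   // avoid divide-by-zero
            return Arrays.stream(rawScores)
                         .map(s -> (s - min) / range)   // each score now in [0,1]
                         .toArray();
        }

        public static void main(String[] args) {
            // Raw BM25 scores for the top hits of a single query (made-up numbers).
            double[] bm25 = { 12.7, 9.3, 4.1, 0.8 };
            System.out.println(Arrays.toString(normalizeByObservedRange(bm25)));
        }
    }

Of course this only makes scores comparable within one result list, which is really Walter's point: the absolute value still carries no meaning across queries or across index updates.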