Fuzzy Search Scoring Adjustment

Eastlack, Kainoa Wed, 23 Sep 2020 10:59:23 -0700

When performing a fuzzy search inside a BooleanQuery, it looks like the
default behavior is to score all fuzzy matches separately and then sum them
up to get an aggregate score. However, I need it to score each document by
the maximum over the distinct terms it matches, rather than by their sum, to
avoid overly inflated scores in some circumstances.


For example, consider a query for "Bstn~2" and four documents containing
"Boston", "Basin", "Boston Basin", and "Boston Boston Basin". The query
might respectively score them as 1, 1, 2, and 3 (or something like that,
depending on the scorer used, of course). However, I need it to instead
score them as 1, 1, 1, and 2, since that's the count of just the most
frequent unique fuzzy match in each document.
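
To make the arithmetic concrete, here is a minimal sketch in plain Java. It
is not Lucene code and does not reproduce any real scorer: the class and
method names (FuzzyScoreSketch, matchCounts, sumScore, maxScore) are
hypothetical, tokens are split on whitespace and lowercased, and a term counts
as a fuzzy match when its plain Levenshtein distance to the query is at most
the given limit. Those are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: raw match counts stand in for real scores.
final class FuzzyScoreSketch {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Frequency of each distinct document term within maxEdits of the query.
    static Map<String, Integer> matchCounts(String query, int maxEdits, String doc) {
        Map<String, Integer> counts = new HashMap<>();
        String q = query.toLowerCase(Locale.ROOT);
        for (String token : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && editDistance(q, token) <= maxEdits) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Behavior I seem to be getting: add up every matching term's count.
    static int sumScore(String query, int maxEdits, String doc) {
        return matchCounts(query, maxEdits, doc).values().stream()
                .mapToInt(Integer::intValue).sum();
    }

    // Behavior I want: keep only the most frequent distinct matching term.
    static int maxScore(String query, int maxEdits, String doc) {
        return matchCounts(query, maxEdits, doc).values().stream()
                .mapToInt(Integer::intValue).max().orElse(0);
    }
}
```

With the query "Bstn" and a limit of 2 edits, sumScore gives 1, 1, 2, and 3
for the four documents above, while maxScore gives 1, 1, 1, and 2.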

Ideally I'd like to use a built-in mechanism for achieving this. If none is
available, extending the BooleanQuery, BooleanWeight, and/or BooleanScorer
classes to use slightly different scoring logic while otherwise functioning
exactly the same would also work. However, as near as I can tell, all of
those are either final classes or have no public constructor, which
effectively makes it impossible to reuse their logic directly.

If anyone has any ideas of how to approach this, it would be very helpful.

Thanks,
Kainoa
