Surprising interactions between MultiFloatFunctions and the query function

Joel Westberg Mon, 03 Jul 2023 16:38:10 -0700

Hi Solr devs!

I've identified some surprising behavior in how MultiFloat functions
like *max* and *sum* interact with QueryValueSource, and wanted to get some
second opinions before I open a bug ticket. I suspect this is a Lucene issue,
but I'm starting here because Solr is my entry point to the problem. The issue
is present in (at least) Solr 7 as well as the latest Solr 9.2 release.



*Examples*
In the examples below I have an index consisting of these two docs:
  [
    {"id":"A", "i_d":1},
    {"id":"B", "i_d":2}
  ]

I'm running a set of queries using *q=*:*&defType=edismax* and applying a
*boost* parameter.

Query 1: *boost=query({!lucene v="id:A^=10"}, 1)*
The observed scores for the two documents come out to A=10, B=1, as
expected. B is not matched by the inner query, so the query function falls
back to its default value of 1 and B gets the score 1*1.

Query 2: *boost=max(0, query({!lucene v="id:A^=10"}, 1))*
Here I've added a *max(0, ...)* wrapper around the same query function as
above. In this case, the observed scores for the two documents come out to
A=10, *B=0*. This is surprising, as I would normally expect *max(0, 1)=1*.

Query 3: *boost=sum(i_d, query({!lucene v="id:A^=10"}, 1))*
Adding in a *sum* here, we get the scores *A=11, B=3* which is what we
expect (*MatchAll(1) * (2+1)=3*).

Query 4: *boost=max(1, sum(i_d, query({!lucene v="id:A^=10"}, 1)))*
Wrapping Query 3 in a max function (a bit closer to my actual use case) to
ensure that we never multiply by anything less than *1*, we get the
following scores: A=11, *B=1*, where I would expect B=3 (*max(1, 3)=3*).

Results 2 and 4 were very surprising, and difficult to detect and
understand.

*Root cause*
Tracing this issue down through the code, it seems to stem from two things.
First, MaxFloatFunction.func
<https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java#L39>
checks whether each component part (in this case const(1) and query(..))
scores the given doc, rather than simply retrieving the score. Second,
QueryDocValues.exists
<https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/QueryValueSource.java#L141>
returns *false* for any document not matched by the query (regardless of the
default value).
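The combination of these two behaviors can be sketched with a small Python model. This is only an illustration of the logic described above, not Lucene's code: the names Const, QueryFn and max_func are hypothetical, and returning 0.0 when no source exists is an assumption of the sketch.

```python
class Const:
    """Hypothetical constant value source: exists for every document."""
    def __init__(self, value):
        self.value = value

    def exists(self, doc):
        return True

    def float_val(self, doc):
        return self.value


class QueryFn:
    """Hypothetical model of query(subquery, default)."""
    def __init__(self, scores, default):
        self.scores = scores      # doc id -> score, matched docs only
        self.default = default

    def exists(self, doc):
        # False for unmatched docs, regardless of the default value.
        return doc in self.scores

    def float_val(self, doc):
        return self.scores.get(doc, self.default)


def max_func(doc, sources):
    # Only sources that "exist" for the doc take part in the max.
    values = [s.float_val(doc) for s in sources if s.exists(doc)]
    return max(values) if values else 0.0  # assumed fallback


query = QueryFn({"A": 10.0}, default=1.0)
print(query.float_val("B"))                # Query 1 for B: 1.0
print(max_func("B", [Const(0.0), query]))  # Query 2 for B: 0.0
```

In this model the default of 1 is visible when the query function is used directly, but it is skipped as soon as an exists-aware wrapper such as max sits around it.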

It is also surprising that SumFloatFunction.exists
<https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiFloatFunction.java#L52>
is implemented as *allExists* rather than *anyExists*, which is why Query 4
breaks and completely ignores the *i_d* score component. I expected that
*sum* would skip any of its value sources that do not apply to the given
doc being scored, and simply sum up the rest.
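To make the Query 3 and Query 4 results concrete, here is a rough Python model of document B only. It is a sketch under assumptions, not Lucene's implementation: each value source is reduced to a hypothetical (value, exists) pair, and the exists_policy parameter is invented here to compare the *allExists* and *anyExists* variants.

```python
from typing import Callable, Iterable, List, Tuple

Value = Tuple[float, bool]  # (float value, exists flag) for one document


def sum_func(parts: List[Value],
             exists_policy: Callable[[Iterable[bool]], bool] = all) -> Value:
    # Every value is added, including a query default; only the
    # exists flag depends on the policy.
    total = sum(v for v, _ in parts)
    return total, exists_policy(e for _, e in parts)


def max_func(parts: List[Value]) -> Value:
    # Sources that do not exist for the doc are skipped.
    present = [v for v, e in parts if e]
    return (max(present), True) if present else (0.0, False)


# Document B: i_d=2 exists; query(...) yields its default 1 but does not exist.
i_d, query, one = (2.0, True), (1.0, False), (1.0, True)

q3 = sum_func([i_d, query])
q4 = max_func([one, q3])
q4_any = max_func([one, sum_func([i_d, query], exists_policy=any)])

print(q3)      # (3.0, False): value is 3, but the sum does not "exist"
print(q4)      # (1.0, True): max skips the sum and falls back to 1
print(q4_any)  # (3.0, True): with an anyExists-style sum, max sees 3
```

Note that the anyExists variant in this sketch still adds the query default. Under a stricter reading where *sum* skips missing sources entirely, the sum for B would be 2 rather than 3.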

*Workaround*
A relatively straightforward workaround on the query-writing side is to
avoid relying on the default value of the query function and instead always
write
*max(<default_value>, query(...)).*
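As a rough check of why the workaround helps, the following Python sketch models Query 4 rewritten as max(1, sum(i_d, max(1, query(...)))). It is a simplified model with hypothetical (value, exists) pairs, not Lucene's code, and the rewritten query has not been verified against a running Solr instance here.

```python
def max_func(parts):
    # Skip sources that do not exist for the doc; assumed 0.0 fallback.
    present = [v for v, e in parts if e]
    return (max(present), True) if present else (0.0, False)


def sum_func(parts):
    # allExists-style sum: exists only if every part exists.
    return sum(v for v, _ in parts), all(e for _, e in parts)


default = (1.0, True)  # the constant used in place of the query default

# Document B: the query does not match, so its value is skipped by max.
query_b, i_d_b = (0.0, False), (2.0, True)
guarded_b = max_func([default, query_b])
boost_b = max_func([default, sum_func([i_d_b, guarded_b])])

# Document A: the query matches with score 10.
query_a, i_d_a = (10.0, True), (1.0, True)
guarded_a = max_func([default, query_a])
boost_a = max_func([default, sum_func([i_d_a, guarded_a])])

print(guarded_b)  # (1.0, True): the inner max always exists
print(boost_b)    # (3.0, True)
print(boost_a)    # (11.0, True)
```

Because the inner max contains a constant, it exists for every document, so the surrounding sum and outer max no longer drop it.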

*TL;DR:*
I wanted to get a temperature check on which parts of this (if any) would
make sense to open a bug for, and in which project.

I have no idea how many things might break deep inside Lucene if this
behavior were to change, given that it appears to have been there for a
very long time, so perhaps adding some new Solr-specific value functions and
some documentation is the way to go?


Thanks in advance,
Joel Westberg
