Re: [ Question ] Text Field with IndexOptions.DOCS don't use skip data due to SlowImpactsEnum with Lucene 9.12.0 compared to 9.11.1

Adrien Grand Wed, 02 Apr 2025 13:54:06 -0700

> It was fast because (once the collector has filled its priority queue),
we'd check the (constant) impacts to find the first block that's strictly
better than the min competitive score. Since all scores are equal, that
would quickly skip to the end


This sounds correct to me. To expand on your analysis, SlowImpactsEnum
computes the max possible score as SimScorer.score(freq =
Integer.MAX_VALUE, norm = 1), while scores are computed as
SimScorer.score(freq = 1, norm = 1). Assuming BM25Similarity, this
over-estimates scores by a factor of ~2, so skipping works much much worse
(and not at all when querying a single term).

Even though Lucene101PostingsFormat no longer uses SlowImpactsEnum, it
still has the same issue of computing the max possible score with
freq=MAX_VALUE instead of freq=1 when term frequencies are disabled.

> We can address this in OpenSearch by just wrapping things in a
ConstantScoreQuery if frequencies are not enabled for the field.

In case you missed it, this would produce different scores, as scores on a
field that disables term frequencies and norms still have an IDF component,
or the ConstantScoreQuery would need to be boosted by this IDF-only score.

> I'm wondering if it would make sense to push a scoreMode override into
TermQuery's createWeight. If the field doesn't have frequencies, would it
make sense to do a similar scoreMode override to ConstantScoreQuery (i.e.
COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?

If you want to do something like this, we should wrap the TermWeight under
a ConstantScoreWeight (with the proper constant score) so that we don't
pass a ScoreMode that doesn't require scores and then have the collector
call TermScorer#score(). Looking at the test framework, it should fail
queries that try to do this:
https://github.com/apache/lucene/blob/525bf34bfdfc16cc220d326d2cf30541f1afef29/lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingScorer.java#L119
.

In any case, I'd be keen to fix our ImpactsEnums to return a maximum
possible term frequency of 1 instead of Integer.MAX_VALUE when frequencies
are not indexed, which should also address the problem you're observing.


On Wed, Apr 2, 2025 at 9:04 PM Michael Froh <msf...@gmail.com> wrote:

> Hey,
>
> Full disclosure -- I sit at a desk next to Aniketh, so we chatted about
> this one in the real world.
>
> Our current working theory is as follows:
>
> We're using the TOP_SCORES score mode on both the old and new code paths.
> On the old code path, we were returning a BlockImpactsDocsEnum, even though
> we didn't have frequencies. I'm guessing that the impacts were constant
> across all blocks, since we have no frequencies and no norms. It was fast
> because (once the collector has filled its priority queue), we'd check the
> (constant) impacts to find the first block that's strictly better than the
> min competitive score. Since all scores are equal, that would quickly skip
> to the end. On the new code path, we always return the SlowImpactsEnum,
> since we don't have frequencies. Once we fill up the collector's priority
> queue, we're not able to do any impact checks to try to find a
> competitive block, so we iterate through all the doc IDs for the term.
>
> The best solution would probably be to flip the score mode from TOP_SCORES
> to TOP_DOCS, and avoid looking at impacts altogether, since we don't have
> frequencies anyway. Then early termination logic will kick in, which is
> arguably even better than the previous fast skipping logic. We can address
> this in OpenSearch by just wrapping things in a ConstantScoreQuery if
> frequencies are not enabled for the field.
>
> I'm wondering if it would make sense to push a scoreMode override into
> TermQuery's createWeight. If the field doesn't have frequencies, would it
> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e.
> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?
>
> Thanks,
> Froh
>
>
> On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com>
> wrote:
>
>> Hey folks,
>>
>> While investigating a regression in OpenSearch versions 2.17.1 ( Lucene
>> 9.11.1 ) and 2.18.0 ( Lucene 9.12.0 ) for simple Term Query in Big5
>> workload over process.name field, I noticed that the new
>> Lucene912PostingsReader creates the ImpactsEnum by wrapping SlowImpactsEnum
>> over postings when a field only has IndexOptions.DOCS
>>
>> curl -X POST "http://localhost:9200/big5/_search"; -H "Content-Type:
>> application/json" -d '{ "query": { "term": { "process.name": "kernel" }
>> } }'
>>
>>
>> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
>> BlockTermState state, int flags) Has an extra check on *indexHasFreqs*
>>
>> if (state.docFreq >= BLOCK_SIZE
>>     && indexHasFreqs
>>     && (indexHasPositions == false
>>         || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == 
>> false)) {
>>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
>> }
>>
>>
>> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum
>> for fields with IndexOptions.DOCS and only creates the SlowImpactsEnum when
>> document frequency is less than 128 ( Block size )
>>
>>
>> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
>> BlockTermState state, int flags)
>>
>> if (state.docFreq <= BLOCK_SIZE) {
>>   // no skip data
>>   return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
>> }
>>
>>
>> if (indexHasPositions == false
>>     || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == 
>> false) {
>>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
>> }
>>
>>
>>
>> Since Lucene 9.12.0 wraps a SlowImpactsEnum which has a no-op for
>> advanceShallow method, the Term Query is never able to skip data when
>> called from the bulk scorer via DISI#nextDoc() Whereas the advanceShallow
>> gets used in Lucene 9.11.1 and skips over a lot of docs resulting in faster
>> completion.
>> The difference with 116 million docs of Big5 index is >200ms in Lucene
>> 9.12.0 to <=5ms in Lucene 9.11.1
>>
>> I tried reindexing the process.name into another index but with
>> docs_and_freqs enabled and the query latency came back to normal since it
>> uses BlockImpactsDocsEnum as its ImpactsEnum.
>>
>> Is this a bug in the 912 postings reader ? Or is it not possible to use
>> the BlockImpactsDocsEnum with the new postings format ?
>>
>>
>> Thanks,
>> Aniketh
>>
>

-- 
Adrien

Re: [ Question ] Text Field with IndexOptions.DOCS don't use skip data due to SlowImpactsEnum with Lucene 9.12.0 compared to 9.11.1

Reply via email to