Thanks for explaining the alternatives and root cause.

Will create a bug and move the discussion to GitHub.

On Wed, 2 Apr, 2025, 13:53 Adrien Grand, <jpou...@gmail.com> wrote:

> > It was fast because (once the collector has filled its priority queue),
> we'd check the (constant) impacts to find the first block that's strictly
> better than the min competitive score. Since all scores are equal, that
> would quickly skip to the end
>
> This sounds correct to me. To expand on your analysis, SlowImpactsEnum
> computes the max possible score as SimScorer.score(freq =
> Integer.MAX_VALUE, norm = 1), while scores are computed as
> SimScorer.score(freq = 1, norm = 1). Assuming BM25Similarity, this
> over-estimates scores by a factor of ~2, so skipping works much much worse
> (and not at all when querying a single term).
>
> Even though Lucene101PostingsFormat no longer uses SlowImpactsEnum, it
> still has the same issue of computing the max possible score with
> freq=MAX_VALUE instead of freq=1 when term frequencies are disabled.
>
> > We can address this in OpenSearch by just wrapping things in a
> ConstantScoreQuery if frequencies are not enabled for the field.
>
> In case you missed it, this would produce different scores, as scores on a
> field that disables term frequencies and norms still have an IDF component,
> or the ConstantScoreQuery would need to be boosted by this IDF-only score.
>
> > I'm wondering if it would make sense to push a scoreMode override into
> TermQuery's createWeight. If the field doesn't have frequencies, would it
> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e.
> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?
>
> If you want to do something like this, we should wrap the TermWeight under
> a ConstantScoreWeight (with the proper constant score) so that we don't
> pass a ScoreMode that doesn't require scores and then have the collector
> call TermScorer#score(). Looking at the test framework, it should fail
> queries that try to do this:
> https://github.com/apache/lucene/blob/525bf34bfdfc16cc220d326d2cf30541f1afef29/lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingScorer.java#L119
> .
>
> In any case, I'd be keen to fix our ImpactsEnums to return a maximum
> possible term frequency of 1 instead of Integer.MAX_VALUE when frequencies
> are not indexed, which should also address the problem you're observing.
>
>
> On Wed, Apr 2, 2025 at 9:04 PM Michael Froh <msf...@gmail.com> wrote:
>
>> Hey,
>>
>> Full disclosure -- I sit at a desk next to Aniketh, so we chatted about
>> this one in the real world.
>>
>> Our current working theory is as follows:
>>
>> We're using the TOP_SCORES score mode on both the old and new code paths.
>> On the old code path, we were returning a BlockImpactsDocsEnum, even though
>> we didn't have frequencies. I'm guessing that the impacts were constant
>> across all blocks, since we have no frequencies and no norms. It was fast
>> because (once the collector has filled its priority queue), we'd check the
>> (constant) impacts to find the first block that's strictly better than the
>> min competitive score. Since all scores are equal, that would quickly skip
>> to the end. On the new code path, we always return the SlowImpactsEnum,
>> since we don't have frequencies. Once we fill up the collector's priority
>> queue, we're not able to do any impact checks to try to find a
>> competitive block, so we iterate through all the doc IDs for the term.
>>
>> The best solution would probably be to flip the score mode from
>> TOP_SCORES to TOP_DOCS, and avoid looking at impacts altogether, since we
>> don't have frequencies anyway. Then early termination logic will kick in,
>> which is arguably even better than the previous fast skipping logic. We can
>> address this in OpenSearch by just wrapping things in a ConstantScoreQuery
>> if frequencies are not enabled for the field.
>>
>> I'm wondering if it would make sense to push a scoreMode override into
>> TermQuery's createWeight. If the field doesn't have frequencies, would it
>> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e.
>> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?
>>
>> Thanks,
>> Froh
>>
>>
>> On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com>
>> wrote:
>>
>>> Hey folks,
>>>
>>> While investigating a regression in OpenSearch versions 2.17.1 ( Lucene
>>> 9.11.1 ) and 2.18.0 ( Lucene 9.12.0 ) for simple Term Query in Big5
>>> workload over process.name field, I noticed that the new
>>> Lucene912PostingsReader creates the ImpactsEnum by wrapping SlowImpactsEnum
>>> over postings when a field only has IndexOptions.DOCS
>>>
>>> curl -X POST "http://localhost:9200/big5/_search"; -H "Content-Type:
>>> application/json" -d '{ "query": { "term": { "process.name": "kernel" }
>>> } }'
>>>
>>>
>>> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
>>> BlockTermState state, int flags) Has an extra check on *indexHasFreqs*
>>>
>>> if (state.docFreq >= BLOCK_SIZE
>>>     && indexHasFreqs
>>>     && (indexHasPositions == false
>>>         || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == 
>>> false)) {
>>>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
>>> }
>>>
>>>
>>> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum
>>> for fields with IndexOptions.DOCS and only creates the SlowImpactsEnum when
>>> document frequency is less than 128 ( Block size )
>>>
>>>
>>> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
>>> BlockTermState state, int flags)
>>>
>>> if (state.docFreq <= BLOCK_SIZE) {
>>>   // no skip data
>>>   return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
>>> }
>>>
>>>
>>> if (indexHasPositions == false
>>>     || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == 
>>> false) {
>>>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
>>> }
>>>
>>>
>>>
>>> Since Lucene 9.12.0 wraps a SlowImpactsEnum which has a no-op for
>>> advanceShallow method, the Term Query is never able to skip data when
>>> called from the bulk scorer via DISI#nextDoc() Whereas the advanceShallow
>>> gets used in Lucene 9.11.1 and skips over a lot of docs resulting in faster
>>> completion.
>>> The difference with 116 million docs of Big5 index is >200ms in Lucene
>>> 9.12.0 to <=5ms in Lucene 9.11.1
>>>
>>> I tried reindexing the process.name into another index but with
>>> docs_and_freqs enabled and the query latency came back to normal since it
>>> uses BlockImpactsDocsEnum as its ImpactsEnum.
>>>
>>> Is this a bug in the 912 postings reader ? Or is it not possible to use
>>> the BlockImpactsDocsEnum with the new postings format ?
>>>
>>>
>>> Thanks,
>>> Aniketh
>>>
>>
>
> --
> Adrien
>

Reply via email to