Created https://github.com/apache/lucene/issues/14445





On Thu, Apr 3, 2025 at 5:59 AM ANIKETH JAIN <checkanik...@gmail.com> wrote:

> Thanks for explaining the alternatives and root cause.
>
> Will create a bug and move the discussion to GitHub.
>
> On Wed, 2 Apr, 2025, 13:53 Adrien Grand, <jpou...@gmail.com> wrote:
>
>> > It was fast because (once the collector has filled its priority queue),
>> we'd check the (constant) impacts to find the first block that's strictly
>> better than the min competitive score. Since all scores are equal, that
>> would quickly skip to the end
>>
>> This sounds correct to me. To expand on your analysis, SlowImpactsEnum
>> computes the max possible score as SimScorer.score(freq =
>> Integer.MAX_VALUE, norm = 1), while scores are computed as
>> SimScorer.score(freq = 1, norm = 1). Assuming BM25Similarity, this
>> over-estimates scores by a factor of ~2, so skipping works much much worse
>> (and not at all when querying a single term).
>>
>> Even though Lucene101PostingsFormat no longer uses SlowImpactsEnum, it
>> still has the same issue of computing the max possible score with
>> freq=MAX_VALUE instead of freq=1 when term frequencies are disabled.
>>
>> > We can address this in OpenSearch by just wrapping things in a
>> ConstantScoreQuery if frequencies are not enabled for the field.
>>
>> In case you missed it, this would produce different scores, as scores on
>> a field that disables term frequencies and norms still have an IDF
>> component, or the ConstantScoreQuery would need to be boosted by this
>> IDF-only score.
>>
>> > I'm wondering if it would make sense to push a scoreMode override into
>> TermQuery's createWeight. If the field doesn't have frequencies, would it
>> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e.
>> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?
>>
>> If you want to do something like this, we should wrap the TermWeight
>> under a ConstantScoreWeight (with the proper constant score) so that we
>> don't pass a ScoreMode that doesn't require scores and then have the
>> collector call TermScorer#score(). Looking at the test framework, it should
>> fail queries that try to do this:
>> https://github.com/apache/lucene/blob/525bf34bfdfc16cc220d326d2cf30541f1afef29/lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingScorer.java#L119
>> .
>>
>> In any case, I'd be keen to fix our ImpactsEnums to return a maximum
>> possible term frequency of 1 instead of Integer.MAX_VALUE when frequencies
>> are not indexed, which should also address the problem you're observing.
>>
>>
>> On Wed, Apr 2, 2025 at 9:04 PM Michael Froh <msf...@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> Full disclosure -- I sit at a desk next to Aniketh, so we chatted about
>>> this one in the real world.
>>>
>>> Our current working theory is as follows:
>>>
>>> We're using the TOP_SCORES score mode on both the old and new code
>>> paths. On the old code path, we were returning a BlockImpactsDocsEnum, even
>>> though we didn't have frequencies. I'm guessing that the impacts were
>>> constant across all blocks, since we have no frequencies and no norms. It
>>> was fast because (once the collector has filled its priority queue), we'd
>>> check the (constant) impacts to find the first block that's strictly better
>>> than the min competitive score. Since all scores are equal, that would
>>> quickly skip to the end. On the new code path, we always return the
>>> SlowImpactsEnum, since we don't have frequencies. Once we fill up the
>>> collector's priority queue, we're not able to do any impact checks to try
>>> to find a competitive block, so we iterate through all the doc IDs for the
>>> term.
>>>
>>> The best solution would probably be to flip the score mode from
>>> TOP_SCORES to TOP_DOCS, and avoid looking at impacts altogether, since we
>>> don't have frequencies anyway. Then early termination logic will kick in,
>>> which is arguably even better than the previous fast skipping logic. We can
>>> address this in OpenSearch by just wrapping things in a ConstantScoreQuery
>>> if frequencies are not enabled for the field.
>>>
>>> I'm wondering if it would make sense to push a scoreMode override into
>>> TermQuery's createWeight. If the field doesn't have frequencies, would it
>>> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e.
>>> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)?
>>>
>>> Thanks,
>>> Froh
>>>
>>>
>>> On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com>
>>> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> While investigating a regression in OpenSearch versions 2.17.1 ( Lucene
>>>> 9.11.1 ) and 2.18.0 ( Lucene 9.12.0 ) for simple Term Query in Big5
>>>> workload over process.name field, I noticed that the new
>>>> Lucene912PostingsReader creates the ImpactsEnum by wrapping SlowImpactsEnum
>>>> over postings when a field only has IndexOptions.DOCS
>>>>
>>>> curl -X POST "http://localhost:9200/big5/_search"; -H "Content-Type:
>>>> application/json" -d '{ "query": { "term": { "process.name": "kernel"
>>>> } } }'
>>>>
>>>>
>>>> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
>>>> BlockTermState state, int flags) Has an extra check on *indexHasFreqs*
>>>>
>>>> if (state.docFreq >= BLOCK_SIZE
>>>>     && indexHasFreqs
>>>>     && (indexHasPositions == false
>>>>         || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == 
>>>> false)) {
>>>>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
>>>> }
>>>>
>>>>
>>>> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum
>>>> for fields with IndexOptions.DOCS and only creates the SlowImpactsEnum when
>>>> document frequency is less than 128 ( Block size )
>>>>
>>>>
>>>> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo,
>>>> BlockTermState state, int flags)
>>>>
>>>> if (state.docFreq <= BLOCK_SIZE) {
>>>>   // no skip data
>>>>   return new SlowImpactsEnum(postings(fieldInfo, state, null, flags));
>>>> }
>>>>
>>>>
>>>> if (indexHasPositions == false
>>>>     || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == 
>>>> false) {
>>>>   return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state);
>>>> }
>>>>
>>>>
>>>>
>>>> Since Lucene 9.12.0 wraps a SlowImpactsEnum which has a no-op for
>>>> advanceShallow method, the Term Query is never able to skip data when
>>>> called from the bulk scorer via DISI#nextDoc() Whereas the advanceShallow
>>>> gets used in Lucene 9.11.1 and skips over a lot of docs resulting in faster
>>>> completion.
>>>> The difference with 116 million docs of Big5 index is >200ms in Lucene
>>>> 9.12.0 to <=5ms in Lucene 9.11.1
>>>>
>>>> I tried reindexing the process.name into another index but with
>>>> docs_and_freqs enabled and the query latency came back to normal since it
>>>> uses BlockImpactsDocsEnum as its ImpactsEnum.
>>>>
>>>> Is this a bug in the 912 postings reader ? Or is it not possible to use
>>>> the BlockImpactsDocsEnum with the new postings format ?
>>>>
>>>>
>>>> Thanks,
>>>> Aniketh
>>>>
>>>
>>
>> --
>> Adrien
>>
>

Reply via email to