Created https://github.com/apache/lucene/issues/14445
On Thu, Apr 3, 2025 at 5:59 AM ANIKETH JAIN <checkanik...@gmail.com> wrote: > Thanks for explaining the alternatives and root cause. > > Will create a bug and move the discussion to GitHub. > > On Wed, 2 Apr, 2025, 13:53 Adrien Grand, <jpou...@gmail.com> wrote: > >> > It was fast because (once the collector has filled its priority queue), >> we'd check the (constant) impacts to find the first block that's strictly >> better than the min competitive score. Since all scores are equal, that >> would quickly skip to the end >> >> This sounds correct to me. To expand on your analysis, SlowImpactsEnum >> computes the max possible score as SimScorer.score(freq = >> Integer.MAX_VALUE, norm = 1), while scores are computed as >> SimScorer.score(freq = 1, norm = 1). Assuming BM25Similarity, this >> over-estimates scores by a factor of ~2, so skipping works much much worse >> (and not at all when querying a single term). >> >> Even though Lucene101PostingsFormat no longer uses SlowImpactsEnum, it >> still has the same issue of computing the max possible score with >> freq=MAX_VALUE instead of freq=1 when term frequencies are disabled. >> >> > We can address this in OpenSearch by just wrapping things in a >> ConstantScoreQuery if frequencies are not enabled for the field. >> >> In case you missed it, this would produce different scores, as scores on >> a field that disables term frequencies and norms still have an IDF >> component, or the ConstantScoreQuery would need to be boosted by this >> IDF-only score. >> >> > I'm wondering if it would make sense to push a scoreMode override into >> TermQuery's createWeight. If the field doesn't have frequencies, would it >> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e. >> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)? >> >> If you want to do something like this, we should wrap the TermWeight >> under a ConstantScoreWeight (with the proper constant score) so that we >> don't pass a ScoreMode that doesn't require scores and then have the >> collector call TermScorer#score(). Looking at the test framework, it should >> fail queries that try to do this: >> https://github.com/apache/lucene/blob/525bf34bfdfc16cc220d326d2cf30541f1afef29/lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingScorer.java#L119 >> . >> >> In any case, I'd be keen to fix our ImpactsEnums to return a maximum >> possible term frequency of 1 instead of Integer.MAX_VALUE when frequencies >> are not indexed, which should also address the problem you're observing. >> >> >> On Wed, Apr 2, 2025 at 9:04 PM Michael Froh <msf...@gmail.com> wrote: >> >>> Hey, >>> >>> Full disclosure -- I sit at a desk next to Aniketh, so we chatted about >>> this one in the real world. >>> >>> Our current working theory is as follows: >>> >>> We're using the TOP_SCORES score mode on both the old and new code >>> paths. On the old code path, we were returning a BlockImpactsDocsEnum, even >>> though we didn't have frequencies. I'm guessing that the impacts were >>> constant across all blocks, since we have no frequencies and no norms. It >>> was fast because (once the collector has filled its priority queue), we'd >>> check the (constant) impacts to find the first block that's strictly better >>> than the min competitive score. Since all scores are equal, that would >>> quickly skip to the end. On the new code path, we always return the >>> SlowImpactsEnum, since we don't have frequencies. Once we fill up the >>> collector's priority queue, we're not able to do any impact checks to try >>> to find a competitive block, so we iterate through all the doc IDs for the >>> term. >>> >>> The best solution would probably be to flip the score mode from >>> TOP_SCORES to TOP_DOCS, and avoid looking at impacts altogether, since we >>> don't have frequencies anyway. Then early termination logic will kick in, >>> which is arguably even better than the previous fast skipping logic. We can >>> address this in OpenSearch by just wrapping things in a ConstantScoreQuery >>> if frequencies are not enabled for the field. >>> >>> I'm wondering if it would make sense to push a scoreMode override into >>> TermQuery's createWeight. If the field doesn't have frequencies, would it >>> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e. >>> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)? >>> >>> Thanks, >>> Froh >>> >>> >>> On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com> >>> wrote: >>> >>>> Hey folks, >>>> >>>> While investigating a regression in OpenSearch versions 2.17.1 ( Lucene >>>> 9.11.1 ) and 2.18.0 ( Lucene 9.12.0 ) for simple Term Query in Big5 >>>> workload over process.name field, I noticed that the new >>>> Lucene912PostingsReader creates the ImpactsEnum by wrapping SlowImpactsEnum >>>> over postings when a field only has IndexOptions.DOCS >>>> >>>> curl -X POST "http://localhost:9200/big5/_search" -H "Content-Type: >>>> application/json" -d '{ "query": { "term": { "process.name": "kernel" >>>> } } }' >>>> >>>> >>>> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo, >>>> BlockTermState state, int flags) Has an extra check on *indexHasFreqs* >>>> >>>> if (state.docFreq >= BLOCK_SIZE >>>> && indexHasFreqs >>>> && (indexHasPositions == false >>>> || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == >>>> false)) { >>>> return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state); >>>> } >>>> >>>> >>>> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum >>>> for fields with IndexOptions.DOCS and only creates the SlowImpactsEnum when >>>> document frequency is less than 128 ( Block size ) >>>> >>>> >>>> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo, >>>> BlockTermState state, int flags) >>>> >>>> if (state.docFreq <= BLOCK_SIZE) { >>>> // no skip data >>>> return new SlowImpactsEnum(postings(fieldInfo, state, null, flags)); >>>> } >>>> >>>> >>>> if (indexHasPositions == false >>>> || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == >>>> false) { >>>> return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state); >>>> } >>>> >>>> >>>> >>>> Since Lucene 9.12.0 wraps a SlowImpactsEnum which has a no-op for >>>> advanceShallow method, the Term Query is never able to skip data when >>>> called from the bulk scorer via DISI#nextDoc() Whereas the advanceShallow >>>> gets used in Lucene 9.11.1 and skips over a lot of docs resulting in faster >>>> completion. >>>> The difference with 116 million docs of Big5 index is >200ms in Lucene >>>> 9.12.0 to <=5ms in Lucene 9.11.1 >>>> >>>> I tried reindexing the process.name into another index but with >>>> docs_and_freqs enabled and the query latency came back to normal since it >>>> uses BlockImpactsDocsEnum as its ImpactsEnum. >>>> >>>> Is this a bug in the 912 postings reader ? Or is it not possible to use >>>> the BlockImpactsDocsEnum with the new postings format ? >>>> >>>> >>>> Thanks, >>>> Aniketh >>>> >>> >> >> -- >> Adrien >> >