Thanks for explaining the alternatives and root cause. Will create a bug and move the discussion to GitHub.
On Wed, 2 Apr, 2025, 13:53 Adrien Grand, <jpou...@gmail.com> wrote: > > It was fast because (once the collector has filled its priority queue), > we'd check the (constant) impacts to find the first block that's strictly > better than the min competitive score. Since all scores are equal, that > would quickly skip to the end > > This sounds correct to me. To expand on your analysis, SlowImpactsEnum > computes the max possible score as SimScorer.score(freq = > Integer.MAX_VALUE, norm = 1), while scores are computed as > SimScorer.score(freq = 1, norm = 1). Assuming BM25Similarity, this > over-estimates scores by a factor of ~2, so skipping works much much worse > (and not at all when querying a single term). > > Even though Lucene101PostingsFormat no longer uses SlowImpactsEnum, it > still has the same issue of computing the max possible score with > freq=MAX_VALUE instead of freq=1 when term frequencies are disabled. > > > We can address this in OpenSearch by just wrapping things in a > ConstantScoreQuery if frequencies are not enabled for the field. > > In case you missed it, this would produce different scores, as scores on a > field that disables term frequencies and norms still have an IDF component, > or the ConstantScoreQuery would need to be boosted by this IDF-only score. > > > I'm wondering if it would make sense to push a scoreMode override into > TermQuery's createWeight. If the field doesn't have frequencies, would it > make sense to do a similar scoreMode override to ConstantScoreQuery (i.e. > COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)? > > If you want to do something like this, we should wrap the TermWeight under > a ConstantScoreWeight (with the proper constant score) so that we don't > pass a ScoreMode that doesn't require scores and then have the collector > call TermScorer#score(). Looking at the test framework, it should fail > queries that try to do this: > https://github.com/apache/lucene/blob/525bf34bfdfc16cc220d326d2cf30541f1afef29/lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingScorer.java#L119 > . > > In any case, I'd be keen to fix our ImpactsEnums to return a maximum > possible term frequency of 1 instead of Integer.MAX_VALUE when frequencies > are not indexed, which should also address the problem you're observing. > > > On Wed, Apr 2, 2025 at 9:04 PM Michael Froh <msf...@gmail.com> wrote: > >> Hey, >> >> Full disclosure -- I sit at a desk next to Aniketh, so we chatted about >> this one in the real world. >> >> Our current working theory is as follows: >> >> We're using the TOP_SCORES score mode on both the old and new code paths. >> On the old code path, we were returning a BlockImpactsDocsEnum, even though >> we didn't have frequencies. I'm guessing that the impacts were constant >> across all blocks, since we have no frequencies and no norms. It was fast >> because (once the collector has filled its priority queue), we'd check the >> (constant) impacts to find the first block that's strictly better than the >> min competitive score. Since all scores are equal, that would quickly skip >> to the end. On the new code path, we always return the SlowImpactsEnum, >> since we don't have frequencies. Once we fill up the collector's priority >> queue, we're not able to do any impact checks to try to find a >> competitive block, so we iterate through all the doc IDs for the term. >> >> The best solution would probably be to flip the score mode from >> TOP_SCORES to TOP_DOCS, and avoid looking at impacts altogether, since we >> don't have frequencies anyway. Then early termination logic will kick in, >> which is arguably even better than the previous fast skipping logic. We can >> address this in OpenSearch by just wrapping things in a ConstantScoreQuery >> if frequencies are not enabled for the field. >> >> I'm wondering if it would make sense to push a scoreMode override into >> TermQuery's createWeight. If the field doesn't have frequencies, would it >> make sense to do a similar scoreMode override to ConstantScoreQuery (i.e. >> COMPLETE -> COMPLETE_NO_SCORES, everything else -> TOP_DOCS)? >> >> Thanks, >> Froh >> >> >> On Wed, Apr 2, 2025 at 11:29 AM ANIKETH JAIN <checkanik...@gmail.com> >> wrote: >> >>> Hey folks, >>> >>> While investigating a regression in OpenSearch versions 2.17.1 ( Lucene >>> 9.11.1 ) and 2.18.0 ( Lucene 9.12.0 ) for simple Term Query in Big5 >>> workload over process.name field, I noticed that the new >>> Lucene912PostingsReader creates the ImpactsEnum by wrapping SlowImpactsEnum >>> over postings when a field only has IndexOptions.DOCS >>> >>> curl -X POST "http://localhost:9200/big5/_search" -H "Content-Type: >>> application/json" -d '{ "query": { "term": { "process.name": "kernel" } >>> } }' >>> >>> >>> Lucene912PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo, >>> BlockTermState state, int flags) Has an extra check on *indexHasFreqs* >>> >>> if (state.docFreq >= BLOCK_SIZE >>> && indexHasFreqs >>> && (indexHasPositions == false >>> || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == >>> false)) { >>> return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state); >>> } >>> >>> >>> Whereas Lucene99PostingsReader creates the faster BlockImpactsDocsEnum >>> for fields with IndexOptions.DOCS and only creates the SlowImpactsEnum when >>> document frequency is less than 128 ( Block size ) >>> >>> >>> Lucene99PostingsReader ->> ImpactsEnum impacts(FieldInfo fieldInfo, >>> BlockTermState state, int flags) >>> >>> if (state.docFreq <= BLOCK_SIZE) { >>> // no skip data >>> return new SlowImpactsEnum(postings(fieldInfo, state, null, flags)); >>> } >>> >>> >>> if (indexHasPositions == false >>> || PostingsEnum.featureRequested(flags, PostingsEnum.POSITIONS) == >>> false) { >>> return new BlockImpactsDocsEnum(fieldInfo, (IntBlockTermState) state); >>> } >>> >>> >>> >>> Since Lucene 9.12.0 wraps a SlowImpactsEnum which has a no-op for >>> advanceShallow method, the Term Query is never able to skip data when >>> called from the bulk scorer via DISI#nextDoc() Whereas the advanceShallow >>> gets used in Lucene 9.11.1 and skips over a lot of docs resulting in faster >>> completion. >>> The difference with 116 million docs of Big5 index is >200ms in Lucene >>> 9.12.0 to <=5ms in Lucene 9.11.1 >>> >>> I tried reindexing the process.name into another index but with >>> docs_and_freqs enabled and the query latency came back to normal since it >>> uses BlockImpactsDocsEnum as its ImpactsEnum. >>> >>> Is this a bug in the 912 postings reader ? Or is it not possible to use >>> the BlockImpactsDocsEnum with the new postings format ? >>> >>> >>> Thanks, >>> Aniketh >>> >> > > -- > Adrien >