jaepil commented on code in PR #15948:
URL: https://github.com/apache/lucene/pull/15948#discussion_r3167841428
##########
lucene/core/src/java/org/apache/lucene/search/BayesianScoreEstimator.java:
##########
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Random;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.StoredFields;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Estimates {@link BayesianScoreQuery} parameters (alpha, beta, base rate) from corpus statistics
+ * via pseudo-query sampling.
+ *
+ * <p>The estimation algorithm:
+ *
+ * <ol>
+ *   <li>Sample N documents randomly from the index
+ *   <li>For each document, create a pseudo-query from its first few tokens in the target field
+ *   <li>Run each pseudo-query via BM25 and collect the score distribution
+ *   <li>Estimate: beta = median(scores), alpha = 1 / std(scores)
+ *   <li>Estimate base rate: mean fraction of documents scoring above the 95th percentile
+ * </ol>
+ *
+ * @lucene.experimental
+ */
+public class BayesianScoreEstimator {

Review Comment:
Great question. The short answer is that calibration doesn't need to model the user query distribution; it only needs the score distribution to be representative of the corpus's BM25 dynamic range.

Here's why: α and β are derived from the spread (alpha = 1/std) and center (beta = median) of the BM25 score distribution. These are scale and location statistics. As long as the pseudo-queries exercise the same scoring code path that real user queries will hit (BM25Similarity over the same field's term frequencies and IDF table), the resulting α/β describe the scorer's calibration, which is invariant to which specific terms appear in the query. The base rate is similarly a corpus-level fraction, not query-conditional.

A useful sanity check: the sigmoid is monotone and α = 1/std is positive, so α and β never change the ranking produced by a single scorer; they only adjust where on the (0,1) curve scores land for downstream Log-OP fusion. Even a substantial mismatch between the pseudo-query and real-query distributions only shifts the calibration curve, which is the same effect as picking a different α/β manually.

That said, the "random docs + first N tokens" approach in this PR does have a real weakness on corpora with shared boilerplate prefixes (license headers, structured templates), where pseudo-queries collapse into near-duplicates.
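
To make the α/β definitions above concrete, here is a minimal, self-contained sketch. It is not the PR's code: the class name `SigmoidCalibrationSketch` is hypothetical, the logistic form `1 / (1 + exp(-alpha * (score - beta)))` is my assumption about how the two parameters are applied, and the use of the population standard deviation is likewise an assumption.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of the PR): fits alpha/beta from sampled scores. */
final class SigmoidCalibrationSketch {
  final double alpha; // slope, assumed to be 1 / std(scores)
  final double beta; // midpoint, assumed to be median(scores)

  SigmoidCalibrationSketch(double alpha, double beta) {
    this.alpha = alpha;
    this.beta = beta;
  }

  /** Fits beta = median(scores) and alpha = 1 / std(scores), using the population std. */
  static SigmoidCalibrationSketch fit(double[] scores) {
    if (scores.length < 2) {
      throw new IllegalArgumentException("need at least two scores");
    }
    double[] sorted = scores.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    double median = (n % 2 == 1) ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);

    double mean = 0;
    for (double s : sorted) {
      mean += s;
    }
    mean /= n;

    double variance = 0;
    for (double s : sorted) {
      variance += (s - mean) * (s - mean);
    }
    double std = Math.sqrt(variance / n);
    if (std == 0) {
      throw new IllegalArgumentException("scores have zero spread");
    }
    return new SigmoidCalibrationSketch(1.0 / std, median);
  }

  /** Maps a raw score into (0, 1); strictly increasing in score because alpha > 0. */
  double calibrate(double score) {
    return 1.0 / (1.0 + Math.exp(-alpha * (score - beta)));
  }
}
```

Because `calibrate` is strictly increasing, sorting documents by raw score or by calibrated score gives the same order; changing alpha or beta only moves where scores sit on the curve, with the median score mapping to 0.5.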
I'm thinking about replacing the document-text path with reservoir sampling over the field's indexed vocabulary, which would give uniform random samples of unique terms instead. That is a more defensible probe of "what does this scorer's distribution look like" than "what do the first 5 words of random documents look like."

We did test this calibration approach during the research phase across several corpora and didn't see issues, but I'd like to redo that validation directly against the Lucene implementation in a follow-up PR before this leaves @lucene.experimental status.
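
For reference, the reservoir sampling idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the planned implementation: `TermReservoirSketch` and `sample` are hypothetical names, the input is a plain `Iterator` standing in for whatever enumerates the field's unique terms, and the stream is assumed to hold fewer than `Integer.MAX_VALUE` items.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of the PR): classic "Algorithm R" reservoir sampling. */
final class TermReservoirSketch {

  /**
   * Returns up to k items drawn uniformly at random, without replacement, from a stream of
   * unknown length, in one pass and O(k) memory. Assumes fewer than Integer.MAX_VALUE items.
   */
  static <T> List<T> sample(Iterator<T> stream, int k, Random random) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0;
    while (stream.hasNext()) {
      T item = stream.next();
      seen++;
      if (reservoir.size() < k) {
        // Fill phase: the first k items always enter the reservoir.
        reservoir.add(item);
      } else {
        // Draw j uniformly from [0, seen); the new item replaces slot j when j < k,
        // which happens with probability k / seen.
        int j = random.nextInt(seen);
        if (j < k) {
          reservoir.set(j, item);
        }
      }
    }
    return reservoir;
  }
}
```

A call such as `TermReservoirSketch.sample(terms.iterator(), 200, new Random(seed))` would then yield 200 distinct terms regardless of how often each appears at the start of documents, which is what sidesteps the shared-prefix problem. Wiring this to the index's term enumeration is left out here on purpose.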
