Re: [PR] Improve BayesianScoreQuery and LogOddsFusionQuery with base rate prior, weighted Log-OP, and parameter estimation [lucene]

via GitHub Thu, 30 Apr 2026 05:19:32 -0700


jaepil commented on code in PR #15948:
URL: https://github.com/apache/lucene/pull/15948#discussion_r3167841428



##########
lucene/core/src/java/org/apache/lucene/search/BayesianScoreEstimator.java:
##########
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Random;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.StoredFields;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * Estimates {@link BayesianScoreQuery} parameters (alpha, beta, base rate) from corpus statistics
+ * via pseudo-query sampling.
+ *
+ * <p>The estimation algorithm:
+ *
+ * <ol>
+ *   <li>Sample N documents randomly from the index
+ *   <li>For each document, create a pseudo-query from its first few tokens in the target field
+ *   <li>Run each pseudo-query via BM25 and collect the score distribution
+ *   <li>Estimate: beta = median(scores), alpha = 1 / std(scores)
+ *   <li>Estimate base rate: mean fraction of documents scoring above the 95th percentile
+ * </ol>
+ *
+ * @lucene.experimental
+ */
+public class BayesianScoreEstimator {

Review Comment:
   Great question, and the answer is: calibration doesn't need to model the 
user query distribution — it only needs the score distribution to be 
representative of the corpus's BM25 dynamic range.
   
   Here's why: α and β are derived from the BM25 score distribution's spread 
(alpha = 1/std) and center (beta = median). These are scale and location 
statistics of the score distribution. As long 
as the pseudo-queries exercise the same scoring code path that real user 
queries will hit (BM25Similarity over the same field's term frequencies and IDF 
table), the resulting α/β describe the scorer's calibration, which is invariant 
to which specific terms appear in the query. The base rate is similarly a 
corpus-level fraction, not query-conditional.
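
   To make the α/β step concrete, here is a minimal sketch of the estimate 
described above. It is not the PR's BayesianScoreEstimator: the class and 
method names are hypothetical, the scores are made up, the population 
standard deviation is an assumption (the estimator may use the sample form), 
and the calibration form sigmoid(alpha * (score - beta)) is assumed rather 
than taken from the PR.

```java
import java.util.Arrays;

/** Hypothetical sketch; not the PR's BayesianScoreEstimator. */
final class SigmoidCalibrationSketch {

  /** beta: median of the sampled scores (center of the distribution). */
  static double median(double[] scores) {
    double[] sorted = scores.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    return (n % 2 == 1) ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
  }

  /** Population standard deviation (assumption); alpha is its reciprocal. */
  static double std(double[] scores) {
    double mean = Arrays.stream(scores).average().orElse(0.0);
    double sumSq = 0.0;
    for (double s : scores) {
      sumSq += (s - mean) * (s - mean);
    }
    return Math.sqrt(sumSq / scores.length);
  }

  /** Assumed calibration form: sigmoid(alpha * (score - beta)). */
  static double calibrate(double score, double alpha, double beta) {
    return 1.0 / (1.0 + Math.exp(-alpha * (score - beta)));
  }

  public static void main(String[] args) {
    double[] sampled = {1.0, 2.0, 3.0, 4.0, 5.0}; // made-up pseudo-query scores
    double beta = median(sampled);
    double alpha = 1.0 / std(sampled);
    for (double s : sampled) {
      System.out.println(s + " -> " + calibrate(s, alpha, beta));
    }
  }
}
```

   Under these assumptions a score equal to the median maps to exactly 0.5, 
and a score one standard deviation above it maps to sigmoid(1). Empty input 
and a zero standard deviation are not handled in this sketch.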
   
   A useful sanity check: the sigmoid is monotone and α = 1/std is always 
positive, so α and β never change the ranking produced by a single scorer; 
they only adjust where on the (0,1) curve scores land for downstream Log-OP 
fusion. Even substantial pseudo-query/real-query distribution mismatch only 
shifts the calibration curve, which is the same effect as picking a different 
α/β manually.
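
   A small sketch of where α and β enter the fusion step, assuming the 
textbook weighted logarithmic opinion pool for a binary event, 
sigmoid(sum_i w_i * logit(p_i)). The class and method names are hypothetical, 
and the PR's LogOddsFusionQuery may differ (for example by adding a base rate 
term).

```java
/** Hypothetical sketch of weighted log-odds pooling; not the PR's code. */
final class LogOddsPoolSketch {

  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  /** Log-odds of a probability strictly between 0 and 1. */
  static double logit(double p) {
    return Math.log(p / (1.0 - p));
  }

  /** Weighted logarithmic opinion pool: sigmoid(sum_i w_i * logit(p_i)). */
  static double fuse(double[] probs, double[] weights) {
    double sum = 0.0;
    for (int i = 0; i < probs.length; i++) {
      sum += weights[i] * logit(probs[i]);
    }
    return sigmoid(sum);
  }

  public static void main(String[] args) {
    double[] probs = {0.9, 0.6};   // made-up calibrated probabilities
    double[] weights = {0.5, 0.5}; // made-up fusion weights
    System.out.println(fuse(probs, weights));
  }
}
```

   Because logit inverts the sigmoid, a scorer calibrated as 
sigmoid(α(s − β)) contributes w·α·(s − β) to the pooled log-odds. In this 
sketch α acts as a per-scorer scale and β as a per-scorer offset, which is 
why a calibration mismatch moves fused scores without reordering any single 
scorer's results.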
   
   That said, the "random docs + first N tokens" approach in this PR does have 
a real weakness on corpora with shared boilerplate prefixes (license headers, 
structured templates), where pseudo-queries collapse into near-duplicates. I'm 
thinking about replacing the document-text path with reservoir sampling over 
the field's indexed vocabulary, which would give uniform random samples of 
unique terms instead — a more defensible "what does this scorer's distribution 
look like" probe than "what do the first 5 words of random documents look like."
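
   For reference, the reservoir sampling idea can be sketched with plain 
Java collections (Algorithm R). The class name is hypothetical and the 
iterator here stands in for the field's vocabulary; in Lucene the terms would 
presumably come from a TermsEnum over the field, which this sketch does not 
attempt to wire up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of Algorithm R; not part of the PR. */
final class ReservoirSketch {

  /** Returns up to k items drawn uniformly from a stream of unknown length. */
  static <T> List<T> sample(Iterator<T> items, int k, Random random) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    while (items.hasNext()) {
      T item = items.next();
      seen++;
      if (reservoir.size() < k) {
        // Fill phase: the first k items are kept unconditionally.
        reservoir.add(item);
      } else {
        // Replace phase: keep the new item with probability k / seen.
        long j = (long) (random.nextDouble() * seen); // uniform in [0, seen)
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    // Made-up vocabulary standing in for a field's indexed terms.
    List<String> vocabulary = Arrays.asList("apache", "bm25", "lucene", "query", "score", "term");
    System.out.println(sample(vocabulary.iterator(), 3, new Random(42)));
  }
}
```

   Each term is visited once and only k terms are held in memory, so the 
probe does not depend on document text and cannot collapse onto shared 
boilerplate prefixes.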
   
   We did test this calibration approach during the research phase across 
several corpora and didn't see issues, but I'd like to redo that validation 
directly against the Lucene implementation as a follow-up PR before this leaves 
@lucene.experimental status.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

