Re: [PR] OAK-12221 Cost estimation should be lower if more conditions are indexed [jackrabbit-oak]

via GitHub Fri, 22 May 2026 05:25:16 -0700


bhabegger commented on code in PR #2912:
URL: https://github.com/apache/jackrabbit-oak/pull/2912#discussion_r3288227784



##########
oak-search/src/main/java/org/apache/jackrabbit/oak/plugins/index/search/spi/query/FulltextIndexPlanner.java:
##########
@@ -960,6 +977,116 @@ private int getMaxPossibleNumDocs(Map<String, 
PropertyDefinition> propDefns, Fil
         return minNumDocs;
     }
 
+    /**
+     * OAK-12221: cost estimation using a multiplicative selectivity model.
+     * <p>
+     * Each indexed condition contributes a selectivity (the probability that 
a document
+     * satisfies the condition, given it has the field): {@code mcvFraction} 
for MCV matches,
+     * {@code 1/weight} when only a weight is known, and {@code 1.0} when 
{@code weight == 1}.
+     * The combined selectivity is the product of all per-condition 
selectivities, applied
+     * once at the end against the most restrictive field's doc count. This 
makes the
+     * estimate independent of iteration order.
+     */
+    private int getMaxPossibleNumDocsBySelectivity(Map<String, 
PropertyDefinition> propDefns, Filter filter) {

Review Comment:
   This method is quite long and likely has many cases. May it is worth a unit 
test ? 



##########
oak-search/src/main/java/org/apache/jackrabbit/oak/plugins/index/search/spi/query/FulltextIndexPlanner.java:
##########
@@ -960,6 +977,116 @@ private int getMaxPossibleNumDocs(Map<String, 
PropertyDefinition> propDefns, Fil
         return minNumDocs;
     }
 
+    /**
+     * OAK-12221: cost estimation using a multiplicative selectivity model.
+     * <p>
+     * Each indexed condition contributes a selectivity (the probability that 
a document
+     * satisfies the condition, given it has the field): {@code mcvFraction} 
for MCV matches,
+     * {@code 1/weight} when only a weight is known, and {@code 1.0} when 
{@code weight == 1}.
+     * The combined selectivity is the product of all per-condition 
selectivities, applied
+     * once at the end against the most restrictive field's doc count. This 
makes the
+     * estimate independent of iteration order.
+     */
+    private int getMaxPossibleNumDocsBySelectivity(Map<String, 
PropertyDefinition> propDefns, Filter filter) {
+        IndexStatistics indexStatistics = indexNode.getIndexStatistics();
+        if (indexStatistics == null) {

Review Comment:
   Do these index statistics exist already ? If not wouldn't we be sending this 
warning systematically ? A bit noisy no ? 



##########
oak-lucene/src/test/java/org/apache/jackrabbit/oak/plugins/index/lucene/IndexPlannerTest.java:
##########
@@ -557,7 +556,7 @@ public void indexedButZeroWeightProps() throws Exception{
 
         // Since, the index has no entries for "bar", estimated entry count 
for plan2 would be 0
         assertEquals(0, plan2.getEstimatedEntryCount());
-        assertThat(plan2.getEstimatedEntryCount(), 
lessThan(plan1.getEstimatedEntryCount()));
+        assertTrue(plan2.getEstimatedEntryCount() < 
plan1.getEstimatedEntryCount());

Review Comment:
   Hmmm, interesting. I do have a preference for assertThat(x).isLessThan(y) 
(but the assertj version hamcrest I don't know so well).
   
   Why ? Because in case of failure of the test you get "expected value below 5 
but got 4" and not "expected true but got false". This makes debugging the test 
way easier ;)



##########
oak-search/src/main/java/org/apache/jackrabbit/oak/plugins/index/search/spi/query/FulltextIndexPlanner.java:
##########


Review Comment:
   Can't we fail faster in the absence if indexStatistics (higher up) rather 
than check multiple times ? 
   
   Basically a FullTextIndexPlannerWithStats and a FullTextIndexPlannerNoStats 
? You toggle could then also be used to selectively choose which class instance 
to use rather than have to be repeated. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OAK-12221 Cost estimation should be lower if more conditions are indexed [jackrabbit-oak]

Reply via email to