Thomas Mueller created OAK-7300:
-----------------------------------

             Summary: Lucene Index: per-column selectivity to improve cost 
estimation
                 Key: OAK-7300
                 URL: https://issues.apache.org/jira/browse/OAK-7300
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: lucene, query
            Reporter: Thomas Mueller
            Assignee: Thomas Mueller
             Fix For: 1.10


In OAK-6735 we improved cost estimation for Lucene indexes. However, the 
following case still does not work as expected: a very common property is 
indexed (many nodes have that property), and each value of that property is 
more or less unique. In this case, the estimated cost is currently the total 
number of documents that contain that property. For the condition 
"property is not null" this is correct, but for the common case "property = 
x" the estimated cost is far too high.

A known workaround is to set the "costPerEntry" for the given index to a low 
value, for example 0.2. However, this isn't a good solution, as it affects all 
properties and queries.

It would be good to be able to set the selectivity per property, for example by 
specifying the number of distinct values, or (better yet) the average number of 
entries for a given key (1 for unique values, 2 meaning that for each distinct 
value there are two documents on average).
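
A minimal sketch of how such a per-property setting could feed into the 
estimate. All names here (SelectivityCostSketch, setAvgEntriesPerKey, the 
"orderId" property) are hypothetical and only illustrate the arithmetic; this 
is not the existing Oak API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Oak. */
class SelectivityCostSketch {

    // property name -> average number of documents per distinct value
    // (assumed to be configured manually or computed while indexing)
    private final Map<String, Double> avgEntriesPerKey = new HashMap<>();

    void setAvgEntriesPerKey(String property, double avg) {
        avgEntriesPerKey.put(property, avg);
    }

    /** Derive the average entries per key from a distinct-value count. */
    static double avgEntriesFromDistinct(long docsWithProperty, long distinctValues) {
        if (distinctValues <= 0) {
            return docsWithProperty;
        }
        return (double) docsWithProperty / distinctValues;
    }

    /** "property is not null": every document containing the property matches. */
    double estimateNotNull(long docsWithProperty) {
        return docsWithProperty;
    }

    /** "property = x": use the per-property average if one is known. */
    double estimateEquals(String property, long docsWithProperty) {
        Double avg = avgEntriesPerKey.get(property);
        if (avg == null) {
            // no selectivity known: fall back to the number of documents
            // that contain the property, as described above
            return docsWithProperty;
        }
        // the estimate can never exceed the documents that have the property
        return Math.min(avg, docsWithProperty);
    }

    public static void main(String[] args) {
        SelectivityCostSketch sketch = new SelectivityCostSketch();
        sketch.setAvgEntriesPerKey("orderId", 1.0);
        System.out.println(sketch.estimateNotNull(1_000_000L));
        System.out.println(sketch.estimateEquals("orderId", 1_000_000L));
    }
}
```

With one million documents and an average of 1 entry per key, the estimate for 
"property = x" drops from 1,000,000 to 1, while "property is not null" is 
unchanged.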

That value can be set manually (cost override), and it can be set 
automatically, e.g. when building the index, or updated from time to time 
during the index update, using a cardinality estimation algorithm. That 
doesn't have to be accurate; we could use a rough approximation such as 
HyperBitBit.
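
To illustrate the idea of deriving the average number of entries per key from 
a cardinality estimate, here is a rough sketch. Note that it uses linear 
counting (a plain bitmap estimator), not HyperBitBit, because it is shorter to 
show; the class name, the bitmap size, and the hash mixing step are 
assumptions made for this example, not a proposed implementation.

```java
import java.util.BitSet;

/** Hypothetical illustration only: linear counting, not HyperBitBit. */
class LinearCountingSketch {

    private final int m;          // number of bits in the bitmap
    private final BitSet bits;
    private long totalValues;     // values seen, including duplicates

    LinearCountingSketch(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    /** Record one indexed value of the property. */
    void add(String value) {
        totalValues++;
        bits.set(Math.floorMod(mix(value.hashCode()), m));
    }

    // Bit mixing (the 32-bit MurmurHash3 finalizer) to spread hash codes
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Linear counting estimate: m * ln(m / zeroBits). */
    double estimateDistinct() {
        int zeros = m - bits.cardinality();
        if (zeros == 0) {
            // bitmap saturated: report the largest value the formula
            // can produce (the case zeros = 1)
            return m * Math.log(m);
        }
        return m * Math.log((double) m / zeros);
    }

    /** Average number of entries per distinct value, at least 1. */
    double averageEntriesPerKey() {
        double distinct = estimateDistinct();
        if (distinct == 0) {
            return 0;
        }
        return Math.max(1.0, totalValues / distinct);
    }

    public static void main(String[] args) {
        LinearCountingSketch sketch = new LinearCountingSketch(1 << 16);
        for (int round = 0; round < 2; round++) {
            for (int i = 0; i < 1000; i++) {
                sketch.add("value-" + i);
            }
        }
        System.out.println(sketch.estimateDistinct());
        System.out.println(sketch.averageEntriesPerKey());
    }
}
```

Linear counting needs a bitmap that is not much smaller than the number of 
distinct values, so for large indexes a constant-memory estimator such as the 
one mentioned above would be the better fit; the way the estimate is turned 
into an average number of entries per key stays the same.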



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
