Greg Bowyer created SOLR-3763:
---------------------------------

             Summary: Make solr use lucene filters directly
                 Key: SOLR-3763
                 URL: https://issues.apache.org/jira/browse/SOLR-3763
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 4.0, 4.1, 5.0
            Reporter: Greg Bowyer
            Assignee: Greg Bowyer


Presently Solr uses bitsets, queries and collectors to implement the concept of 
filters. This has proven to be very powerful, but it comes at the cost of 
introducing a large body of code into Solr, making it harder to optimise and 
maintain.

Another issue is that filters currently cache sub-optimally, given the changes 
in Lucene towards atomic (per-segment) readers.

Rather than patch these issues, this is an attempt to rework the filters in 
Solr to leverage the Filter subsystem from Lucene as much as possible.

In good time the aim is to get this to do the following:

∘ Handle setting up filter implementations that are able to cache correctly 
with reference to the AtomicReader that they are caching for, rather than for 
the entire index at large
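
A rough sketch of what this could look like against the Lucene 4.x API (how it 
would be wired into Solr is still open):

{code}
// Sketch only: Lucene's CachingWrapperFilter already keys its cache on the
// individual segment (the AtomicReader core key), so the wrapped filter is
// cached and invalidated per segment rather than for the whole index.
Filter raw    = new TermFilter(new Term("category", "100"));
Filter cached = new CachingWrapperFilter(raw);

// getDocIdSet is invoked once per AtomicReaderContext; segments that are
// unchanged across a reader reopen keep their cached DocIdSet.
{code}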

∘ Get the post filters working. I am thinking that this can be done via 
Lucene's ChainedFilter, with the "expensive" filters being put towards the end 
of the chain - this has different semantics internally to the original 
implementation, but IMHO it should produce the same results for end users
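
Something along these lines, assuming ChainedFilter applies its filters in 
order (the geo filter here is a hypothetical stand-in for any costly filter):

{code}
// Sketch: cheap, cacheable filters first; "expensive" post filters last,
// ANDed so the costly filters only ever see the surviving documents.
Filter[] chain = new Filter[] {
    new CachingWrapperFilter(new TermFilter(new Term("type", "product"))),
    expensiveGeoFilter  // hypothetical costly filter, applied last
};
Filter postFiltered = new ChainedFilter(chain, ChainedFilter.AND);
{code}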

∘ Learn how to create filters that are potentially more efficient. At present 
Solr basically runs a simple query that gathers a DocSet identifying the 
documents that we want filtered; it would be interesting to make use of filter 
implementations that are in theory faster than query-based filters (for 
instance, there are filters that are able to query the FieldCache)
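
For instance, Lucene 4.x ships a FieldCache-backed term filter; a sketch of 
using it for the category case:

{code}
// Sketch: a filter backed by the FieldCache rather than the terms index;
// it matches documents via the cached per-segment doc -> ord lookup
// instead of walking the postings for each term.
Filter byCategory = new FieldCacheTermsFilter("category", "100", "200", "300");
{code}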

∘ Learn how to decompose filters so that a complex filter query can 
(potentially) be cached as its constituent parts; for example, the filter 
below currently needs love, care and feeding to ensure that the filter cache 
is not unduly stressed

{code}
  'category:(100) OR category:(200) OR category:(300)'
{code}

Really there is no reason not to express this in a cached form as 

{code}
BooleanFilter(
    FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
    FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
    FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
)
{code}

This would yield better cache usage, I think, as we could reuse DocSets across 
multiple queries, as well as avoid issues when filters are presented in 
differing orders
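
In terms of the actual Lucene 4.x classes this might look like the following 
("CachedFilter" above being shorthand for CachingWrapperFilter):

{code}
// Sketch: each per-term filter is cached independently, so the same cached
// DocIdSet serves any boolean combination that mentions that term,
// regardless of clause order.
BooleanFilter category = new BooleanFilter();
for (String value : new String[] { "100", "200", "300" }) {
    Filter term = new CachingWrapperFilter(
        new TermFilter(new Term("category", value)));
    category.add(new FilterClause(term, BooleanClause.Occur.SHOULD));
}
{code}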

∘ Instead of end users providing costings we might (and this is a big might 
FWIW) be able to create a sort of execution plan for filters, leveraging a 
combination of what the index is able to tell us as well as sampling and 
"educated guesswork"; in essence this is what some DBMS software does - 
PostgreSQL, for example, has a genetic algorithm that attacks a travelling 
salesman-style ordering problem - to great effect

∘ I am sure I will probably come up with other ambitious ideas to plug in here 
..... :S 

