Greg Bowyer created SOLR-3763:
---------------------------------

             Summary: Make solr use lucene filters directly
                 Key: SOLR-3763
                 URL: https://issues.apache.org/jira/browse/SOLR-3763
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 4.0, 4.1, 5.0
            Reporter: Greg Bowyer
            Assignee: Greg Bowyer
Presently Solr uses bitsets, queries and collectors to implement the concept of filters. This has proven to be very powerful, but it comes at the cost of introducing a large body of code into Solr, making it harder to optimise and maintain. Another issue is that filters currently cache sub-optimally given the changes in Lucene towards atomic readers.

Rather than patch these issues, this is an attempt to rework the filters in Solr to leverage the Filter subsystem from Lucene as much as possible. In good time the aim is to get this to do the following:

∘ Handle setting up filter implementations that are able to correctly cache with reference to the AtomicReader that they are caching for, rather than for the entire index at large.

∘ Get the post filters working. I am thinking this can be done via Lucene's chained filter, with the "expensive" filters being put towards the end of the chain. This has different semantics internally to the original implementation, but IMHO it should produce the same result for end users.

∘ Learn how to create filters that are potentially more efficient. At present Solr basically runs a simple query that gathers a DocSet of the documents that we want filtered; it would be interesting to make use of filter implementations that are in theory faster than query filters (for instance, there are filters that are able to query the FieldCache).

∘ Learn how to decompose filters so that a complex filter query can (potentially) be cached as its constituent parts. For example, the filter below currently needs love, care and feeding to ensure that the filter cache is not unduly stressed:
{code}
'category:(100) OR category:(200) OR category:(300)'
{code}
Really there is no reason not to express this in a cached form as:
{code}
BooleanFilter(
  FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
  FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
  FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
)
{code}
This would yield better cache usage, I think, as we can reuse DocSets across multiple queries as well as avoid issues when filters are presented in differing orders.

∘ Instead of end users providing costing, we might (and this is a big might, FWIW) be able to create a sort of execution plan for filters, leveraging a combination of what the index is able to tell us as well as sampling and "educated guesswork". In essence this is what some DBMS software does; PostgreSQL, for example, uses a genetic algorithm against its travelling-salesman-like join-ordering problem, to great effect.

∘ I am sure I will probably come up with other ambitious ideas to plug in here ..... :S
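As a rough illustration of the first point, the per-AtomicReader caching idea can be sketched in plain Java with no Lucene dependency. `SegmentFilterCache` and its `segmentCoreKey` parameter are stand-ins of my own invention for Lucene's per-reader core cache key and DocIdSet; this is a sketch of the caching shape, not Solr code:

```java
import java.util.BitSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: cache computed filter bitsets per segment rather than for the whole
// top-level index. The map is keyed on the segment's identity (Lucene would
// use the AtomicReader core cache key), so a cached entry stays valid for as
// long as its segment lives, even as other segments come and go.
public class SegmentFilterCache {
    private final Map<Object, BitSet> cache = new IdentityHashMap<>();
    private int misses = 0;

    // Return the cached bitset for this segment, computing it on first use.
    public BitSet getOrCompute(Object segmentCoreKey, Function<Object, BitSet> compute) {
        return cache.computeIfAbsent(segmentCoreKey, k -> {
            misses++;
            return compute.apply(k);
        });
    }

    public int misses() { return misses; }

    public static void main(String[] args) {
        SegmentFilterCache cache = new SegmentFilterCache();
        Object segA = new Object(); // stand-ins for two segment core keys
        Object segB = new Object();
        Function<Object, BitSet> filter = k -> new BitSet(8);

        cache.getOrCompute(segA, filter);
        cache.getOrCompute(segB, filter);
        cache.getOrCompute(segA, filter); // cache hit: segment A is unchanged
        System.out.println("misses=" + cache.misses()); // 2: one per distinct segment
    }
}
```

The point of the identity key is that a reopened top-level reader reuses the entries for the segments it kept, instead of invalidating the whole cache.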
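The chained post-filter idea can likewise be sketched without Lucene. This is a hypothetical `CostOrderedChain`, not Lucene's ChainedFilter: clauses carry an estimated cost, the chain sorts them cheapest-first, and the expensive filter only ever tests documents that survived the cheap ones:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch: a chained filter that orders its clauses by estimated cost and
// applies them cheapest-first, so "expensive" (post) filters are only asked
// about documents the cheaper filters already accepted.
public class CostOrderedChain {
    static class Clause {
        final IntPredicate accepts;
        final int estimatedCost; // lower = run earlier
        int docsSeen = 0;        // how many docs this clause was asked about

        Clause(int estimatedCost, IntPredicate accepts) {
            this.estimatedCost = estimatedCost;
            this.accepts = accepts;
        }
    }

    public static BitSet apply(int maxDoc, List<Clause> clauses) {
        List<Clause> ordered = new ArrayList<>(clauses);
        ordered.sort(Comparator.comparingInt(c -> c.estimatedCost));
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc); // start from all documents
        for (Clause c : ordered) {
            for (int doc = result.nextSetBit(0); doc >= 0; doc = result.nextSetBit(doc + 1)) {
                c.docsSeen++;
                if (!c.accepts.test(doc)) result.clear(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Clause cheap = new Clause(1, doc -> doc % 2 == 0);    // keeps even docs
        Clause expensive = new Clause(100, doc -> doc < 50);  // pretend this is costly
        BitSet out = apply(100, List.of(expensive, cheap));   // given order is irrelevant
        System.out.println(out.cardinality());   // 25: even docs below 50
        System.out.println(expensive.docsSeen);  // 50: only the cheap filter's survivors
    }
}
```

As noted above, this evaluates filters in a different order than Solr's current post-filter collector path, but the accepted document set should come out the same.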
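For the FieldCache point, the contrast with a query filter is that a FieldCache-style filter consults a forward, per-document array of field values instead of walking the inverted index. A minimal stand-alone sketch (the `price` array plays the role of an uninverted field; no Lucene API is used):

```java
import java.util.BitSet;

// Sketch: a "FieldCache"-style range filter. Rather than running a query to
// collect matching docs from the inverted index, test each document's value
// directly against a forward array of per-doc field values - the kind of
// structure Lucene's FieldCache provides.
public class FieldCacheRangeFilter {
    public static BitSet range(int[] perDocValues, int lower, int upper) {
        BitSet bits = new BitSet(perDocValues.length);
        for (int doc = 0; doc < perDocValues.length; doc++) {
            int v = perDocValues[doc];
            if (v >= lower && v <= upper) bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] price = {5, 12, 7, 30, 12}; // price[doc] = field value for doc
        BitSet hits = range(price, 7, 15);
        System.out.println(hits); // matches docs 1, 2 and 4
    }
}
```

For range-style filters over many terms, one linear pass over a cached array like this can be cheaper than expanding the range into a large term query.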
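Finally, the cache-reuse claim behind the decomposed BooleanFilter form can be demonstrated in miniature. In this sketch (plain Java, with a toy term-to-docs map standing in for the index) each term's bitset is cached individually, so the same OR filter with clauses in a different order recomputes nothing:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch: decompose an OR-of-terms filter into per-term cached bitsets, then
// union the cached parts. Because caching happens per clause rather than per
// whole filter string, clause order no longer affects cache reuse.
public class DecomposedFilter {
    private final Map<String, BitSet> termCache = new HashMap<>();
    private final Map<String, int[]> index; // term -> matching doc ids
    private int computations = 0;

    public DecomposedFilter(Map<String, int[]> index) { this.index = index; }

    private BitSet termBits(String term) {
        return termCache.computeIfAbsent(term, t -> {
            computations++;
            BitSet bits = new BitSet();
            for (int doc : index.getOrDefault(t, new int[0])) bits.set(doc);
            return bits;
        });
    }

    // Union the per-term cached bitsets - the SHOULD clauses of the filter.
    public BitSet or(String... terms) {
        BitSet result = new BitSet();
        for (String t : terms) result.or(termBits(t));
        return result;
    }

    public int computations() { return computations; }

    public static void main(String[] args) {
        Map<String, int[]> index = new HashMap<>();
        index.put("category:100", new int[] {0, 2});
        index.put("category:200", new int[] {1});
        index.put("category:300", new int[] {2, 3});

        DecomposedFilter f = new DecomposedFilter(index);
        BitSet first = f.or("category:100", "category:200", "category:300");
        BitSet reordered = f.or("category:300", "category:100", "category:200");
        System.out.println(first.equals(reordered)); // same docs either way
        System.out.println(f.computations());        // 3: each term computed once
    }
}
```

The same per-term entries would also serve a filter such as `category:(100) OR category:(300)`, which the whole-filter cache key cannot do.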