[ https://issues.apache.org/jira/browse/SOLR-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Bowyer updated SOLR-3763:
------------------------------
    Attachment: SOLR-3763-Make-solr-use-lucene-filters-directly.patch

Updated to latest trunk; the cache unit test still fails, as do the spatial lat/lon tests.

> Make solr use lucene filters directly
> -------------------------------------
>
>                 Key: SOLR-3763
>                 URL: https://issues.apache.org/jira/browse/SOLR-3763
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0, 4.1, 5.0
>            Reporter: Greg Bowyer
>            Assignee: Greg Bowyer
>      Attachments: SOLR-3763-Make-solr-use-lucene-filters-directly.patch, SOLR-3763-Make-solr-use-lucene-filters-directly.patch
>
>
> Presently Solr uses bitsets, queries, and collectors to implement the concept of filters. This has proven to be very powerful, but it comes at the cost of introducing a large body of code into Solr, making it harder to optimise and maintain.
> Another issue is that filters currently cache sub-optimally, given the changes in Lucene towards atomic readers.
> Rather than patch these issues, this is an attempt to rework the filters in Solr to leverage the Filter subsystem from Lucene as much as possible.
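The per-segment caching idea behind this can be sketched in a few lines. Note that `Segment`, `Filter`, and `CachingFilter` below are hypothetical stand-ins invented for illustration, not the actual Lucene 4.x API: the point is only that keying the cache on a per-segment core key (rather than on the whole index) lets unchanged segments keep their cached bitsets across reader reopens.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Lucene's per-segment AtomicReader.
class Segment {
    final Object coreCacheKey = new Object(); // stable across reader reopens
    final int maxDoc;
    Segment(int maxDoc) { this.maxDoc = maxDoc; }
}

// Hypothetical stand-in for a Lucene Filter producing a per-segment DocIdSet.
interface Filter {
    BitSet getDocIdSet(Segment segment);
}

// Caches the wrapped filter's result once per segment, keyed on the segment's
// core cache key: reopening the index recomputes only for new segments.
class CachingFilter implements Filter {
    private final Filter in;
    private final Map<Object, BitSet> cache = new HashMap<>();
    int misses = 0; // exposed so the sketch's behaviour can be observed

    CachingFilter(Filter in) { this.in = in; }

    @Override
    public BitSet getDocIdSet(Segment segment) {
        return cache.computeIfAbsent(segment.coreCacheKey, k -> {
            misses++;
            return in.getDocIdSet(segment);
        });
    }
}

public class PerSegmentCacheSketch {
    public static void main(String[] args) {
        // Toy filter: "matches" every even doc id within a segment.
        Filter evenDocs = segment -> {
            BitSet bits = new BitSet(segment.maxDoc);
            for (int i = 0; i < segment.maxDoc; i += 2) bits.set(i);
            return bits;
        };
        CachingFilter cached = new CachingFilter(evenDocs);

        Segment seg1 = new Segment(8);
        Segment seg2 = new Segment(4);
        cached.getDocIdSet(seg1);
        cached.getDocIdSet(seg2);

        // A "reopened" index still contains seg1 unchanged plus a new segment:
        Segment seg3 = new Segment(6);
        cached.getDocIdSet(seg1); // cache hit, nothing recomputed
        cached.getDocIdSet(seg3);

        System.out.println("misses=" + cached.misses); // → misses=3
    }
}
```

A cache keyed on the top-level reader instead would have invalidated everything on reopen, recomputing seg1's bitset as well - which is the sub-optimal behaviour described above.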
> In good time the aim is to get this to do the following:
> ∘ Handle setting up filter implementations that are able to cache correctly with reference to the AtomicReader they are caching for, rather than for the entire index at large.
> ∘ Get the post filters working. I am thinking this can be done via Lucene's chained filter, with the "expensive" filters being put towards the end of the chain - this has different semantics internally to the original implementation, but IMHO should have the same result for end users.
> ∘ Learn how to create filters that are potentially more efficient. At present Solr basically runs a simple query that gathers a DocSet of the documents we want filtered; it would be interesting to make use of filter implementations that are in theory faster than query filters (for instance, there are filters that are able to query the FieldCache).
> ∘ Learn how to decompose filters so that a complex filter query can (potentially) be cached as its constituent parts; for example, the filter below currently needs love, care, and feeding to ensure that the filter cache is not unduly stressed:
> {code}
> 'category:(100) OR category:(200) OR category:(300)'
> {code}
> Really there is no reason not to express this in a cached form as:
> {code}
> BooleanFilter(
>   FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
>   FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
>   FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
> )
> {code}
> This would yield better cache usage, I think, as we can reuse docsets across multiple queries as well as avoid issues when filters are presented in differing orders.
> ∘ Instead of end users providing costing we might (and this is a big might FWIW) be able to create a sort of execution plan of filters, leveraging a combination of what the index is able to tell us as well as sampling and "educated guesswork"; in essence
> this is what some DBMS software does - PostgreSQL, for example, has a genetic algorithm that attempts to solve the travelling-salesman problem - to great effect.
> ∘ I am sure I will probably come up with other ambitious ideas to plug in here ..... :S
> Patches obviously forthcoming, but the bulk of the work can be followed here:
> https://github.com/GregBowyer/lucene-solr/commits/solr-uses-lucene-filters

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org