[ https://issues.apache.org/jira/browse/SOLR-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Bowyer updated SOLR-3763:
------------------------------

    Description: 
Presently Solr uses bitsets, queries and collectors to implement the concept of 
filters. This has proven to be very powerful, but it comes at the cost of 
introducing a large body of code into Solr, making it harder to optimise and 
maintain.

Another issue is that filters currently cache suboptimally given the changes in 
Lucene towards atomic readers.

Rather than patch these issues, this is an attempt to rework the filters in 
Solr to leverage the Filter subsystem from Lucene as much as possible.

In time, the aim is for this work to do the following:

∘ Handle setting up filter implementations that are able to cache correctly 
with reference to the AtomicReader they are caching for, rather than for the 
entire index at large
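
As a sketch of what this might look like with stock Lucene classes (an 
assumption about the eventual design, not a settled one), CachingWrapperFilter 
already caches one DocIdSet per AtomicReader:

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermFilter;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;

// CachingWrapperFilter caches one DocIdSet per AtomicReader (segment),
// so after a reopen only new or merged segments need recomputing.
Filter cached = new CachingWrapperFilter(
    new TermFilter(new Term("category", "100")));
{code}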

∘ Get the post filters working. I am thinking this can be done via Lucene's 
ChainedFilter, with the "expensive" filters placed towards the end of the 
chain - this has different internal semantics to the original implementation 
but IMHO should give the same results for end users
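
A rough sketch of that ordering; the two filters here are hypothetical 
stand-ins (imagine the expensive one is a costly spatial or join filter rather 
than a simple term filter):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.ChainedFilter;
import org.apache.lucene.queries.TermFilter;
import org.apache.lucene.search.Filter;

// Hypothetical placeholder filters for illustration only.
Filter cheapFilter = new TermFilter(new Term("category", "100"));
Filter expensiveFilter = new TermFilter(new Term("category", "200"));

// ChainedFilter applies the filters in array order and intersects their
// DocIdSets; unlike a Solr post filter, the last filter is still fully
// evaluated - the semantic difference mentioned above.
Filter postFiltered = new ChainedFilter(
    new Filter[] { cheapFilter, expensiveFilter }, ChainedFilter.AND);
{code}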

∘ Learn how to create filters that are potentially more efficient. At present 
Solr basically runs a simple query that gathers a DocSet covering the documents 
we want filtered; it would be interesting to make use of filter implementations 
that are in theory faster than query filters (for instance, there are filters 
that are able to query the FieldCache)
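
One existing example of that family is FieldCacheTermsFilter, which answers a 
term-set filter from the FieldCache rather than the terms index (shown purely 
as an illustration of the idea):

{code}
import org.apache.lucene.search.FieldCacheTermsFilter;
import org.apache.lucene.search.Filter;

// Matches documents whose single-valued "category" field holds any of
// the listed values, reading per-document values from the FieldCache
// instead of running a query against the terms index.
Filter categories = new FieldCacheTermsFilter("category", "100", "200", "300");
{code}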

∘ Learn how to decompose filters so that a complex filter query can 
(potentially) be cached as its constituent parts; for example, the filter below 
currently needs love, care and feeding to ensure that the filter cache is not 
unduly stressed

{code}
  'category:(100) OR category:(200) OR category:(300)'
{code}

Really there is no reason not to express this in a cached form such as:

{code}
BooleanFilter(
    FilterClause(CachedFilter(TermFilter(Term("category", "100"))), SHOULD),
    FilterClause(CachedFilter(TermFilter(Term("category", "200"))), SHOULD),
    FilterClause(CachedFilter(TermFilter(Term("category", "300"))), SHOULD)
)
{code}

This would, I think, yield better cache usage, as we can reuse DocSets across 
multiple queries as well as avoid issues when the same filters are presented in 
differing orders.
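
In concrete Lucene 4.x terms this might come out roughly as follows, with 
CachingWrapperFilter standing in for the pseudocode CachedFilter (an assumption 
on my part about the mapping):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.BooleanFilter;
import org.apache.lucene.queries.FilterClause;
import org.apache.lucene.queries.TermFilter;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.CachingWrapperFilter;

BooleanFilter byCategory = new BooleanFilter();
for (String value : new String[] { "100", "200", "300" }) {
  // Each term filter is wrapped and cached on its own, so a query that
  // filters on, say, category 100 alone reuses the same cached DocIdSet.
  byCategory.add(new FilterClause(
      new CachingWrapperFilter(new TermFilter(new Term("category", value))),
      Occur.SHOULD));
}
{code}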

∘ Instead of end users providing costing, we might (and this is a big might, 
FWIW) be able to create a sort of execution plan for filters, leveraging a 
combination of what the index is able to tell us along with sampling and 
"educated guesswork"; in essence this is what some DBMS software does - 
PostgreSQL, for example, uses a genetic algorithm to attack its 
travelling-salesman-like join-ordering problem - to great effect
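
As a toy illustration of the "what the index can tell us" part, document 
frequency is one cheap selectivity signal; a hypothetical helper could order 
term filters so the most selective runs first (a real planner would also have 
to weigh per-filter evaluation cost):

{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Hypothetical helper: sort terms by ascending docFreq so the most
// selective filter is applied first. docFreq is only a proxy; it says
// nothing about how expensive each filter is to evaluate.
public static Term[] orderBySelectivity(final IndexReader reader, Term... terms) {
  Term[] ordered = terms.clone();
  Arrays.sort(ordered, new Comparator<Term>() {
    @Override
    public int compare(Term a, Term b) {
      try {
        int da = reader.docFreq(a), db = reader.docFreq(b);
        return da < db ? -1 : (da == db ? 0 : 1);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  });
  return ordered;
}
{code}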

∘ I am sure I will come up with other ambitious ideas to plug in here ... :S 

Patches are obviously forthcoming, but the bulk of the work can be followed 
here: https://github.com/GregBowyer/lucene-solr/commits/solr-uses-lucene-filters

> Make solr use lucene filters directly
> -------------------------------------
>
>                 Key: SOLR-3763
>                 URL: https://issues.apache.org/jira/browse/SOLR-3763
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0, 4.1, 5.0
>            Reporter: Greg Bowyer
>            Assignee: Greg Bowyer
>
