[jira] [Updated] (LUCENE-5637) Scaling scale function

Chris Russell (JIRA) Thu, 01 May 2014 13:58:44 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Russell updated LUCENE-5637:
----------------------------------

    Description: 
The existing scale() function examines the scores of all documents in the index 
in order to calculate its scale constant.  This does not perform well in Solr 
on very large indexes or with costly scoring mechanisms such as geo distance.

I have developed a patch that allows the scale function to only score documents 
that match the given filters, thus improving performance of the scale function. 
 

For test queries involving two scale operations where one was scaling the 
result of keyword scoring and the other was scaling the result of geo distance 
scoring on an index with ~2 million documents, query time was improved from 
~400 ms with vanilla scale to ~190 ms with new scale.  A similar query using no 
scaling ran in ~90 ms.  (Each enhanced scale function added to the query 
appeared to add about 50 ms of processing.)
Scaled query: q = scale(keywords, 0, 90) and scale(geo, 0, 10)
Unscaled query: q = keywords and geo
In both cases fq includes keywords and geo.
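
To make the idea concrete, here is a minimal sketch of a scale step that only looks at documents matching a filter. It is not the patch and not Lucene's own scale implementation: the class name FilteredScaleSketch, the method signature, and the use of java.util.BitSet as a stand-in for the filter are all assumptions made for illustration.

```java
import java.util.BitSet;

/** Illustrative only; not Lucene's scale implementation and not the attached patch. */
final class FilteredScaleSketch {

    /**
     * Linearly maps the raw scores of filter-matching documents onto
     * [targetMin, targetMax]. Documents outside the filter are never read
     * and keep the array default of 0.
     */
    static float[] scale(float[] rawScores, BitSet matches,
                         float targetMin, float targetMax) {
        // Pass 1: find the min and max raw score among matching documents only.
        float lo = Float.POSITIVE_INFINITY;
        float hi = Float.NEGATIVE_INFINITY;
        for (int doc = matches.nextSetBit(0);
             doc >= 0 && doc < rawScores.length;
             doc = matches.nextSetBit(doc + 1)) {
            lo = Math.min(lo, rawScores[doc]);
            hi = Math.max(hi, rawScores[doc]);
        }

        // The scale constant; 0 when the matching scores have no spread
        // (or when nothing matched at all).
        float range = hi - lo;
        float factor = range > 0 ? (targetMax - targetMin) / range : 0f;

        // Pass 2: rescale the matching documents.
        float[] scaled = new float[rawScores.length];
        for (int doc = matches.nextSetBit(0);
             doc >= 0 && doc < rawScores.length;
             doc = matches.nextSetBit(doc + 1)) {
            scaled[doc] = (rawScores[doc] - lo) * factor + targetMin;
        }
        return scaled;
    }
}
```

In this sketch an expensive raw score (such as geo distance) for a non-matching document is never consulted, which is where the saving described above would come from.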

In order to accomplish this goal I had to introduce several changes:
1) In the IndexSearcher.search method, where scorers are created and then used 
to score on a per-AtomicReaderContext basis, I had to make it so that all 
scorers would be created before any scoring was done.  This was so that the 
scale function would have an opportunity to observe the entire index before 
being asked to score something.
2) Introduced a new property to the Bits interface that indicates whether or 
not the bits provide constant-time access.  The reason is given in 3).
3) FilterSet used to return null when asked for its bits because it did not 
have any; it only had an iterator.  This was an issue when trying to make it so that 
scale would only score documents matching the filter.  Thus a new bits 
implementation was added (LazyIteratorBackedBits) that could expose an iterator 
as a Bits implementation.  It advances the iterator on-demand when asked about 
a document and uses an OpenBitSet to keep track of what it has advanced beyond. 
 Thus once the iterator is exhausted it provides constant-time answers like any 
other Bits.
4) Introduced a function on the ValueSource interface to allow a Bits to be 
passed in for filtering purposes.
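
The lazy, iterator-backed bits described in 2) and 3) can be sketched roughly as follows. This is a hypothetical stand-in, not the class from the patch: it uses java.util.BitSet and java.util.PrimitiveIterator instead of Lucene's OpenBitSet and doc-id iterator, it assumes the iterator yields doc ids in ascending order, and the method name isConstantTime is invented here to illustrate the new property.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;

/** Illustrative stand-in for the idea behind LazyIteratorBackedBits. */
final class LazyIteratorBackedBitsSketch {
    private final PrimitiveIterator.OfInt docIds; // assumed to be ascending
    private final BitSet seen = new BitSet();     // doc ids consumed so far
    private int lastDoc = -1;                     // highest doc id consumed
    private boolean exhausted = false;

    LazyIteratorBackedBitsSketch(PrimitiveIterator.OfInt ascendingDocIds) {
        this.docIds = ascendingDocIds;
    }

    /** Returns whether the given doc id is in the set, advancing only as far as needed. */
    boolean get(int index) {
        // Advance the iterator until it has passed the requested doc id
        // or has nothing left to give.
        while (!exhausted && lastDoc < index) {
            if (docIds.hasNext()) {
                lastDoc = docIds.nextInt();
                seen.set(lastDoc);
            } else {
                exhausted = true;
            }
        }
        return seen.get(index);
    }

    /** Hypothetical property: true once lookups no longer need to advance the iterator. */
    boolean isConstantTime() {
        return exhausted;
    }
}
```

A lookup for a doc id at or below the highest id already consumed is answered from the bit set alone; only lookups beyond that point pay for advancing the iterator.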

This was originally developed against Solr 4.2 but I have ported it to Solr 
4.8.  There is one failing unit test related to code that has been added in the 
interim, AnalyzingInfixSuggesterTest.testRandomNRT.  I have not been able to 
figure out why this test fails.  All other tests pass.

In relation to implementation detail 1) above, the introduction of 
LeafCollectors in trunk (LUCENE-5527) has caused somewhat of an issue.  It 
seems to no longer be possible to create multiple scorers without immediately 
scoring on that LeafCollector.  This may be related to the encapsulation of the 
Collector.setNextReader() method which was very useful for this purpose.
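
The ordering requirement behind detail 1) can be illustrated without any Lucene types. In the sketch below every name (TwoPhaseSearchSketch, SegmentScorer, ScorerFactory) is hypothetical, and segments are modelled as plain float arrays of raw scores; it only shows the create-all-scorers-first, score-afterwards order, not how IndexSearcher is actually structured.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; none of these types exist in Lucene. */
final class TwoPhaseSearchSketch {

    /** Stand-in for a per-segment scorer. */
    interface SegmentScorer {
        float score(int doc);
    }

    /** Stand-in for whatever creates a scorer for one segment. */
    interface ScorerFactory {
        SegmentScorer create(float[] segmentRawScores);
    }

    static List<Float> search(List<float[]> segments, ScorerFactory factory) {
        // Phase 1: create a scorer for every segment before scoring anything,
        // so a factory that gathers index-wide statistics sees all segments.
        List<SegmentScorer> scorers = new ArrayList<>();
        for (float[] segment : segments) {
            scorers.add(factory.create(segment));
        }

        // Phase 2: only now score the documents, segment by segment.
        List<Float> results = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            for (int doc = 0; doc < segments.get(i).length; doc++) {
                results.add(scorers.get(i).score(doc));
            }
        }
        return results;
    }
}
```

If scorer creation and scoring were interleaved per segment instead, a scorer that depends on an index-wide minimum or maximum would score the first segment before later segments had been observed.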


> Scaling scale function
> ----------------------
>
>                 Key: LUCENE-5637
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5637
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Chris Russell
>            Priority: Minor
>              Labels: patch, performance
>             Fix For: 4.8
>
>         Attachments: Lucene-5637.patch


