[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-29 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5938:
-
Attachment: LUCENE-5938.patch

I updated the patch to recent trunk modifications and ran the benchmark again, 
I think it is ready. In summary this patch:
 - introduces a new doc-id set impl similar to FixedBitSet but which is much 
faster in the sparse case and a bit slower in the dense case (between 1.5x and 
4x slower according to benchmarks).
 - introduces a doc-id set builder that supports random write access by 
starting with a sparse bit set and upgrading to a dense FixedBitSet when the 
cardinality of the set increases
 - changes MultiTermQueryWrapperFilter and TermsFilter to use this new builder
 - removes CONSTANT_SCORE_AUTO_REWRITE and makes CONSTANT_SCORE_FILTER_REWRITE 
the default

For queries that match many documents ({{wikimedium10m.tasks}}, the builder 
always ends up building a FixedBitSet), this new patch can be slower or faster 
depending on the cost to iterate the matching terms (since they are enumerated 
only once now) vs. the cost of building the doc-id set.

For queries that match few documents ({{low_freq.tasks}}, attached to this 
issue), this new patch makes things faster since it just sets a couple of bits 
in random order and then iterates over them instead of merging documents coming 
from several other iterators on the fly using a priority queue.

Independently of the benchmarks, I think it's a good simplification to remove 
the constant-score auto rewrite mode.

{noformat}
wikimedium10m.tasks

TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
  IntNRQ8.79  (9.6%)8.41  (3.4%)   
-4.3% ( -15% -9%)
  Fuzzy2   60.83 (11.1%)   58.34  (8.7%)   
-4.1% ( -21% -   17%)
OrNotHighMed   98.35 (13.8%)   97.12 (10.9%)   
-1.3% ( -22% -   27%)
   OrHighNotHigh   18.88 (13.7%)   18.71 (11.1%)   
-0.9% ( -22% -   27%)
   OrNotHighHigh   17.10 (13.4%)   17.03 (11.2%)   
-0.4% ( -22% -   27%)
OrNotHighLow  126.52 (13.6%)  126.85 (10.9%)
0.3% ( -21% -   28%)
   OrHighMed   76.90 (14.0%)   77.25 (11.4%)
0.5% ( -21% -   30%)
OrHighNotLow   41.29 (14.3%)   41.51 (12.4%)
0.5% ( -22% -   31%)
OrHighNotMed   57.70 (13.6%)   58.03 (11.6%)
0.6% ( -21% -   29%)
   OrHighLow   73.14 (14.7%)   73.74 (12.0%)
0.8% ( -22% -   32%)
 LowSloppyPhrase  127.22  (8.6%)  128.43  (3.8%)
1.0% ( -10% -   14%)
  OrHighHigh   29.11 (15.1%)   29.41 (12.2%)
1.0% ( -22% -   33%)
HighSloppyPhrase   12.83 (10.4%)   13.02  (5.3%)
1.4% ( -12% -   19%)
 Prefix3  113.92  (9.9%)  115.71  (2.4%)
1.6% (  -9% -   15%)
PKLookup  297.13  (9.2%)  302.03  (3.5%)
1.6% ( -10% -   15%)
 MedSloppyPhrase   38.60  (8.8%)   39.26  (3.7%)
1.7% (  -9% -   15%)
 AndHighHigh   71.39  (6.9%)   72.67  (0.9%)
1.8% (  -5% -   10%)
HighTerm   87.17  (7.9%)   88.85  (2.1%)
1.9% (  -7% -   12%)
   MedPhrase   74.60  (9.3%)   76.10  (4.3%)
2.0% ( -10% -   17%)
   LowPhrase   21.67  (9.6%)   22.12  (4.0%)
2.1% ( -10% -   17%)
  AndHighMed  297.13  (9.4%)  303.73  (2.1%)
2.2% (  -8% -   15%)
  HighPhrase   16.65  (8.2%)   17.04  (3.7%)
2.3% (  -8% -   15%)
HighSpanNear4.56 (10.7%)4.67  (6.1%)
2.4% ( -12% -   21%)
 LowTerm  769.53  (7.8%)  788.24  (2.0%)
2.4% (  -6% -   13%)
  AndHighLow  726.76 (10.6%)  744.51  (4.2%)
2.4% ( -11% -   19%)
 MedSpanNear   65.27  (9.3%)   67.00  (3.2%)
2.6% (  -9% -   16%)
Wildcard  114.28  (9.1%)  118.05  (7.4%)
3.3% ( -12% -   21%)
 LowSpanNear  174.75 (10.3%)  180.83  (3.5%)
3.5% (  -9% -   19%)
  Fuzzy1   67.63 (11.3%)   70.08  (3.2%)
3.6% (  -9% -   20%)
 MedTerm  209.00  (9.3%)  216.71  (1.9%)
3.7% (  -6% -   16%)
 Respell   48.30 (10.6%)   50.58  (1.7%)
4.7% (  -6% -   18%)

low_freq.tasks

TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
PKLookup  278.50  (8.8%)  297.48 (13.9%)
6.8% ( -14% -   

[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-29 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5938:
-
Attachment: LUCENE-5938.patch

Thanks for the review Mike, here is a new patch that tries to address your 
concerns.

bq. Maybe fix word to be long instead

Done.

bq. do we need to guard against zeroWords being 0?

I added a comment as well as a test as you suggested.

{quote}
Maybe add some more comments around tricky parts of SparseFixedBitSet.
E.g., the different branches inside set? And, it looks strange doing
1L  i, but in fact the JVM/processor make that 1L  (i % 64). And
Iterator.currentOrNextDoc is scary looking... do we have enough tests
here?
{quote}

I added more comments, hopefully they make sense. Please let me know if there 
are still things that are not clear. currentOrNextDoc is a bit complicated 
because of the different cases that need to be handled but the structure is 
actually quite simple so at least get and set should be easy to understand. I 
extracted the insertion of a new long to a separate method, this should make 
set easier to read?

Regarding the shift, indeed it relies on the fact that shifts are mod 64 
(FixedBitSet does the same). I added some comments about it.

Regarding the tests, BaseDocIdSetTestCase.testAgainstBitSet tests various 
densities and assertEquals checks nextDoc(), docId(), interleaved calls to 
nextDoc() and advance(), and that the oal.util.Bits representation is 
consistent with the iterator. I think that is good?

bq. I hit this test failure, which reproduces with the patch

I dug that one, and the reason is that the searcher is created with threads, so 
the exception is indeed wrapped into an ExecutionException, which is in turn 
wrapped into a RuntimeException to by-pass the fact that ExecutionException is 
checked. It doesn't reproduce on trunk because the default rewrite method reads 
data from the index in MultiTermQuery.rewrite (collectTerms) which does not run 
in a thread pool. You can reproduce the issue on trunk by setting the rewrite 
method of the term range query to {{CONSTANT_SCORE_FILTER_REWRITE}}. I fixed 
the test in the patch to walk down the causes of the exception that is thrown.

 New DocIdSet implementation with random write access
 

 Key: LUCENE-5938
 URL: https://issues.apache.org/jira/browse/LUCENE-5938
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, 
 LUCENE-5938.patch, LUCENE-5938.patch, low_freq.tasks


 We have a great cost API that is supposed to help make decisions about how to 
 best execute queries. However, due to the fact that several of our filter 
 implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, 
 either we use the cost API and make bad decisions, or need to fall back to 
 heuristics which are not as good such as 
 RandomAccessFilterStrategy.useRandomAccess which decides that random access 
 should be used if the first doc in the set is less than 100.
 On the other hand, we also have some nice compressed and cacheable DocIdSet 
 implementation but we cannot make use of them because TermsFilter requires a 
 DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
 that we have that supports random access.
 I think it would be nice to replace FixedBitSet in those filters with another 
 DocIdSet that would also support random write access but would have a better 
 cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5938:
-
Attachment: low_freq.tasks
LUCENE-5938.patch

OK, I did something slightly different. It happens that all queries in the 
tasks file match a pretty large number of documents, which favors FixedBitSet. 
So now I've configured a threshold: FixedBitSet is used when more than maxDoc / 
16384 docs match and SparseFixedBitSet is used otherwise. Since 
SparseFixedBitSet is much faster than FixedBitSet for such low densities, the 
cost to start by creating a SparseFixedBitSet and then upgrading to a 
FixedBitSet is negligible compared to starting with a FixedBitSet from the 
beginning (see http://people.apache.org/~jpountz/doc_id_sets2.html).

So now the benchmark looks better for those queries that match many documents:
{noformat}
  IntNRQ7.10  (6.3%)6.57  (9.6%)   
-7.4% ( -21% -9%)
 Prefix3  110.36 (14.8%)  109.88  (9.5%)   
-0.4% ( -21% -   28%)
Wildcard   62.83 (14.5%)   66.93  (9.5%)
6.5% ( -15% -   35%)
{noformat}

I don't think the improvement with {{Wildcard}} is noise, I can reproduce it 
easily. I think the reason is that since the default is filter rewrite now, we 
don't have to compute the terms intersection twice, which is costly with 
wildcard queries.

I also wanted to see what happens with queries that match fewer documents 
compared to boolean rewrite, so I generated a set of wildcard queries that are 
expanded to a couple of terms and don't match too many documents (see tasks 
file attached):

{noformat}
Wildcard   99.90  (9.0%)  294.66 (30.6%)  
194.9% ( 142% -  257%)
{noformat}

For such queries, the new default rewrite method looks much better.

 New DocIdSet implementation with random write access
 

 Key: LUCENE-5938
 URL: https://issues.apache.org/jira/browse/LUCENE-5938
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, 
 low_freq.tasks


 We have a great cost API that is supposed to help make decisions about how to 
 best execute queries. However, due to the fact that several of our filter 
 implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, 
 either we use the cost API and make bad decisions, or need to fall back to 
 heuristics which are not as good such as 
 RandomAccessFilterStrategy.useRandomAccess which decides that random access 
 should be used if the first doc in the set is less than 100.
 On the other hand, we also have some nice compressed and cacheable DocIdSet 
 implementation but we cannot make use of them because TermsFilter requires a 
 DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
 that we have that supports random access.
 I think it would be nice to replace FixedBitSet in those filters with another 
 DocIdSet that would also support random write access but would have a better 
 cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-11 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5938:
-
Attachment: LUCENE-5938.patch

I have been trying to think about how to do it and here is a first iteration.

This DocIdSet, called SparseFixedBitSet is similar to FixedBitSet except that 
it only stores words that have at least one bit that is set. It is inspired 
from hash array mapped tries, and uses Long.bitCount/numberOfTrailingZeros 
operations in order to lookup words given an index.

Some more notes:
 - like FixedBitSet it supports random access and implements the Bits interface
 - although not completely accurate, its cost() method should return numbers 
that are very close to the set cardinality, especially on sparse sets that have 
bits uniformly dispersed
 - memory usage is much better on sparse sets compared to FixedBitSet
 - it supports random write access, so could be used eg. in TermsFilter

I re-ran the benchmark that I had for our doc-id sets with this new impl and it 
seems to perform very well: http://people.apache.org/~jpountz/doc_id_sets2.html

 New DocIdSet implementation with random write access
 

 Key: LUCENE-5938
 URL: https://issues.apache.org/jira/browse/LUCENE-5938
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-5938.patch


 We have a great cost API that is supposed to help make decisions about how to 
 best execute queries. However, due to the fact that several of our filter 
 implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, 
 either we use the cost API and make bad decisions, or need to fall back to 
 heuristics which are not as good such as 
 RandomAccessFilterStrategy.useRandomAccess which decides that random access 
 should be used if the first doc in the set is less than 100.
 On the other hand, we also have some nice compressed and cacheable DocIdSet 
 implementation but we cannot make use of them because TermsFilter requires a 
 DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
 that we have that supports random access.
 I think it would be nice to replace FixedBitSet in those filters with another 
 DocIdSet that would also support random write access but would have a better 
 cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

2014-09-11 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5938:
-
Attachment: LUCENE-5938.patch

Here is an updated patch:
 - this sparse set is used in MultiTermQueryWrapperFilter and TermsFilter
 - constant-score auto rewrite has been removed in favor of filter rewrite

I didn't change BooleanFilter and ChainedFilter because I think they need more 
work. For example, it looks wrong to consume all documents of a filter when 
computing a conjunction instead of taking a leap-frog approach. Let's tackle 
this in another issue?

Tests pass.

 New DocIdSet implementation with random write access
 

 Key: LUCENE-5938
 URL: https://issues.apache.org/jira/browse/LUCENE-5938
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-5938.patch, LUCENE-5938.patch


 We have a great cost API that is supposed to help make decisions about how to 
 best execute queries. However, due to the fact that several of our filter 
 implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, 
 either we use the cost API and make bad decisions, or need to fall back to 
 heuristics which are not as good such as 
 RandomAccessFilterStrategy.useRandomAccess which decides that random access 
 should be used if the first doc in the set is less than 100.
 On the other hand, we also have some nice compressed and cacheable DocIdSet 
 implementation but we cannot make use of them because TermsFilter requires a 
 DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
 that we have that supports random access.
 I think it would be nice to replace FixedBitSet in those filters with another 
 DocIdSet that would also support random write access but would have a better 
 cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org