[jira] [Updated] (LUCENE-7462) Faster search APIs for doc values

2016-10-20 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7462:
-
Attachment: LUCENE-7462.patch

Here is a patch that tries to implement this advanceExact method on all codecs. 
Initially, I wanted to require that the target be strictly greater than the 
current doc id, but this caused issues with comparators that may need to get the 
value multiple times and with scorers that call Scorer.score() multiple times 
(which causes the norm to be decoded twice). So the current patch only requires 
that the target be greater than or equal to the current document. I managed to 
get the whole test suite passing twice in a row, and luceneutil still gives 
results that are similar to above.
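
For illustration only, the contract described above (target greater than or 
equal to the current document, so the same document can be queried more than 
once) could be sketched as follows. ToyNumericDocValues and its array-backed 
storage are hypothetical stand-ins, not the code from the patch.

```java
// Toy stand-in for a numeric doc-values iterator (hypothetical, not Lucene's class).
final class ToyNumericDocValues {
    private final int[] docsWithValue; // sorted doc ids that have a value
    private final long[] values;       // values[i] belongs to docsWithValue[i]
    private int docID = -1;            // last target passed to advanceExact
    private int index = 0;             // first entry whose doc id is >= docID
    private boolean exists = false;    // whether the current doc has a value

    ToyNumericDocValues(int[] docsWithValue, long[] values) {
        this.docsWithValue = docsWithValue;
        this.values = values;
    }

    /** Positions on target and reports whether it has a value; target may equal the current doc. */
    boolean advanceExact(int target) {
        if (target < docID) {
            throw new IllegalArgumentException("target " + target + " is before current doc " + docID);
        }
        // Skip entries before the target; a repeated target leaves index unchanged.
        while (index < docsWithValue.length && docsWithValue[index] < target) {
            index++;
        }
        docID = target;
        exists = index < docsWithValue.length && docsWithValue[index] == target;
        return exists;
    }

    long longValue() {
        if (!exists) {
            throw new IllegalStateException("current doc has no value");
        }
        return values[index];
    }

    public static void main(String[] args) {
        ToyNumericDocValues dv = new ToyNumericDocValues(new int[] {1, 4}, new long[] {10L, 40L});
        // Calling twice on the same document is allowed, e.g. a scorer asked to score twice.
        System.out.println(dv.advanceExact(1) + " " + dv.advanceExact(1) + " " + dv.longValue());
    }
}
```

A strictly-greater requirement would reject the second advanceExact(1) call in 
main, which is the situation with comparators and Scorer.score() mentioned above.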

> Faster search APIs for doc values
> -
>
> Key: LUCENE-7462
> URL: https://issues.apache.org/jira/browse/LUCENE-7462
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: master (7.0)
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7462-advanceExact.patch, LUCENE-7462.patch
>
>
> While the iterator API helps deal with sparse doc values more efficiently, it 
> also makes search-time operations more costly. For instance, the old 
> random-access API made it possible to compute facets on a given segment 
> without any conditionals, by just incrementing the counter at index 
> {{ordinal+1}}, while the new API requires advancing the iterator if necessary 
> and then checking whether it is exactly on the right document or not.
> Since it is very common for fields to exist across most documents, I suspect 
> codecs will keep an internal structure that is similar to the current codec 
> in the dense case, by having a dense representation of the data and just 
> making the iterator skip over the minority of documents that do not have a 
> value.
> I suggest that we add APIs that make things cheaper at search time. For 
> instance in the case of SORTED doc values, it could look like 
> {{LegacySortedDocValues}} with the additional restriction that documents can 
> only be consumed in order. Codecs that can implement this API efficiently 
> would hide it behind a {{SortedDocValues}} adapter, and then at search time 
> facets and comparators (which liked the {{LegacySortedDocValues}} API better) 
> would either unwrap or hide the SortedDocValues they got behind a more 
> random-access API (which would only happen in the truly sparse case if the 
> codec optimizes the dense case).
> One challenge is that we already use the same idea for hiding single-valued 
> impls behind multi-valued impls, so we would need to enforce the order in 
> which the wrapping needs to happen. At first sight, it seems that it would be 
> best to do the single-value-behind-multi-value-API wrapping above the 
> random-access-behind-iterator-API wrapping. The complexity of 
> wrapping/unwrapping in the right order could be contained in the 
> {{DocValues}} helper class.
> I think this change would also simplify search-time consumption of doc 
> values, which currently needs to spend several lines of code positioning the 
> iterator every time it needs to do something interesting with doc values.
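
As a rough illustration of the facet-counting difference described in the issue, 
the sketch below contrasts the two styles on plain arrays. It assumes that a 
missing value is reported as ordinal -1, so that slot 0 of the counts array 
absorbs documents without a value; the class and method names are hypothetical 
and no Lucene classes are used.

```java
// Hypothetical sketch of per-segment facet counting; not Lucene code.
final class FacetCountSketch {

    // Random-access style: ords[doc] is the ordinal of doc, or -1 if it has no value.
    // No conditionals are needed because -1 + 1 maps to slot 0.
    static int[] countRandomAccess(int[] ords, int[] hits, int numOrds) {
        int[] counts = new int[numOrds + 1];
        for (int doc : hits) {
            counts[ords[doc] + 1]++;
        }
        return counts;
    }

    // Iterator style: docsWithValue is sorted, docOrds[i] is the ordinal of docsWithValue[i],
    // and hits must be sorted in increasing order.
    static int[] countIterator(int[] docsWithValue, int[] docOrds, int[] hits, int numOrds) {
        int[] counts = new int[numOrds + 1];
        int i = 0;
        for (int doc : hits) {
            // Advance the iterator if necessary...
            while (i < docsWithValue.length && docsWithValue[i] < doc) {
                i++;
            }
            // ...then check whether it is exactly on the right document.
            if (i < docsWithValue.length && docsWithValue[i] == doc) {
                counts[docOrds[i] + 1]++;
            } else {
                counts[0]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] hits = {0, 1, 3, 4};
        int[] a = countRandomAccess(new int[] {2, -1, 0, 2, 1}, hits, 3);
        int[] b = countIterator(new int[] {0, 2, 3, 4}, new int[] {2, 0, 2, 1}, hits, 3);
        System.out.println(java.util.Arrays.toString(a) + " " + java.util.Arrays.toString(b));
    }
}
```

Both methods produce the same counts; the iterator version just pays for an 
extra loop and an extra branch per hit.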



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7462) Faster search APIs for doc values

2016-10-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7462:
-
Attachment: LUCENE-7462-advanceExact.patch

I have been playing with the idea of having an advanceExact method (which I 
guess is the alternative to adding a 2nd search API for doc values). It removes 
stress on consumers, which can call this method blindly because it never 
advances beyond the target document. It also removes some stress on the codec, 
since it no longer has to find the next document that has a value.
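
To make the two points above concrete, here is a minimal sketch that assumes a 
dense field whose "has a value" information is stored as a bit set. 
DenseDocsWithField and ConsumerPatterns are hypothetical names, and the code is 
not taken from the attached patch.

```java
import java.util.BitSet;

// Hypothetical dense representation: one bit per document that has a value.
final class DenseDocsWithField {
    private final BitSet docsWithField;
    private int docID = -1;

    DenseDocsWithField(BitSet docsWithField) {
        this.docsWithField = docsWithField;
    }

    int docID() {
        return docID;
    }

    // Iterator style: the codec must find the first doc >= target that has a value.
    int advance(int target) {
        int next = docsWithField.nextSetBit(target);
        docID = next == -1 ? Integer.MAX_VALUE : next;
        return docID;
    }

    // advanceExact style: a single bit test, never moving past the target.
    boolean advanceExact(int target) {
        docID = target;
        return docsWithField.get(target);
    }
}

final class ConsumerPatterns {
    // With advance(), the consumer has to position the iterator and then compare.
    static boolean hasValueWithAdvance(DenseDocsWithField dv, int doc) {
        if (dv.docID() < doc) {
            dv.advance(doc);
        }
        return dv.docID() == doc;
    }

    // With advanceExact(), the consumer can call the method blindly.
    static boolean hasValueWithAdvanceExact(DenseDocsWithField dv, int doc) {
        return dv.advanceExact(doc);
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(0);
        bits.set(1);
        bits.set(3);
        DenseDocsWithField a = new DenseDocsWithField(bits);
        DenseDocsWithField b = new DenseDocsWithField(bits);
        for (int doc = 0; doc < 5; doc++) {
            System.out.println(doc + ": " + hasValueWithAdvance(a, doc) + " " + hasValueWithAdvanceExact(b, doc));
        }
    }
}
```

In this toy model the two consumer patterns return the same answers for 
increasing documents, but the advanceExact path avoids both the positioning 
check on the consumer side and the nextSetBit scan on the codec side.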

I ran the wikimedium10m benchmark, to which I added the sorting tasks from the 
nightly benchmark to check the impact. There seems to be a consistent speedup 
for queries for which norms are the bottleneck (term queries and simple 
conjunctions/disjunctions) and for sorted queries (TermTitleSort and TermDTSort).

{noformat}
Task              QPS baseline   StdDev  QPS patch   StdDev  Pct diff
Fuzzy2                   55.31  (20.1%)      54.45  (18.5%)     -1.6%  ( -33% -   46%)
OrNotHighLow            875.16   (3.3%)     870.60   (2.9%)     -0.5%  (  -6% -    5%)
MedSloppyPhrase         210.38   (3.9%)     209.40   (3.8%)     -0.5%  (  -7% -    7%)
LowSloppyPhrase         126.86   (2.5%)     126.74   (2.1%)     -0.1%  (  -4% -    4%)
AndHighMed              151.22   (1.7%)     151.30   (2.3%)      0.0%  (  -3% -    4%)
LowSpanNear              20.08   (2.6%)      20.10   (2.9%)      0.1%  (  -5% -    5%)
Respell                  77.27   (3.8%)      77.36   (3.5%)      0.1%  (  -6% -    7%)
LowPhrase                42.32   (2.1%)      42.40   (1.9%)      0.2%  (  -3% -    4%)
HighPhrase               20.01   (4.1%)      20.06   (3.7%)      0.3%  (  -7% -    8%)
Wildcard                 46.20   (3.5%)      46.32   (3.9%)      0.3%  (  -6% -    7%)
HighSloppyPhrase         15.99   (5.1%)      16.04   (4.9%)      0.3%  (  -9% -   10%)
Prefix3                  43.21   (2.9%)      43.39   (3.1%)      0.4%  (  -5% -    6%)
MedPhrase               151.07   (3.4%)     151.69   (3.7%)      0.4%  (  -6% -    7%)
OrNotHighMed            151.21   (2.3%)     151.98   (2.6%)      0.5%  (  -4% -    5%)
AndHighHigh              58.73   (1.4%)      59.05   (1.4%)      0.5%  (  -2% -    3%)
MedSpanNear              22.36   (1.6%)      22.48   (1.6%)      0.6%  (  -2% -    3%)
IntNRQ                   13.75  (12.5%)      13.83  (13.1%)      0.6%  ( -22% -   29%)
OrHighNotMed             62.26   (2.7%)      62.70   (3.2%)      0.7%  (  -5% -    6%)
OrNotHighHigh            58.38   (2.6%)      58.82   (2.4%)      0.7%  (  -4% -    5%)
HighSpanNear             39.78   (2.2%)      40.09   (3.0%)      0.8%  (  -4% -    6%)
OrHighNotHigh            44.88   (2.8%)      45.29   (2.7%)      0.9%  (  -4% -    6%)
AndHighLow              694.25   (4.8%)     703.66   (3.8%)      1.4%  (  -6% -   10%)
OrHighLow                91.20   (3.4%)      92.54   (3.7%)      1.5%  (  -5% -    8%)
OrHighNotLow            105.90   (3.0%)     107.79   (4.4%)      1.8%  (  -5% -    9%)
Fuzzy1                   79.92  (12.3%)      81.61  (12.1%)      2.1%  ( -19% -   30%)
OrHighHigh               29.18   (7.2%)      29.83   (7.3%)      2.2%  ( -11% -   18%)
OrHighMed                19.44   (7.2%)      19.89   (7.3%)      2.3%  ( -11% -   18%)
TermTitleSort            81.70   (5.6%)      83.67   (5.8%)      2.4%  (  -8% -   14%)
LowTerm                 682.24   (4.5%)     704.58   (4.1%)      3.3%  (  -5% -   12%)
TermDTSort              103.25   (5.7%)     106.77   (4.0%)      3.4%  (  -5% -   13%)
MedTerm                 249.00   (2.5%)     260.56   (3.2%)      4.6%  (  -1% -   10%)
HighTerm                103.70   (3.2%)     109.27   (3.6%)      5.4%  (  -1% -   12%)
{noformat}

Note that the patch has barely any tests, so it's really just for playing. :) 
We'd also still need to define the semantics of this method.
