[jira] [Updated] (LUCENE-6690) Speed up MultiTermsEnum.next()

Adrien Grand (JIRA) Mon, 20 Jul 2015 09:12:25 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-6690:
---------------------------------
    Attachment: OrdinalMapBuildBench.java

Here is the benchmark I've been using. It's certainly not great but I don't 
think it's too bad either. :)

> Speed up MultiTermsEnum.next()
> ------------------------------
>
>                 Key: LUCENE-6690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6690
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: OrdinalMapBuildBench.java
>
>
> OrdinalMap is very useful when computing top terms on a multi-index segment. 
> However I've seen it being occasionally slow to build, which was either 
> making facets (when the ordinals map is computed lazily) or reopen (when 
> computed eagerly) slow. So out of curiosity, I tried to profile ordinal map 
> building on a simple index: 10M random strings of length between 0 and 20 
> stored as a SORTED doc values field. The index has 19 segments. The 
> bottleneck was MultiTermsEnum.next() (by far) due to lots of BytesRef 
> comparisons (UTF8SortedAsUnicodeComparator).
> MultiTermsEnum stores sub enums in two different places:
>  - top: a simple array containing all enums on the current term
>  - queue: a queue for enums that are not exhausted yet but beyond the current 
> term.
> A non-exhausted enum is in exactly one of these data-structures. When moving 
> to the next term, MultiTermsEnum advances all enums in {{top}}, then adds 
> them to {{queue}} and finally, pops all enum that are on the same term back 
> into {{top}}.
> We could save reorderings of the priority queue by not removing entries from 
> the priority queue and then calling updateTop to advance enums which are on 
> the current term. This is already what we do for disjunctions of doc IDs in 
> DISIPriorityQueue.
> On the index described above and current trunk, building an OrdinalMap has to 
> call UTF8SortedAsUnicodeComparator.compare 80114820 times and runs in 1.9 s. 
> With the change, it calls UTF8SortedAsUnicodeComparator.compare 36900694 
> times, BytesRef.equals 16297638 times and runs in 1.4s (~26% faster).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-6690) Speed up MultiTermsEnum.next()

Reply via email to