[jira] [Commented] (LUCENE-6690) Speed up MultiTermsEnum.next()

Martijn van Groningen (JIRA) Tue, 21 Jul 2015 03:10:31 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634897#comment-14634897
 ]


Martijn van Groningen commented on LUCENE-6690:
-----------------------------------------------

+1 this looks great. I test it out on an index with a single doc values field 
with 100M unique values and 42 segments and the rebuilding time dropped from 
~17000 ms to ~10800ms.  

> Speed up MultiTermsEnum.next()
> ------------------------------
>
>                 Key: LUCENE-6690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6690
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6690.patch, OrdinalMapBuildBench.java
>
>
> OrdinalMap is very useful when computing top terms on a multi-index segment. 
> However I've seen it being occasionally slow to build, which was either 
> making facets (when the ordinals map is computed lazily) or reopen (when 
> computed eagerly) slow. So out of curiosity, I tried to profile ordinal map 
> building on a simple index: 10M random strings of length between 0 and 20 
> stored as a SORTED doc values field. The index has 19 segments. The 
> bottleneck was MultiTermsEnum.next() (by far) due to lots of BytesRef 
> comparisons (UTF8SortedAsUnicodeComparator).
> MultiTermsEnum stores sub enums in two different places:
>  - top: a simple array containing all enums on the current term
>  - queue: a queue for enums that are not exhausted yet but beyond the current 
> term.
> A non-exhausted enum is in exactly one of these data-structures. When moving 
> to the next term, MultiTermsEnum advances all enums in {{top}}, then adds 
> them to {{queue}} and finally, pops all enum that are on the same term back 
> into {{top}}.
> We could save reorderings of the priority queue by not removing entries from 
> the priority queue and then calling updateTop to advance enums which are on 
> the current term. This is already what we do for disjunctions of doc IDs in 
> DISIPriorityQueue.
> On the index described above and current trunk, building an OrdinalMap has to 
> call UTF8SortedAsUnicodeComparator.compare 80114820 times and runs in 1.9 s. 
> With the change, it calls UTF8SortedAsUnicodeComparator.compare 36900694 
> times, BytesRef.equals 16297638 times and runs in 1.4s (~26% faster).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-6690) Speed up MultiTermsEnum.next()

Reply via email to