[
https://issues.apache.org/jira/browse/LUCENE-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6690:
---------------------------------
Attachment: OrdinalMapBuildBench.java
Here is the benchmark I've been using. It's certainly not great but I don't
think it's too bad either. :)
> Speed up MultiTermsEnum.next()
> ------------------------------
>
> Key: LUCENE-6690
> URL: https://issues.apache.org/jira/browse/LUCENE-6690
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: OrdinalMapBuildBench.java
>
>
> OrdinalMap is very useful when computing top terms on a multi-index segment.
> However I've seen it being occasionally slow to build, which was either
> making facets (when the ordinals map is computed lazily) or reopen (when
> computed eagerly) slow. So out of curiosity, I tried to profile ordinal map
> building on a simple index: 10M random strings of length between 0 and 20
> stored as a SORTED doc values field. The index has 19 segments. The
> bottleneck was MultiTermsEnum.next() (by far) due to lots of BytesRef
> comparisons (UTF8SortedAsUnicodeComparator).
> MultiTermsEnum stores sub enums in two different places:
> - top: a simple array containing all enums on the current term
> - queue: a queue for enums that are not exhausted yet but beyond the current
> term.
> A non-exhausted enum is in exactly one of these data-structures. When moving
> to the next term, MultiTermsEnum advances all enums in {{top}}, then adds
> them to {{queue}} and finally, pops all enum that are on the same term back
> into {{top}}.
> We could save reorderings of the priority queue by not removing entries from
> the priority queue and then calling updateTop to advance enums which are on
> the current term. This is already what we do for disjunctions of doc IDs in
> DISIPriorityQueue.
> On the index described above and current trunk, building an OrdinalMap has to
> call UTF8SortedAsUnicodeComparator.compare 80114820 times and runs in 1.9 s.
> With the change, it calls UTF8SortedAsUnicodeComparator.compare 36900694
> times, BytesRef.equals 16297638 times and runs in 1.4s (~26% faster).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]