[ https://issues.apache.org/jira/browse/LUCENE-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-6690: --------------------------------- Attachment: OrdinalMapBuildBench.java Here is the benchmark I've been using. It's certainly not great but I don't think it's too bad either. :) > Speed up MultiTermsEnum.next() > ------------------------------ > > Key: LUCENE-6690 > URL: https://issues.apache.org/jira/browse/LUCENE-6690 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Assignee: Adrien Grand > Priority: Minor > Attachments: OrdinalMapBuildBench.java > > > OrdinalMap is very useful when computing top terms on a multi-index segment. > However I've seen it being occasionally slow to build, which was either > making facets (when the ordinals map is computed lazily) or reopen (when > computed eagerly) slow. So out of curiosity, I tried to profile ordinal map > building on a simple index: 10M random strings of length between 0 and 20 > stored as a SORTED doc values field. The index has 19 segments. The > bottleneck was MultiTermsEnum.next() (by far) due to lots of BytesRef > comparisons (UTF8SortedAsUnicodeComparator). > MultiTermsEnum stores sub enums in two different places: > - top: a simple array containing all enums on the current term > - queue: a queue for enums that are not exhausted yet but beyond the current > term. > A non-exhausted enum is in exactly one of these data-structures. When moving > to the next term, MultiTermsEnum advances all enums in {{top}}, then adds > them to {{queue}} and finally, pops all enum that are on the same term back > into {{top}}. > We could save reorderings of the priority queue by not removing entries from > the priority queue and then calling updateTop to advance enums which are on > the current term. This is already what we do for disjunctions of doc IDs in > DISIPriorityQueue. > On the index described above and current trunk, building an OrdinalMap has to > call UTF8SortedAsUnicodeComparator.compare 80114820 times and runs in 1.9 s. > With the change, it calls UTF8SortedAsUnicodeComparator.compare 36900694 > times, BytesRef.equals 16297638 times and runs in 1.4s (~26% faster). -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org