[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087429#comment-16087429 ]

ASF subversion and git services commented on LUCENE-7905:

Commit 3df97d3f0c0e558c52514a7e500afeffe96e795d in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3df97d3 ]

LUCENE-7905: optimize how OrdinalMap builds its map

> Optimizations for OrdinalMap
>
> Key: LUCENE-7905
> URL: https://issues.apache.org/jira/browse/LUCENE-7905
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 7.1
>
> Attachments: LUCENE-7905.patch, LUCENE-7905.patch, LUCENE-7905.patch, LUCENE-7905-specialized.patch
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to
> global space, but it's fairly costly to build, which must typically be done
> on every NRT refresh.
>
> I'm using it quite heavily in two different places, one for
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some
> small optimizations to improve its construction time.
>
> I switched it to use a simple priority queue to merge the terms instead of
> the more general {{MultiTermsEnum}}, which does extra work since it must also
> provide postings, implement seekExact, etc.
>
> I also pulled {{OrdinalMap}} out into its own oal.index class.
>
> When testing construction time for my case the patch is ~16% faster (159.9s
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in
> another case with 26.6 M terms.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
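For readers following the thread, here is a minimal sketch of the priority-queue merge that the issue description talks about. It is not the code from the patch: plain Java collections stand in for Lucene's TermsEnum and packed-ints structures, and the names GlobalOrdSketch, Cursor, and build are hypothetical. It also compares terms with String.compareTo, whereas Lucene orders terms by their bytes.

```java
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: none of these names come from the LUCENE-7905 patch.
final class GlobalOrdSketch {

  /** Cursor over one segment's sorted, de-duplicated terms (hypothetical helper). */
  private static final class Cursor {
    final int segment;
    final List<String> terms;
    int ord = 0; // per-segment ordinal of the term the cursor is positioned on

    Cursor(int segment, List<String> terms) {
      this.segment = segment;
      this.terms = terms;
    }

    String term() {
      return terms.get(ord);
    }
  }

  /**
   * Returns result[segment][segmentOrd] = globalOrd by merging every segment's
   * sorted terms through a single priority queue.
   */
  static long[][] build(List<List<String>> segments) {
    long[][] segmentToGlobal = new long[segments.size()][];
    PriorityQueue<Cursor> pq =
        new PriorityQueue<>((a, b) -> a.term().compareTo(b.term()));
    for (int i = 0; i < segments.size(); i++) {
      segmentToGlobal[i] = new long[segments.get(i).size()];
      if (!segments.get(i).isEmpty()) {
        pq.add(new Cursor(i, segments.get(i)));
      }
    }

    long globalOrd = 0;
    while (!pq.isEmpty()) {
      String current = pq.peek().term();
      // Pop every cursor positioned on the current term; they all share one global ord.
      while (!pq.isEmpty() && pq.peek().term().equals(current)) {
        Cursor top = pq.poll();
        segmentToGlobal[top.segment][top.ord] = globalOrd;
        top.ord++; // mutated only while the cursor is outside the queue
        if (top.ord < top.terms.size()) {
          pq.add(top); // re-insert, now positioned on its next term
        }
      }
      globalOrd++;
    }
    return segmentToGlobal;
  }
}
```

The point of the sketch is that the merge only needs "current term" and "advance" from each segment, which is much less than what a general-purpose merged terms enumeration has to support.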
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087430#comment-16087430 ]

ASF subversion and git services commented on LUCENE-7905:

Commit a0557cfef970780eff355a06f9fc39b9ecc6 in lucene-solr's branch refs/heads/branch_7x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a0557cf ]

LUCENE-7905: optimize how OrdinalMap builds its map
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085535#comment-16085535 ]

Robert Muir commented on LUCENE-7905:

Thanks Mike!
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085452#comment-16085452 ]

Dawid Weiss commented on LUCENE-7905:

It'd probably have to be stored per-leaf of the binary tree that the PQ actually is. And then you'd need to maintain those prefix counts when pushing elements up/down the tree... The associated bookkeeping may not be worth it.

That said, an algorithm for this has probably been invented. Back in the 70s. :D
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085437#comment-16085437 ]

Michael McCandless commented on LUCENE-7905:

bq. You could do it with radix-like sort, but you'd need all the terms from all the TermEnums; the pq has the advantage of scanning through them on the fly, progressively?

Exactly! I want a combined algorithm that uses the "streaming" that the PQ gives us and the "don't redundantly compare common prefixes over and over again" that radix sort gives us, because if you think about the comparisons the PQ is doing to maintain its heap structure, many of them are wasted on common prefixes.

The hard part is efficiently computing "all entries in the PQ now share a common prefix of 6" as heap entries are updated over time, but surely there is a way.

Though I suppose its gains would be limited in this usage (merge sorting terms from all segments), because the small/tiny segments would mess up the common prefixes, i.e. the common prefix would often be low or 0 because of them. But if you merge sorted, equal-sized segments, e.g. what happens when merging segments, then it could be powerful.
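To make the "wasted comparisons" point concrete, here is a small sketch of a comparison that skips a prefix already known to be shared. It does not solve the hard part described above (keeping the shared prefix length up to date as the heap changes); sharedPrefix below simply recomputes it by scanning every term. The class and method names are hypothetical and do not exist in Lucene.

```java
// Illustrative only: a naive stand-in for prefix-aware term comparison.
final class PrefixSkipSketch {

  /** Length of the longest common prefix of a and b. */
  static int commonPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  /**
   * Unsigned byte-wise comparison that trusts the caller's claim that the
   * first {@code skip} bytes of a and b are equal, and starts after them.
   */
  static int compareFrom(byte[] a, byte[] b, int skip) {
    int n = Math.min(a.length, b.length);
    for (int i = skip; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /** Naive recomputation of the prefix length shared by all terms (linear in the number of terms). */
  static int sharedPrefix(byte[][] terms) {
    if (terms.length == 0) {
      return 0;
    }
    int shared = terms[0].length;
    for (int i = 1; i < terms.length; i++) {
      shared = Math.min(shared, commonPrefix(terms[0], terms[i]));
    }
    return shared;
  }
}
```

With terms such as facet/color/blue and facet/color/red, every heap comparison could start at the first byte after facet/color/. A single short term from a tiny segment drives the shared length toward 0, which is the limitation noted in the comment.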
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085423#comment-16085423 ]

Dawid Weiss commented on LUCENE-7905:

bq. I do feel like how we compare terms in the PQ is inefficient, and we should be able to do something like what radix sort does, because at any given time, the terms in the queue likely share long common prefixes yet we keep inefficiently re-comparing those long common prefixes.

You could do it with radix-like sort, but you'd need all the terms from all the TermEnums; the pq has the advantage of scanning through them on the fly, progressively?
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085413#comment-16085413 ]

Michael McCandless commented on LUCENE-7905:

Good point @rcmuir; I'll improve the javadocs.
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085139#comment-16085139 ]

Robert Muir commented on LUCENE-7905:

I think it's good to pull OrdinalMap out of MultiDocValues into its own class, but I think it's a little trappy that it then has no warnings on it, just some cool-sounding javadocs. MultiDocValues still warns you twice:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/MultiDocValues.java#L40-L46

But I think we should have some kind of doc about the cost of this thing in OrdinalMap now that it's separated? The "tax" of multiple segments is real here; it makes it a hotspot, just like it is for blocktree term dictionaries at merge.
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084349#comment-16084349 ]

Robert Muir commented on LUCENE-7905:

{quote}
That's it, except I also changed:

while (segmentOrds[segmentIndex] <= segmentOrd) {
  ordDeltaBits[segmentIndex] |= delta;
  ordDeltas[segmentIndex].add(delta);
  segmentOrds[segmentIndex]++;
}
{quote}

OK, I have to look in more detail later. We should be a little careful because this class is also used by merging, and merging has a strange case that you won't encounter during manual construction: the case where there are "holes" in the ords (deleted ords when all documents containing that ord are merged away).

Might be best if we can test some of this directly...
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084281#comment-16084281 ]

Michael McCandless commented on LUCENE-7905:

bq. It's a bit tricky to see the diffs since the file got moved too, but basically it just replaces MultiTermsEnum with a standard PQ?

That's it, except I also changed:

{noformat}
while (segmentOrds[segmentIndex] <= segmentOrd) {
  ordDeltaBits[segmentIndex] |= delta;
  ordDeltas[segmentIndex].add(delta);
  segmentOrds[segmentIndex]++;
}
{noformat}

to:

{noformat}
assert segmentOrds[segmentIndex] <= segmentOrd;
do {
  ordDeltas[segmentIndex].add(delta);
  segmentOrds[segmentIndex]++;
} while (segmentOrds[segmentIndex] <= segmentOrd);
{noformat}

This should make the branch easier to predict (since the loop will always run the first time), but maybe the effect is negligible.

I think the cost we're saving from MTE is likely its {{TermMergeQueue.fillTop}} method? It's doing a lot of work, sort of recursing into the PQ, with an if inside a for inside a while, to find all subs on the current term, and then it has to do {{pushTop}} after that. In general MTE is not allowed to .next() the subs because it doesn't know if the caller will ask for postings on this term. [~rcmuir] suggested we could maybe make pullTop()/pushTop() lazy, which is a neat idea...
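As a sanity check on the loop rewrite quoted above, the following sketch runs both loop shapes over the same input and shows that they record the same deltas whenever the asserted precondition holds, including when the visited ordinals skip a value. It is a standalone simulation, not Lucene code: a List<Long> stands in for ordDeltas[segmentIndex], a local variable for segmentOrds[segmentIndex], and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: simulates one segment's delta bookkeeping with plain collections.
final class OrdDeltaLoopSketch {

  /** Original shape: the condition is tested before every iteration. */
  static List<Long> whileForm(long[] visitedOrds, long[] deltas) {
    List<Long> out = new ArrayList<>();
    long nextOrd = 0; // plays the role of segmentOrds[segmentIndex]
    for (int i = 0; i < visitedOrds.length; i++) {
      while (nextOrd <= visitedOrds[i]) {
        out.add(deltas[i]);
        nextOrd++;
      }
    }
    return out;
  }

  /** Rewritten shape: the first iteration is unconditional, guarded by a precondition check. */
  static List<Long> doWhileForm(long[] visitedOrds, long[] deltas) {
    List<Long> out = new ArrayList<>();
    long nextOrd = 0;
    for (int i = 0; i < visitedOrds.length; i++) {
      if (nextOrd > visitedOrds[i]) {
        // Corresponds to the assert in the quoted snippet.
        throw new IllegalStateException("precondition violated at index " + i);
      }
      do {
        out.add(deltas[i]);
        nextOrd++;
      } while (nextOrd <= visitedOrds[i]);
    }
    return out;
  }
}
```

For visited ordinals 0, 1, 3, 4 (ordinal 2 is skipped) both methods record five deltas, one of them for the skipped ordinal. The two shapes only differ when the precondition fails, which is why the rewrite adds the assert.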
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084284#comment-16084284 ]

Michael McCandless commented on LUCENE-7905:

bq. Maybe we should also check how much better things would be with a specialized priority queue too? As far as I remember, it helped a lot with disjunction scorers.

I like that idea! I'll look at what we did there and see if it can work here.

bq. Maybe we should decouple OrdinalMap and MultiTermsEnum entirely and give OrdinalMap its own TermsEnum+index wrapper?

+1, I'll do that.
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084134#comment-16084134 ]

Adrien Grand commented on LUCENE-7905:

I like the idea of switching to a simple priority queue to remove overhead. Maybe we should also check how much better things would be with a specialized priority queue too? As far as I remember, it helped a lot with disjunction scorers.

One minor thing that confuses me with the patch is that it moves off {{MultiTermsEnum}} but reuses {{MultiTermsEnum.TermsEnumIndex}} and adds an additional method to it. Maybe we should decouple {{OrdinalMap}} and {{MultiTermsEnum}} entirely and give {{OrdinalMap}} its own {{TermsEnum}}+index wrapper?
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084128#comment-16084128 ]

Robert Muir commented on LUCENE-7905:

It's a bit tricky to see the diffs since the file got moved too, but basically it just replaces MultiTermsEnum with a standard PQ?

Do we know why this is a speedup? Is it an inefficiency in MultiTermsEnum?
[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap
[ https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084136#comment-16084136 ]

Robert Muir commented on LUCENE-7905:

From your explanation that MTE must do more work because it "provides postings": it doesn't seem right to me that this would slow down the actual merging of terms. I can see the argument about seekExact, because next() has some special code to accommodate that:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/MultiTermsEnum.java#L287-L298