[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087429#comment-16087429
 ] 

ASF subversion and git services commented on LUCENE-7905:
-

Commit 3df97d3f0c0e558c52514a7e500afeffe96e795d in lucene-solr's branch 
refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3df97d3 ]

LUCENE-7905: optimize how OrdinalMap builds its map


> Optimizations for OrdinalMap
> 
>
> Key: LUCENE-7905
> URL: https://issues.apache.org/jira/browse/LUCENE-7905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 7.1
>
> Attachments: LUCENE-7905.patch, LUCENE-7905.patch, LUCENE-7905.patch, 
> LUCENE-7905-specialized.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to 
> global space, but it's fairly costly to build, which must typically be done 
> on every NRT refresh.
> I'm using it quite heavily in two different places, one for 
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some 
> small optimizations to improve its construction time.
> I switched it to use a simple priority queue to merge the terms instead of 
> the more general {{MultiTermsEnum}}, which does extra work since it must also 
> provide postings, implement seekExact, etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s 
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in 
> another case with 26.6 M terms.
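
The priority-queue merge described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class `GlobalOrdSketch`, the `Cursor` type, and the use of sorted `String[]` arrays in place of per-segment `TermsEnum`s are all assumptions made to keep the example self-contained. It also compares terms with `String.compareTo`, whereas Lucene orders terms by their bytes.

```java
import java.util.PriorityQueue;

/** Hypothetical stand-in for building a per-segment ord to global ord map. */
class GlobalOrdSketch {

  /** Cursor over one segment's sorted, de-duplicated terms (assumed stand-in for a TermsEnum plus its segment index). */
  static final class Cursor {
    final int segment;
    final String[] terms;
    int ord = 0; // current per-segment ordinal

    Cursor(int segment, String[] terms) {
      this.segment = segment;
      this.terms = terms;
    }

    String term() {
      return terms[ord];
    }
  }

  /** Returns a table where result[segment][segmentOrd] is the global ordinal of that term. */
  static long[][] build(String[][] segments) {
    long[][] segmentToGlobal = new long[segments.length][];
    PriorityQueue<Cursor> pq = new PriorityQueue<>(
        Math.max(1, segments.length),
        (a, b) -> a.term().compareTo(b.term()));
    for (int i = 0; i < segments.length; i++) {
      segmentToGlobal[i] = new long[segments[i].length];
      if (segments[i].length > 0) {
        pq.add(new Cursor(i, segments[i]));
      }
    }
    long globalOrd = 0;
    while (!pq.isEmpty()) {
      String current = pq.peek().term();
      // Drain every cursor positioned on the current smallest term: they all share one global ord.
      while (!pq.isEmpty() && pq.peek().term().equals(current)) {
        Cursor c = pq.poll();
        segmentToGlobal[c.segment][c.ord] = globalOrd;
        c.ord++;
        if (c.ord < c.terms.length) {
          pq.add(c); // re-insert, now positioned on its next (strictly larger) term
        }
      }
      globalOrd++;
    }
    return segmentToGlobal;
  }
}
```

Because the merge only ever needs the current term of each segment, nothing here has to support seeking or postings, which is the extra generality the description attributes to {{MultiTermsEnum}}.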



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087430#comment-16087430
 ] 

ASF subversion and git services commented on LUCENE-7905:
-

Commit a0557cfef970780eff355a06f9fc39b9ecc6 in lucene-solr's branch 
refs/heads/branch_7x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a0557cf ]

LUCENE-7905: optimize how OrdinalMap builds its map





[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085535#comment-16085535
 ] 

Robert Muir commented on LUCENE-7905:
-

thanks Mike!




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-13 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085452#comment-16085452
 ] 

Dawid Weiss commented on LUCENE-7905:
-

It'd probably have to be stored per leaf of the binary tree that the pq actually is. 
And then you'd need to maintain those prefix counts when pushing elements 
up/down the tree... The associated bookkeeping may not be worth it.

This said, an algorithm for this has probably been invented. Back in the 70s. :D




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085437#comment-16085437
 ] 

Michael McCandless commented on LUCENE-7905:


bq. You could do it with radix-like sort, but you'd need all the terms from all 
the TermEnums; the pq has the advantage of scanning through them on the fly, 
progressively?

Exactly!  I want a combined algorithm, that uses the "streaming" that the PQ 
gives us, and uses the "don't redundantly compare common prefixes over and over 
again" that radix sort gives us, because if you think about the comparisons 
that PQ is doing to maintain its heap structure, many of them are wasted on 
common prefixes.  The hard part is efficiently computing "all entries in the PQ 
now share a common prefix of 6" as heap entries are updated over time, but 
surely there is a way.

Though I suppose its gains would be limited in this usage (merge sorting terms 
from all segments), because the small/tiny segments would mess up the common 
prefixes, i.e. the common prefix would often be low or 0 because of them.  But 
if you merge sorted equal sized segments, e.g. what happens when merging 
segments, then it could be powerful.
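
To make the "wasted comparisons" point concrete, here is a small sketch of the two pieces involved: finding the prefix length shared by every term currently in the queue, and a comparison that starts after that prefix. The class and method names are hypothetical. Note that `commonPrefix` simply rescans all entries, which is exactly the naive approach; maintaining that value cheaply as heap entries change is the open problem discussed above and is not solved here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating prefix-skipping comparisons; not part of Lucene. */
class PrefixSkipSketch {

  /** Length of the longest prefix shared by all terms (0 for an empty array). Naive full rescan. */
  static int commonPrefix(byte[][] terms) {
    if (terms.length == 0) {
      return 0;
    }
    int len = terms[0].length;
    for (int i = 1; i < terms.length; i++) {
      int limit = Math.min(len, terms[i].length);
      int j = 0;
      while (j < limit && terms[0][j] == terms[i][j]) {
        j++;
      }
      len = j; // the shared prefix can only shrink as more terms are considered
    }
    return len;
  }

  /** Unsigned byte-wise comparison that skips the first {@code skip} bytes, which the caller guarantees are equal. */
  static int compareFrom(byte[] a, byte[] b, int skip) {
    int limit = Math.min(a.length, b.length);
    for (int i = skip; i < limit; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }
}
```

With a shared prefix of length p, every heap comparison done through `compareFrom(a, b, p)` saves p byte comparisons, so the benefit depends on how long the shared prefix stays, which is why small segments with unrelated terms would erode it.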




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-13 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085423#comment-16085423
 ] 

Dawid Weiss commented on LUCENE-7905:
-

bq. I do feel like how we compare terms in the PQ is inefficient, and we should 
be able to do something like what radix sort does, because at any given time, 
the terms in the queue likely share long common prefixes yet we keep 
inefficiently re-comparing those long common prefixes.

You could do it with radix-like sort, but you'd need all the terms from all the 
TermEnums; the pq has the advantage of scanning through them on the fly, 
progressively?




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085413#comment-16085413
 ] 

Michael McCandless commented on LUCENE-7905:


Good point @rcmuir; I'll improve the javadocs.




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085139#comment-16085139
 ] 

Robert Muir commented on LUCENE-7905:
-

I think it's good to pull out OrdinalMap into its own class from MultiDocValues, 
but I think it's a little trappy that it then has no warnings on it and just 
some cool sounding javadocs.

MultiDocValues still warns you twice: 
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/MultiDocValues.java#L40-L46

But I think we should have some kind of doc about the cost of this thing in 
OrdinalMap now that it's separated? The "tax" of multiple segments is real here, 
and it makes this a hotspot just like it is for blocktree term dictionaries at merge.




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084349#comment-16084349
 ] 

Robert Muir commented on LUCENE-7905:
-

{quote}
That's it, except I also changed:
  while (segmentOrds[segmentIndex] <= segmentOrd) {
    ordDeltaBits[segmentIndex] |= delta;
    ordDeltas[segmentIndex].add(delta);
    segmentOrds[segmentIndex]++;
  }
{quote}

OK, I have to look in more detail later. We should be a little careful because 
this class is also used by merging, and merging has a strange case that you 
won't encounter during manual construction: the case where there are "holes" in 
the ords (deleted ords when all documents containing that ord are merged away). 
Might be best if we can test some of this directly...




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084281#comment-16084281
 ] 

Michael McCandless commented on LUCENE-7905:


bq. It's a bit tricky to see the diffs since the file got moved too, but 
basically it just replaces MultiTermsEnum with a standard PQ?

That's it, except I also changed:

{noformat}
  while (segmentOrds[segmentIndex] <= segmentOrd) {
    ordDeltaBits[segmentIndex] |= delta;
    ordDeltas[segmentIndex].add(delta);
    segmentOrds[segmentIndex]++;
  }
{noformat}

to:

{noformat}
assert segmentOrds[segmentIndex] <= segmentOrd;
do {
  ordDeltas[segmentIndex].add(delta);
  segmentOrds[segmentIndex]++;
} while (segmentOrds[segmentIndex] <= segmentOrd);
{noformat}

Which should make the branch easier to predict (since the loop will always run 
the first time), but maybe the effect is negligible.

I think likely the cost we're saving from MTE is its {{TermMergeQueue.fillTop}} 
method?  It's doing a lot of work, sort of recursing into the PQ, with an if 
inside a for inside a while, to find all subs on the current term, and then it 
has to do {{pushTop}} after that.  In general MTE is not allowed to .next() the 
subs because it doesn't know if the caller will ask for postings on this term.  
[~rcmuir] suggested we could maybe make pullTop()/pushTop() lazy which is a 
neat idea...
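
The loop change quoted above can be checked in isolation. The sketch below assumes, from reading the two snippets, that `segmentOrds[segmentIndex]` holds the next ordinal not yet recorded for the segment and that `segmentOrd` is the ordinal just read, so the loop runs more than once only when some ordinals were skipped. The class and method names are hypothetical, a `List<Long>` stands in for the `ordDeltas` buffer, and the `ordDeltaBits` update is left out because the quoted replacement does not show it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical side-by-side of the two loop shapes; not the code from the patch. */
class OrdDeltaLoopSketch {

  /** Original shape: the condition is tested before the first iteration. Returns the updated next ordinal. */
  static long fillWhile(List<Long> deltas, long nextOrd, long segmentOrd, long delta) {
    while (nextOrd <= segmentOrd) {
      deltas.add(delta);
      nextOrd++;
    }
    return nextOrd;
  }

  /** Patched shape: the precondition guarantees at least one iteration, so the body runs before the test. */
  static long fillDoWhile(List<Long> deltas, long nextOrd, long segmentOrd, long delta) {
    if (nextOrd > segmentOrd) {
      // Plays the role of the assert in the patch; thrown unconditionally here so it is visible without -ea.
      throw new IllegalArgumentException("nextOrd must be <= segmentOrd");
    }
    do {
      deltas.add(delta);
      nextOrd++;
    } while (nextOrd <= segmentOrd);
    return nextOrd;
  }

  /** Convenience for experiments: runs one variant on a fresh buffer and returns the recorded deltas. */
  static List<Long> run(boolean useDoWhile, long nextOrd, long segmentOrd, long delta) {
    List<Long> deltas = new ArrayList<>();
    if (useDoWhile) {
      fillDoWhile(deltas, nextOrd, segmentOrd, delta);
    } else {
      fillWhile(deltas, nextOrd, segmentOrd, delta);
    }
    return deltas;
  }
}
```

Under that precondition the two shapes record the same deltas, including when a gap in the ordinals forces several iterations; they differ only when the precondition is violated, where the `while` form silently does nothing.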




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084284#comment-16084284
 ] 

Michael McCandless commented on LUCENE-7905:


bq. Maybe we should also check how much better things would be with a 
specialized priority queue too? As far as I remember, it helped a lot with 
disjunction scorers.

I like that idea!  I'll look at what we did there and see if it can work here.

bq. Maybe we should decouple OrdinalMap and MultiTermsEnum entirely and give 
OrdinalMap its own TermsEnum+index wrapper?

+1, I'll do that.




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084134#comment-16084134
 ] 

Adrien Grand commented on LUCENE-7905:
--

I like the idea of switching to a simple priority queue to remove overhead. 
Maybe we should also check how much better things would be with a specialized 
priority queue too? As far as I remember, it helped a lot with disjunction 
scorers.

One minor thing that confuses me with the patch is that it moves off 
{{MultiTermsEnum}} but reuses {{MultiTermsEnum.TermsEnumIndex}} and adds an 
additional method to it. Maybe we should decouple {{OrdinalMap}} and  
{{MultiTermsEnum}} entirely and give {{OrdinalMap}} its own {{TermsEnum}}+index 
wrapper?




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084128#comment-16084128
 ] 

Robert Muir commented on LUCENE-7905:
-

It's a bit tricky to see the diffs since the file got moved too, but basically 
it just replaces MultiTermsEnum with a standard PQ?

Do we know why this is a speedup? Is it an inefficiency in MultiTermsEnum?




[jira] [Commented] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084136#comment-16084136
 ] 

Robert Muir commented on LUCENE-7905:
-

From your explanation that MTE must do more work because it "provides postings": 
it doesn't seem right to me that this would slow down the actual merging of 
terms. I can see the argument about seekExact because next() has some special 
code to accommodate that:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/MultiTermsEnum.java#L287-L298

