[ https://issues.apache.org/jira/browse/LUCENE-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193402#comment-15193402 ]

Michael McCandless commented on LUCENE-7101:
--------------------------------------------

To get log(N) behavior, each merge needs to combine segments that are 
"close" to the same size...

E.g., say you are sorting "plenty" of data, with only a 16 MB heap buffer to 
work with.

Today, we will write 128 (default merge factor) 16 MB segments, and then merge 
those down to a single 2 GB (= 128 * 16 MB) segment.  Next we write another 127 
16 MB segments, and merge the 2 GB segment plus those 127 16 MB segments into a 
3.98 GB segment.  Then we write another 127 16 MB segments and merge again, 
into a 5.97 GB segment, etc.
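
A tiny simulation makes the copying cost concrete.  This is only a sketch,
not OfflineSorter's actual code: the class and variable names are made up,
and the final forceMerge(1) is ignored since it adds just one more linear
pass.

// Sketch: simulate today's merge pattern (merge factor 128, 16 MB segments)
// and count the MB copied by merging.
public class MergeCostSim {
  public static void main(String[] args) {
    final int mergeFactor = 128;
    final long segmentMB = 16;
    final long totalInputMB = 100 * 1024;  // pretend we sort ~100 GB

    long copiedMB = 0;        // MB rewritten by merges so far
    long bigSegmentMB = 0;    // the single, ever-growing merged segment
    int pendingSegments = 0;  // 16 MB segments written since the last merge

    for (long written = 0; written < totalInputMB; written += segmentMB) {
      pendingSegments++;
      // The big merged segment counts as one of the 128, so after the
      // first round only 127 new segments trigger the next merge:
      int segmentCount = pendingSegments + (bigSegmentMB > 0 ? 1 : 0);
      if (segmentCount == mergeFactor) {
        long mergedMB = bigSegmentMB + pendingSegments * segmentMB;
        copiedMB += mergedMB;  // every byte written so far is copied again
        bigSegmentMB = mergedMB;
        pendingSegments = 0;
      }
    }
    System.out.println("MB copied by merging: " + copiedMB);
  }
}

For the ~100 GB input above this prints 2591600 MB copied: merging rewrites
roughly 25x the input data.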

So the net number of MB copied by merging will be something like:

  (128 + 0*127)*16 + (128 + 1*127)*16 + (128 + 2*127)*16 + (128 + 3*127)*16 + ...

The number of terms is linear vs your input size (roughly the input data 
divided by 1.98 GB, since each merge round after the first consumes 127 * 16 
MB of new data).

But the total grows as O(N^2):

With T merge rounds, the sum rewrites to:

  16 * (128 + 0*127) + 16 * (128 + 1*127) + ... + 16 * (128 + (T-1)*127)

and, collecting terms:

  16 * 128 * T + 16 * 127 * (0 + 1 + 2 + 3 + ... + (T-1))

where T is linear vs your input data size.
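
Spelling that sum out in closed form (a quick LaTeX aside; T is the number
of merge rounds):

  \sum_{k=0}^{T-1} 16\,(128 + 127k)
      = 16 \cdot 128 \cdot T + 16 \cdot 127 \cdot \frac{T(T-1)}{2}
      = \Theta(T^2)

so the MB copied grow quadratically in T.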

So this bug only bites if you sort enough data, relative to your allowed heap 
usage, to require many merge rounds; the too-large default merge factor of 128 
"hides" the problem for most use cases.


> OfflineSorter's merging is O(N^2) cost for large sorts
> ------------------------------------------------------
>
>                 Key: LUCENE-7101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7101
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Blocker
>             Fix For: master, 6.0
>
>
> Our {{OfflineSorter}} acts just like Lucene, writing small initial
> segments of sorted values (from what it was able to sort at once in
> heap), periodically merging them when there are too many, and doing a
> {{forceMerge(1)}} in the end.
> But the merge logic is too simplistic today, resulting in O(N^2)
> cost.  Smallish sorts usually won't hit it, because the default 128
> merge factor is so high, but e.g. the new 2B points tests do hit the
> N^2 behavior.  I suspect the high merge factor hurts performance (the OS
> struggles to use what free RAM it has for read-ahead across 128 open
> files), and also risks file descriptor exhaustion.
> I think we should implement a simple log merge policy for it, and drop
> its default merge factor to 10.
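
For reference, here is one way such a log merge policy behaves, simulated on
segment sizes alone.  MERGE_FACTOR = 10 follows the proposal above; everything
else (class name, structure) is illustrative, not the actual patch:

import java.util.ArrayList;
import java.util.List;

// Sketch: merge whenever the newest MERGE_FACTOR segments have equal size,
// so merges stay balanced and each byte is copied O(log N) times.
public class LogMergeSim {
  static final int MERGE_FACTOR = 10;

  public static void main(String[] args) {
    List<Long> segments = new ArrayList<>();  // sizes in MB, newest last
    final long segmentMB = 16;
    final long totalInputMB = 100 * 1024;     // same ~100 GB as above
    long copiedMB = 0;

    for (long written = 0; written < totalInputMB; written += segmentMB) {
      segments.add(segmentMB);
      while (tailIsMergeable(segments)) {
        long mergedMB = 0;
        for (int i = 0; i < MERGE_FACTOR; i++) {
          mergedMB += segments.remove(segments.size() - 1);
        }
        copiedMB += mergedMB;
        segments.add(mergedMB);  // may cascade into the next level
      }
    }
    System.out.println(segments.size() + " segments left; MB copied by merging: " + copiedMB);
  }

  // True if the last MERGE_FACTOR segments all have the same size.
  static boolean tailIsMergeable(List<Long> segments) {
    int n = segments.size();
    if (n < MERGE_FACTOR) return false;
    long last = segments.get(n - 1);
    for (int i = n - MERGE_FACTOR; i < n; i++) {
      if (segments.get(i) != last) return false;
    }
    return true;
  }
}

This prints 10 segments left and 300800 MB copied by merging: each byte is
rewritten once per level, about log_10(inputSize / segmentMB) times, so merge
I/O here is ~3x the input instead of ~25x.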


