Re: ConcurrentMergeScheduler and MergePolicy question

Jason Rutherglen Thu, 30 Jul 2009 11:08:20 -0700

If the app is creating many small segments, LUCENE-1313 will help
by keeping them in RAM until they grow too large. The small
segments are merged into one large segment on the way from RAM to
disk, so disk -> disk merging is faster because it only handles
larger segments. IW will not pause while writing the (hopefully
fairly large) buffer out to disk, whereas today it does, and that
delay is similar to the one caused by merging large segments.
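
A toy model may make the RAM-first idea concrete. The sketch below is
not LUCENE-1313 code: the class RamSegmentBuffer, its ramBudgetBytes
threshold, and the byte sizes are all made up for illustration. It only
shows that many small in-RAM segments reach disk as a few large ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Lucene code: small segments pile up in RAM and reach disk as one merged segment. */
public class RamSegmentBuffer {
    private final long ramBudgetBytes;                       // hypothetical flush threshold
    private final List<Long> ramSegments = new ArrayList<>();
    private final List<Long> diskSegments = new ArrayList<>();
    private long ramBytes = 0;

    public RamSegmentBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    /** Adds a newly created small segment; flushes once the RAM budget is reached. */
    public void addSegment(long sizeBytes) {
        ramSegments.add(sizeBytes);
        ramBytes += sizeBytes;
        if (ramBytes >= ramBudgetBytes) {
            flush();
        }
    }

    /** Merges every RAM segment into one and records it as a single disk segment. */
    public void flush() {
        if (ramSegments.isEmpty()) {
            return;
        }
        diskSegments.add(ramBytes);
        ramSegments.clear();
        ramBytes = 0;
    }

    public List<Long> diskSegments() {
        return diskSegments;
    }

    public int ramSegmentCount() {
        return ramSegments.size();
    }

    public static void main(String[] args) {
        RamSegmentBuffer buffer = new RamSegmentBuffer(1000);
        for (int i = 0; i < 25; i++) {
            buffer.addSegment(100);
        }
        System.out.println("disk segments: " + buffer.diskSegments());
        System.out.println("segments still in RAM: " + buffer.ramSegmentCount());
    }
}
```

With a 1000-byte budget and 25 segments of 100 bytes each, this leaves
two 1000-byte segments on disk and five small ones still in RAM.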


In lieu of the last options mentioned in Mike's email, merging
of large segments can be performed offline to avoid tying up the
IO subsystem and CPU, meaning on a dedicated merge server such
as a Solr master or a Hadoop/Katta cluster, rather than on the
NRT search server.
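
One way to picture the split between online and offline merging is the
sketch below. It is an illustration only, not a Lucene MergeScheduler:
DeferredMergeQueue, maxOnlineMergeBytes, and the sizes are hypothetical,
and a "merge" here just sums segment sizes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Lucene code: run small merges now, park large ones for an offline pass. */
public class DeferredMergeQueue {
    private final long maxOnlineMergeBytes;                  // hypothetical cap for merges on the search server
    private final List<List<Long>> deferred = new ArrayList<>();

    public DeferredMergeQueue(long maxOnlineMergeBytes) {
        this.maxOnlineMergeBytes = maxOnlineMergeBytes;
    }

    /** Returns the merged size if the merge ran immediately, or -1 if it was deferred. */
    public long submit(List<Long> segmentSizes) {
        long total = sum(segmentSizes);
        if (total > maxOnlineMergeBytes) {
            deferred.add(new ArrayList<>(segmentSizes));     // too big: leave it for later
            return -1;
        }
        return total;
    }

    /** Meant for idle hours or a dedicated merge server: drains and "runs" the parked merges. */
    public List<Long> runDeferred() {
        List<Long> mergedSizes = new ArrayList<>();
        for (List<Long> merge : deferred) {
            mergedSizes.add(sum(merge));
        }
        deferred.clear();
        return mergedSizes;
    }

    public int deferredCount() {
        return deferred.size();
    }

    private static long sum(List<Long> sizes) {
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        DeferredMergeQueue queue = new DeferredMergeQueue(500);
        List<Long> small = new ArrayList<>();
        small.add(100L);
        small.add(200L);
        List<Long> large = new ArrayList<>();
        large.add(400L);
        large.add(400L);
        System.out.println("small merge result: " + queue.submit(small));
        System.out.println("large merge result: " + queue.submit(large));
        System.out.println("deferred merges run later: " + queue.runDeferred());
    }
}
```

The design choice here is that the NRT server never pays for a merge
above the cap; whether the parked work runs locally at quiet times or
on another machine is left open.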

On Thu, Jul 30, 2009 at 9:16 AM, Michael
McCandless<luc...@mikemccandless.com> wrote:
> The merge selection (LogMergePolicy) tries to merge "roughly" equal
> sized (measured in bytes) segments together, so it creates a "roughly"
> log-staircase pattern.
>
> I agree, in an NRT app, larger mergeFactor is likely best since it
> minimizes reopen time overall.  It's also important to
> setMergedSegmentWarmer so a newly merged segment is warmed (in the
> background, with CMS) before being returned in a reopened NRT reader.
> And making a custom MergeScheduler that defers big merges until "after
> hours" should work well too...
>
> On the impact of search performance for large vs small mergeFactors, I
> think the jury is still out.  People should keep testing that (and
> report back!).  Certainly, for the fastest reopen time you never want
> any merging to be done :)
>
> I think there are a number of good merge improvements in flight right
> now:
>
>  * LUCENE-1750: limiting the max size of the merged segment
>
>  * LUCENE-1076: allow merge policy to select non-contiguous segments
>
>  * LUCENE-1737: always bulk-copy when merging -- the bulk copy
>    optimization makes merging the doc stores much faster now, but
>    it's a brittle optimization since it's sensitive to exactly which
>    fields, and in what order, you add to your docs
>
> Other things we've talked about but no issues yet:
>
>  * Deprioritize all IO associated w/ merging.  Java/OS doesn't
>    give us good support for this so I think we'd have to somehow
>    emulate in Lucene, at the Directory level.
>
>  * Don't let the IO from merging wipe the OS's IO cache.  For this we
>    need to access madvise/posix_fadvise, which we don't have from
>    javaland, so I think we'd need an OS dependent, optional JNI
>    extension to do this.
>
> Mike
>
> On Thu, Jul 30, 2009 at 10:56 AM, Shai Erera<ser...@gmail.com> wrote:
>> I think that when LUCENE-1750 is finished, you will be able to:
>>
>> 1) Create a MergePolicy that limits the size of the segments it's about to
>> merge to a certain size.
>> 2) Then have a daemon or something that runs at "idle" times and calls
>> optimize(maxNumSegments), or even opens a new writer w/ the default merge
>> policy and allows it to merge?
>>
>> Shai
>>
>> On Thu, Jul 30, 2009 at 5:48 PM, Grant Ingersoll <gsing...@apache.org>
>> wrote:
>>>
>>> Note also response from Mike that talks a little bit about something along
>>> these lines:
>>> http://www.lucidimagination.com/search/document/fa990adba4d2572b/is_there_a_way_to_control_when_merges_happen#f6f0bfeef4bf9a39
>>>
>>> -Grant
>>>
>>> On Jul 30, 2009, at 10:35 AM, Grant Ingersoll wrote:
>>>
>>>> Given a large segment and a bunch of small segments, how does the
>>>> ConcurrentMergeScheduler (CMS) work?  Does it always merge the smaller
>>>> segments into the bigger one, or does it merge the smaller segments
>>>> together?
>>>>
>>>> Something I've been thinking about:  Given a high update environment (and
>>>> near real time, less than 1 minute, search constraints) and/or a very bursty
>>>> environment, we've always said to keep the merge factor small for search
>>>> reasons, at least in the high-update case.  However, I've seen a couple of
>>>> times where this causes problems because merges can take over and cause
>>>> pauses, even with CMS, so I am wondering if it makes sense to have a larger
>>>> merge factor (>10), knowing that I may have a few large segments and then a
>>>> bunch of small ones and that the CMS will, in the background, be able to
>>>> keep merging the smaller segments together and in most cases avoid ever
>>>> having to merge into the large segments (b/c maybe I can just optimize down
>>>> at slower times or even merge larger segments later).  Seems like this
>>>> would allow one to make sure larger merges need not take place, or at least
>>>> reduce the chances of that happening.
>>>>
>>>> Not sure if I worded that correctly.
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
