Jason: I am not sure which "parameters" you are referring to either. Are you responding to the right email?
Anyhoot, I used the defaults for both MergePolicies. LogMergePolicy.setCalibrateSizeByDeletes was a contribution from us, taken out of ZMP, to normalize segment size using deleted doc counts, so it was part of ZMP. The idea with ZMP is to have a set of balanced-sized segments instead of one large segment (as I have been repeatedly describing on this email thread). To get this balance, we represent every point before the merge as a state modeled in a Viterbi algorithm, with a cost function for each type of merge; this is used to select the desired segment to merge. I hate to hijack a Lucene thread to discuss Zoie; feel free to post questions on the Zoie group for details.

-John

On Wed, Sep 23, 2009 at 1:56 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> John,
>
> I have a few questions in order to better understand, as the
> wiki does not reflect the entirety of what you're trying to
> describe.
>
> > But it is required to set up several parameters carefully to
> > get desired behavior.
>
> Which parameters are you referring to?
>
> What were the ZMP parameters used for the test?
>
> What was the number of CMS threads?
>
> It would be helpful to see a time-based table of the data used
> to generate the chart at the bottom, with the segment infos at
> regular intervals.
>
> What is the difference between how ZMP and
> LogMergePolicy.setCalibrateSizeByDeletes handle deletes?
>
> Are the queries using Zoie or Lucene's index searcher?
>
> Can you explain why the Viterbi algorithm was used and how it
> works in this context?
>
> -J
>
> On Mon, Sep 21, 2009 at 6:35 PM, John Wang <john.w...@gmail.com> wrote:
> > Jason:
> >
> > You are missing the point.
> >
> > The idea is to avoid merging of large segments. The point of this
> > MergePolicy is to balance segment merges across the index. The aim is
> > not to have one large segment; it is to have n segments with balanced
> > sizes.
> >
> > When the large segment is out of the IO cache, replacing it is very
> > costly.
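The cost-function selection John describes can be sketched roughly as follows. This is a hypothetical illustration of cost-driven merge selection over segment sizes, not Zoie's actual implementation (which models merge states with a Viterbi search; see the Zoie wiki for details); the cost function and class names here are invented for the sketch.

```java
// Hypothetical sketch: score every contiguous window of segments with a
// cost function and merge the cheapest window. The cost charges the total
// bytes rewritten, penalized by imbalance within the window, so the
// selector prefers merging small, similarly sized segments and avoids
// rewriting one huge segment over and over.
public class CostBasedMergeSelector {

    static double mergeCost(long[] sizes, int from, int to) {
        long total = 0, max = 0;
        for (int i = from; i < to; i++) {
            total += sizes[i];
            max = Math.max(max, sizes[i]);
        }
        // imbalance is 1.0 for perfectly balanced windows, larger otherwise
        double imbalance = (double) max * (to - from) / total;
        return total * imbalance;
    }

    // Returns {from, to} of the cheapest window of `width` adjacent segments.
    static int[] selectMerge(long[] sizes, int width) {
        double best = Double.MAX_VALUE;
        int[] pick = null;
        for (int from = 0; from + width <= sizes.length; from++) {
            double c = mergeCost(sizes, from, from + width);
            if (c < best) {
                best = c;
                pick = new int[] { from, from + width };
            }
        }
        return pick;
    }

    public static void main(String[] args) {
        // One huge segment followed by small ones: the selector skips the
        // large segment and merges the small tail instead.
        long[] sizes = { 4_000_000_000L, 60_000_000L, 50_000_000L, 55_000_000L };
        int[] pick = selectMerge(sizes, 3);
        System.out.println(pick[0] + "," + pick[1]); // picks segments 1..3
    }
}
```

A real policy would recompute this after every flush, which is how the index converges toward n balanced segments rather than one large one.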
> > What we have done is to split the cost over time by having more
> > frequent but faster merges.
> >
> > I am not suggesting Lucene's default MergePolicy isn't good; it is just
> > not suitable for our case, where high update rates introduce tons of
> > deletes. The fact that the API is nice enough to allow MergePolicies to
> > be plugged in is a good thing.
> >
> > Please DO read the wiki.
> >
> > -John
> >
> > On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> > <jason.rutherg...@gmail.com> wrote:
> >>
> >> I'm not sure I communicated the idea properly. If CMS is set to
> >> 1 thread, then no matter how CPU intensive a merge is, it's
> >> limited to 1 core of what is in many cases a 4 or 8 core server.
> >> That leaves the other 3 or 7 cores for queries, which, if slow,
> >> indicates that it isn't the merging that's slowing down queries,
> >> but the dumping of the queried segments from the system IO cache.
> >>
> >> This holds true regardless of the merge policy used. So while a
> >> new merge policy sounds great, unless the system IO cache
> >> problem is solved, there will always be a lingering problem with
> >> large merges on a regularly updated index. Avoiding
> >> large merges probably isn't the answer, and
> >> LogByteSizeMergePolicy already somewhat allows managing the size
> >> of the segments merged. I would personally prefer being able to
> >> merge segments up to a given estimated size, which requires
> >> LUCENE-1076 to do well.
> >>
> >> > is rather different from Lucene benchmark as we are testing
> >> > high updates in a realtime environment
> >>
> >> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
> >> place to start.
> >>
> >> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.w...@gmail.com> wrote:
> >> > Jason:
> >> >
> >> > Before jumping to any conclusions, let me describe the test setup.
> >> > It is rather different from the Lucene benchmark, as we are testing
> >> > high updates in a realtime environment:
> >> >
> >> > We took a public corpus, Medline, and indexed approximately 3 million
> >> > docs. We then updated all the docs over and over again for a 10 hour
> >> > duration.
> >> >
> >> > The only differences in the code used were where the different
> >> > MergePolicy settings were applied.
> >> >
> >> > To take the variable of HW/OS out of the equation, let's ignore the
> >> > absolute numbers and compare the relative numbers between the two runs.
> >> >
> >> > The spike is due to merging of a large segment when we accumulate. The
> >> > graph/perf numbers fit our hypothesis that the default MergePolicy
> >> > chooses to merge small segments before large ones and does not handle
> >> > segments with a high number of deletes well.
> >> >
> >> > Merging is BOTH IO and CPU intensive, especially large merges.
> >> >
> >> > I think the wiki explains it pretty well.
> >> >
> >> > What you are saying is true about the IO cache w.r.t. merging. Every
> >> > time new files are created, old files in the IO cache are invalidated.
> >> > As the experiment shows, this is detrimental to query performance when
> >> > large segments are being merged.
> >> >
> >> > "As we move to a sharded model of indexes, large merges will
> >> > naturally not occur." Our test is on a 3 million document index, not
> >> > very large for a single shard. Some Katta people have run it on a much,
> >> > much larger index per shard. Saying large merges will not occur on
> >> > indexes of this size is, IMHO, unfounded.
> >> >
> >> > -John
> >> >
> >> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> >> > <jason.rutherg...@gmail.com> wrote:
> >> >>
> >> >> John,
> >> >>
> >> >> It would be great if Lucene's benchmark were used so everyone
> >> >> could execute the test in their own environment and verify.
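The delete-handling behavior under discussion can be illustrated with the size calibration that setCalibrateSizeByDeletes enables. This is a conceptual sketch of the idea (pro-rating a segment's size by its live-doc ratio), not Lucene's exact LogMergePolicy code:

```java
// Conceptual sketch of calibrating segment size by deletes: a segment's
// effective size is discounted by the fraction of its documents that are
// deleted. A large segment dominated by deletes then looks small to the
// merge policy, so it competes with genuinely small segments for merging
// and its deleted docs get reclaimed instead of lingering.
public class CalibratedSize {

    static long calibratedSize(long sizeInBytes, int docCount, int delCount) {
        if (docCount <= 0) {
            return sizeInBytes;
        }
        double delRatio = (double) delCount / docCount;
        return (long) (sizeInBytes * (1.0 - delRatio));
    }

    public static void main(String[] args) {
        // A 1 GB segment where 75% of the docs are deleted is treated as
        // roughly a 256 MB segment, making it eligible for merging.
        System.out.println(calibratedSize(1_073_741_824L, 3_000_000, 2_250_000));
    }
}
```

Without such calibration, a size-based policy keeps deferring the merge of a large, delete-heavy segment, which matches the spike behavior described above for high-update workloads.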
> >> >> It's not clear what settings or code were used to generate the
> >> >> results, so it's difficult to draw any reliable conclusions.
> >> >>
> >> >> The steep spike shows greater evidence for the IO cache being
> >> >> cleared during large merges, resulting in search performance
> >> >> degradation. See:
> >> >> http://www.lucidimagination.com/search/?q=madvise
> >> >>
> >> >> Merging is IO intensive and less CPU intensive, but if the
> >> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> >> >> then the CPU could be maxed out. Using a single thread on
> >> >> synchronous spinning magnetic media seems more logical. Queries
> >> >> are usually the inverse: CPU intensive, not IO intensive, when
> >> >> the index is in the IO cache. After merging a large segment (or
> >> >> during), queries would start hitting disk, and the results
> >> >> clearly show that. The queries are suddenly more time consuming
> >> >> as they seek on disk at a time when IO activity is at its peak
> >> >> from merging large segments. Using madvise would prevent usable
> >> >> indexes from being swapped to disk during a merge, and query
> >> >> performance would continue unabated.
> >> >>
> >> >> As we move to a sharded model of indexes, large merges will
> >> >> naturally not occur. Shards will reach a specified size and new
> >> >> documents will be sent to new shards.
> >> >>
> >> >> -J
> >> >>
> >> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com>
> >> >> wrote:
> >> >> > The current default Lucene MergePolicy does not handle frequent
> >> >> > updates well.
> >> >> >
> >> >> > We have done some performance analysis with that and a custom merge
> >> >> > policy:
> >> >> >
> >> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >> >> >
> >> >> > -John
> >> >> >
> >> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> >> >> > jason.rutherg...@gmail.com> wrote:
> >> >> >
> >> >> >> I opened SOLR-1447 for this
> >> >> >>
> >> >> >> 2009/9/18 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
> >> >> >> > We can use a simple reflection based implementation to simplify
> >> >> >> > reading too many parameters.
> >> >> >> >
> >> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> >> >> > altogether. It should only be aware of specific Objects and
> >> >> >> > interfaces. If users wish to plug in something else in some other
> >> >> >> > way, it should be fine.
> >> >> >> >
> >> >> >> > There is a huge learning curve involved in the current
> >> >> >> > solrconfig.xml. Let us not make people throw that away.
> >> >> >> >
> >> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> >> >> > <jason.rutherg...@gmail.com> wrote:
> >> >> >> >> Over the weekend I may write a patch to allow simple reflection
> >> >> >> >> based injection from within solrconfig.
> >> >> >> >>
> >> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> >> >> <yo...@lucidimagination.com> wrote:
> >> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >> >> >>> <shalinman...@gmail.com> wrote:
> >> >> >> >>>>> I was wondering if there is a way I can modify
> >> >> >> >>>>> calibrateSizeByDeletes just by configuration?
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
> >> >> >> >>>> in the constructor.
> >> >> >> >>>> However, please open a Jira issue so we don't forget about it.
> >> >> >> >>>
> >> >> >> >>> It's the continuing stuff like this that makes me feel like we
> >> >> >> >>> should be Spring (or equivalent) based someday... I'm just not
> >> >> >> >>> sure how we're going to get there.
> >> >> >> >>>
> >> >> >> >>> -Yonik
> >> >> >> >>> http://www.lucidimagination.com
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > -----------------------------------------------------
> >> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >
> >
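Shalin's suggested workaround can be sketched as below. The base class here is a stand-in so the example compiles standalone; the real subclass would extend Lucene's org.apache.lucene.index.LogByteSizeMergePolicy, and this only illustrates the constructor-flips-the-flag pattern:

```java
// Stand-in for Lucene's LogByteSizeMergePolicy, reduced to the one
// setting this workaround cares about (the real class lives in
// org.apache.lucene.index and has many more knobs).
class LogByteSizeMergePolicy {
    private boolean calibrateSizeByDeletes = false;

    public void setCalibrateSizeByDeletes(boolean calibrate) {
        this.calibrateSizeByDeletes = calibrate;
    }

    public boolean getCalibrateSizeByDeletes() {
        return calibrateSizeByDeletes;
    }
}

// The subclass Shalin describes: enable calibrateSizeByDeletes in the
// constructor so Solr can instantiate the policy by class name from
// solrconfig.xml without needing setter injection.
public class DeleteCalibratedMergePolicy extends LogByteSizeMergePolicy {
    public DeleteCalibratedMergePolicy() {
        setCalibrateSizeByDeletes(true);
    }
}
```

This kind of one-line subclass is exactly the boilerplate that the reflection-based injection discussed earlier in the thread would make unnecessary.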