Jason:

   Before jumping to any conclusions, let me describe the test setup. It
is rather different from the Lucene benchmark, as we are testing a high
update rate in a realtime environment:

   We took a public corpus, medline, and indexed approximately 3 million
docs. We then updated all the docs over and over again for 10 hours.

   The only difference in the code between the runs was the MergePolicy
settings applied.

   Taking the variable of HW/OS out of the equation, let's ignore the
absolute numbers and compare the relative numbers between the two runs.

   The spike is due to the merging of a large segment as segments accumulate.
The graph/perf numbers fit our hypothesis that the default MergePolicy
chooses to merge small segments before large ones and does not handle
segments with a high number of deletes well.
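
   To make the point about deletes concrete, here is a minimal sketch of
what "calibrating size by deletes" means. It is a simplified model and not
the actual Lucene or Zoie code: the class name SegmentSizeSketch and the
method calibratedSize are made up for illustration. The idea is that a
segment's byte size is discounted by its fraction of deleted docs, so a
large segment that is mostly deletes looks small to the merge policy.

```java
public class SegmentSizeSketch {

    /**
     * Hypothetical helper: returns the segment's byte size scaled by the
     * fraction of docs that are still live. With no docs recorded, the raw
     * size is returned unchanged.
     */
    public static long calibratedSize(long byteSize, int maxDoc, int delCount) {
        if (maxDoc <= 0) {
            return byteSize;
        }
        // Integer arithmetic avoids floating point rounding in the sketch.
        return byteSize * (maxDoc - delCount) / maxDoc;
    }

    public static void main(String[] args) {
        long rawBytes = 1_000_000_000L;

        // No deletes: the segment counts at its full size.
        System.out.println(calibratedSize(rawBytes, 3_000_000, 0));         // 1000000000

        // 90% of the docs deleted: the segment counts as one tenth the size.
        System.out.println(calibratedSize(rawBytes, 3_000_000, 2_700_000)); // 100000000
    }
}
```

   Without this kind of discount, a segment that is 90% deletes is still
treated as a 1 GB segment when merges are chosen.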

    Merging is BOTH IO and CPU intensive, especially for large merges.

    I think the wiki explains it pretty well.

    What you are saying about the IO cache w.r.t. merging is true. Every time
new files are created, old files in the IO cache are invalidated. As the
experiment shows, this is detrimental to query performance when large
segments are being merged.

    "As we move to a sharded model of indexes, large merges will
naturally not occur." Our test is on a 3 million document index, not very
large for a single shard. Some katta people have run it on a much much
larger index per shard. Saying large merges will not occur on indexes of
this size IMHO is unfounded.

-John

On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> John,
>
> It would be great if Lucene's benchmark were used so everyone
> could execute the test in their own environment and verify. The
> settings and code used to generate the results are not clear, so
> it's difficult to draw any reliable conclusions.
>
> The steep spike shows greater evidence for the IO cache being
> cleared during large merges resulting in search performance
> degradation. See:
> http://www.lucidimagination.com/search/?q=madvise
>
> Merging is IO intensive and less CPU intensive. If the
> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> then the CPU could be maxed out. Using a single thread on
> synchronous spinning magnetic media seems more logical. Queries
> are usually the inverse: CPU intensive, not IO intensive, when
> the index is in the IO cache. After merging a large segment (or
> during), queries would start hitting disk, and the results
> clearly show that. The queries are suddenly more time consuming
> as they seek on disk at a time when IO activity is at its peak
> from merging large segments. Using madvise would prevent usable
> indexes from being swapped to disk during a merge, so query
> performance would continue unabated.
>
> As we move to a sharded model of indexes, large merges will
> naturally not occur. Shards will reach a specified size and new
> documents will be sent to new shards.
>
> -J
>
> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com> wrote:
> > The current default Lucene MergePolicy does not handle frequent updates
> > well.
> >
> > We have done some performance analysis with that and a custom merge
> > policy:
> >
> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >
> > -John
> >
> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> I opened SOLR-1447 for this
> >>
> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com>:
> >> > We can use a simple reflection based implementation to simplify
> >> > reading too many parameters.
> >> >
> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> > altogether. It should only be aware of specific Objects and
> >> > interfaces. If users wish to plug in something else in some other
> >> > way, that should be fine.
> >> >
> >> >
> >> > There is a huge amount of learning invested in the current
> >> > solrconfig.xml. Let us not make people throw that away.
> >> >
> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> > <jason.rutherg...@gmail.com> wrote:
> >> >> Over the weekend I may write a patch to allow simple reflection based
> >> >> injection from within solrconfig.
> >> >>
> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> <yo...@lucidimagination.com> wrote:
> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >>> <shalinman...@gmail.com> wrote:
> >> >>>>> I was wondering if there is a way I can modify
> >> >>>>> calibrateSizeByDeletes just by configuration?
> >> >>>>>
> >> >>>>
> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
> >> >>>> the constructor. However, please open a Jira issue so we don't
> >> >>>> forget about it.
> >> >>>
> >> >>> It's the continuing stuff like this that makes me feel like we
> >> >>> should be Spring (or equivalent) based someday... I'm just not
> >> >>> sure how we're going to get there.
> >> >>>
> >> >>> -Yonik
> >> >>> http://www.lucidimagination.com
> >> >>>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > -----------------------------------------------------
> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>
