John,

I think that inherent in your test is a uniform distribution of updates.

This seems unrealistic to me, not least because any distribution of updates
caused by a population of objects interacting with each other should be
translation invariant in time, which is something a uniform distribution
just cannot be.

The only plausible way I can see to cause a uniform distribution of updates
is a global update to many entries.  Such a global update problem usually
indicates that the object set should be factored into objects and
properties.  Then what was a global update becomes an update to a single
property.  The cost of fetching an object with all updated properties is a
secondary retrieval to elaborate the state implied by the properties.  This
can literally be done in a single additional Lucene query since all property
keys will be available from the object fetch.  Moreover, you generally have
far fewer unique properties than you have objects so the property fetch is
blindingly fast.
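To make the factoring concrete, here is a minimal sketch of the two-phase
fetch in plain Java.  Plain maps stand in for the object and property
indexes, and all names here are illustrative, not Lucene API; in Lucene the
second phase would be the single additional query over the property keys
returned by the object fetch.

```java
import java.util.*;

// Schematic of the object/property factoring described above.  Maps stand
// in for the two indexes; this is a sketch of the pattern, not Lucene code.
public class PropertyFactoring {
    // "Object index": each object stores only the keys of its properties.
    static final Map<String, List<String>> objects = new HashMap<>();
    // "Property index": one entry per unique property, shared by many
    // objects, so a global update becomes a write to a single entry here.
    static final Map<String, String> properties = new HashMap<>();

    // Fetch an object, then elaborate its state with one secondary lookup
    // over the property keys that came back from the object fetch.
    static Map<String, String> fetch(String objectId) {
        Map<String, String> state = new LinkedHashMap<>();
        for (String key : objects.getOrDefault(objectId, List.of())) {
            state.put(key, properties.get(key)); // the second "query"
        }
        return state;
    }

    public static void main(String[] args) {
        objects.put("doc1", List.of("status", "region"));
        objects.put("doc2", List.of("status"));
        properties.put("status", "active");
        properties.put("region", "us-east");

        // A "global update" is now a single write to the property table,
        // not a rewrite of every object that carries the property.
        properties.put("status", "archived");
        System.out.println(fetch("doc1")); // both docs see the new value
        System.out.println(fetch("doc2"));
    }
}
```

Because there are far fewer unique properties than objects, the second
lookup touches a tiny table, which is why the property fetch is so fast.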

My own experience is that natural update rates almost invariably decay over
time and that the peak rate of updates varies dramatically between objects.
Both of these factors mean that the objects being updated should be
predominantly objects that were updated recently.  Before long, this kind
of distribution should result in the rate of updates per item being much
lower for the larger segments.
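A toy calculation illustrates the point.  The decay rate and object counts
below are invented for illustration only; the claim is just that geometric
decay concentrates updates on young objects.

```java
// Toy numbers for the decay argument above (rates are invented): if each
// object's update rate decays geometrically with age, then objects old
// enough to live in the large, long-since-merged segments see few updates.
public class DecayingUpdates {
    // Expected updates in the next interval for an object of a given age,
    // assuming the peak rate halves each interval (an assumption, not data).
    static double rate(double peak, int age) {
        return peak * Math.pow(0.5, age);
    }

    public static void main(String[] args) {
        double youngTotal = 0, oldTotal = 0;
        // 100 young objects (ages 0-4) vs 1000 old objects (ages 10-19):
        for (int age = 0; age < 5; age++)   youngTotal += 100 * rate(10.0, age);
        for (int age = 10; age < 20; age++) oldTotal  += 1000 * rate(10.0, age);
        System.out.printf("young: %.1f updates, old: %.1f updates%n",
                          youngTotal, oldTotal);
        // Despite being 10x more numerous, the old objects collect far
        // fewer updates, so the per-item rate of deletes in the large
        // segments stays low.
    }
}
```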

Can you say more about what motivates your test model and where I am wrong
about your situation?

On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.w...@gmail.com> wrote:

> Jason:
>
>    Before jumping into any conclusions, let me describe the test setup. It
> is rather different from Lucene benchmark as we are testing high updates in
> a realtime environment:
>
>    We took a public corpus, medline, and indexed it to approximately 3
> million docs, then updated all the docs over and over again for a 10-hour
> duration.
>
>    The only differences in the code used were the different MergePolicy
> settings applied.
>
>    Taking the variable of HW/OS out of the equation, let's ignore the
> absolute numbers and compare the relative numbers between the two runs.
>
>    The spike is due to the merging of a large segment once enough updates
> have accumulated. The graph/perf numbers fit our hypothesis that the
> default MergePolicy chooses to merge small segments before large ones and
> does not handle segments with a high number of deletes well.
>
>     Merging is BOTH IO and CPU intensive, especially for large segments.
>
>     I think the wiki explains it pretty well.
>
>     What you are saying is true about the IO cache w.r.t. merging. Every
> time new files are created, old files in the IO cache are invalidated. As
> the experiment shows, this is detrimental to query performance when large
> segments are being merged.
>
>     "As we move to a sharded model of indexes, large merges will
> naturally not occur." Our test is on a 3 million document index, not very
> large for a single shard. Some katta people have run it on a much much
> larger index per shard. Saying large merges will not occur on indexes of
> this size IMHO is unfounded.
>
> -John
>
> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> John,
>>
>> It would be great if Lucene's benchmark were used so everyone
>> could execute the test in their own environment and verify the
>> results. The settings and code used to generate them are not
>> clear, so it's difficult to draw any reliable conclusions.
>>
>> The steep spike shows greater evidence for the IO cache being
>> cleared during large merges, resulting in search performance
>> degradation. See:
>> http://www.lucidimagination.com/search/?q=madvise
>>
>> Merging is IO intensive and less CPU intensive; however, if the
>> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> the CPU could be maxed out. Using a single thread on
>> synchronous spinning magnetic media seems more logical. Queries
>> are usually the inverse: CPU intensive, not IO intensive, when
>> the index is in the IO cache. After merging a large segment (or
>> during), queries would start hitting disk, and the results
>> clearly show that. The queries suddenly become more time
>> consuming as they seek on disk at a time when IO activity is at
>> its peak from merging large segments. Using madvise would
>> prevent usable index pages from being evicted during a merge,
>> so query performance would continue unabated.
>>
>> As we move to a sharded model of indexes, large merges will
>> naturally not occur. Shards will reach a specified size and new
>> documents will be sent to new shards.
>>
>> -J
>>
>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com> wrote:
>> > The current default Lucene MergePolicy does not handle frequent updates
>> > well.
>> >
>> > We have done some performance analysis with that and a custom merge
>> policy:
>> >
>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >
>> > -John
>> >
>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> I opened SOLR-1447 for this
>> >>
>> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com>:
>> >> > We can use a simple reflection-based implementation to simplify
>> >> > reading the many parameters.
>> >> >
>> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> > altogether. It should only be aware of specific Objects and
>> >> > interfaces. If users wish to plug in something else in some
>> >> > other way, it should be fine.
>> >> >
>> >> >
>> >> > There is a huge amount of learning involved in the current
>> >> > solrconfig.xml. Let us not make people throw away that.
>> >> >
>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> > <jason.rutherg...@gmail.com> wrote:
>> >> >> Over the weekend I may write a patch to allow simple
>> >> >> reflection-based injection from within solrconfig.
>> >> >>
>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> <yo...@lucidimagination.com> wrote:
>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >>> <shalinman...@gmail.com> wrote:
>> >> >>>>> I was wondering if there is a way I can modify
>> >> >>>>> calibrateSizeByDeletes just by configuration?
>> >> >>>>>
>> >> >>>>
>> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
>> >> >>>> in the constructor. However, please open a Jira issue so we
>> >> >>>> don't forget about it.
>> >> >>>
>> >> >>> It's the continuing stuff like this that makes me feel like
>> >> >>> we should be Spring (or equivalent) based someday... I'm just
>> >> >>> not sure how we're going to get there.
>> >> >>>
>> >> >>> -Yonik
>> >> >>> http://www.lucidimagination.com
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > -----------------------------------------------------
>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>


-- 
Ted Dunning, CTO
DeepDyve
