Hi Ted:

     In our case it is profile updates. Each profile -> 1 document keyed on
member id.

     We do see people updating their profiles, and our working assumption is
that every member is likely to update their profile (that is a bit
aggressive, I'd agree, but it is nevertheless a safe upper bound).

     In our scenario, there are 2 types of realtime updates:

1) every document can be updated (within a shard)
2) add-only, e.g. tweets etc.

     In our test, we aimed at 1)

-John

On Tue, Sep 22, 2009 at 8:28 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> John,
>
> I think that inherent in your test is a uniform distribution of updates.
>
> This seems unrealistic to me, not least because any distribution of updates
> caused by a population of objects interacting with each other should be
> translation invariant in time, which is something a uniform distribution just
> cannot be.
>
> The only plausible way I can see to cause uniform distribution of updates
> is a global update to many entries.  Such a global update problem usually
> indicates that the object set should be factored into objects and
> properties.  Then what was a global update becomes an update to a single
> property.  The cost of fetching an object with all updated properties is a
> secondary retrieval to elaborate the state implied by the properties.  This
> can literally be done in a single additional Lucene query since all property
> keys will be available from the object fetch.  Moreover, you generally have
> far fewer unique properties than you have objects, so the property fetch is
> blindingly fast.
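>
> Roughly, that two-step fetch looks like this (just a sketch of the pattern;
> the field names and the 100-property cap are illustrative assumptions, not
> anything from an actual schema):
>
>   import java.io.IOException;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.search.*;
>
>   Document loadObjectWithProperties(IndexSearcher objectSearcher,
>                                     IndexSearcher propertySearcher,
>                                     String memberId) throws IOException {
>     // primary fetch: the object document itself
>     TopDocs hits = objectSearcher.search(
>         new TermQuery(new Term("memberId", memberId)), 1);
>     if (hits.totalHits == 0) return null;
>     Document obj = objectSearcher.doc(hits.scoreDocs[0].doc);
>
>     // secondary fetch: one additional query for all property keys
>     // stored on the object (the property index is far smaller)
>     BooleanQuery propQuery = new BooleanQuery();
>     for (String key : obj.getValues("propertyKey")) {
>       propQuery.add(new TermQuery(new Term("propertyId", key)),
>                     BooleanClause.Occur.SHOULD);
>     }
>     TopDocs props = propertySearcher.search(propQuery, 100);
>     // ...elaborate the object's state from the property documents...
>     return obj;
>   }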
>
> My own experience is that natural update rates almost invariably decay over
> time and that the peak rate of updates varies dramatically between objects.
> Both of these factors mean that the objects being updated should be
> predominantly objects that were updated recently.  Rather quickly, this kind
> of distribution should result in the rate of updates per item being much
> lower for the larger segments.
>
> Can you say more about what motivates your test model and where I am wrong
> about your situation?
>
>
> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.w...@gmail.com> wrote:
>
>> Jason:
>>
>>    Before jumping to any conclusions, let me describe the test setup. It
>> is rather different from the Lucene benchmark, as we are testing high
>> update rates in a realtime environment:
>>
>>    We took a public corpus (Medline), indexed it to approximately 3 million
>> docs, and updated all the docs over and over again for a 10-hour duration.
>>
>>    The only difference in the code between the two runs was the
>> MergePolicy setting applied.
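>>
>>    For reference, the difference amounts to roughly this (a sketch, not
>> the actual benchmark code; imports omitted, Lucene 2.x-style IndexWriter
>> setters assumed, and the no-arg ZoieMergePolicy constructor is my
>> assumption):
>>
>>   IndexWriter openWriter(Directory dir, Analyzer analyzer, boolean zoie)
>>       throws IOException {
>>     IndexWriter w = new IndexWriter(dir, analyzer,
>>         IndexWriter.MaxFieldLength.UNLIMITED);
>>     if (zoie) {
>>       // custom policy under test (see the wiki linked below)
>>       w.setMergePolicy(new ZoieMergePolicy());
>>     }
>>     // otherwise the writer keeps Lucene's default LogByteSizeMergePolicy
>>     return w;
>>   }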
>>
>>    Taking the variable of HW/OS out of the equation, let's ignore the
>> absolute numbers and compare the relative numbers between the two runs.
>>
>>    The spike is due to the merging of a large segment as updates
>> accumulate. The graph/perf numbers fit our hypothesis that the default
>> MergePolicy chooses to merge small segments before large ones and does not
>> handle segments with a high number of deletes well.
>>
>>     Merging is BOTH IO and CPU intensive, especially for large segments.
>>
>>     I think the wiki explains it pretty well.
>>
>>     What you are saying about the IO cache w.r.t. merging is true. Every
>> time new files are created, the old files in the IO cache are invalidated.
>> As the experiment shows, this is detrimental to query performance when
>> large segments are being merged.
>>
>>     "As we move to a sharded model of indexes, large merges will
>> naturally not occur." Our test is on a 3 million document index, not very
>> large for a single shard. Some katta people have run it on a much much
>> larger index per shard. Saying large merges will not occur on indexes of
>> this size IMHO is unfounded.
>>
>> -John
>>
>> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
>> jason.rutherg...@gmail.com> wrote:
>>
>>> John,
>>>
>>> It would be great if Lucene's benchmark were used so everyone
>>> could execute the test in their own environment and verify. It's
>>> not clear the settings or code used to generate the results so
>>> it's difficult to draw any reliable conclusions.
>>>
>>> The steep spike shows greater evidence for the IO cache being
>>> cleared during large merges resulting in search performance
>>> degradation. See:
>>> http://www.lucidimagination.com/search/?q=madvise
>>>
>>> Merging is IO intensive and less CPU intensive, but if the
>>> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>>> then the CPU could be maxed out. Using a single thread on
>>> synchronous spinning magnetic media seems more logical. Queries
>>> are usually the inverse: CPU intensive, not IO intensive, when
>>> the index is in the IO cache. After merging a large segment (or
>>> during the merge), queries would start hitting disk, and the
>>> results clearly show that. The queries are suddenly more time
>>> consuming as they seek on disk at a time when IO activity is at
>>> its peak from merging large segments. Using madvise would
>>> prevent usable indexes from being swapped to disk during a
>>> merge, so query performance would continue unabated.
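>>>
>>> Concretely, capping merges at a single thread is something
>>> like this (a sketch; Lucene 2.x-style IndexWriter setters
>>> assumed):
>>>
>>>   ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
>>>   cms.setMaxThreadCount(1); // one merge thread for spinning disks
>>>   writer.setMergeScheduler(cms);
>>>   // or run merges synchronously in the indexing thread:
>>>   // writer.setMergeScheduler(new SerialMergeScheduler());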
>>>
>>> As we move to a sharded model of indexes, large merges will
>>> naturally not occur. Shards will reach a specified size and new
>>> documents will be sent to new shards.
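>>>
>>> In code, that routing amounts to something like the following
>>> (purely an illustrative sketch with imports omitted; the class
>>> and method names are made up, not an existing API):
>>>
>>>   abstract class RollingShardRouter {
>>>     private final int maxDocsPerShard;
>>>     private IndexWriter current;
>>>     private int docsInCurrent;
>>>
>>>     RollingShardRouter(int maxDocsPerShard, IndexWriter first) {
>>>       this.maxDocsPerShard = maxDocsPerShard;
>>>       this.current = first;
>>>     }
>>>
>>>     void add(Document doc) throws IOException {
>>>       if (docsInCurrent >= maxDocsPerShard) {
>>>         current.close();          // frozen shard: no more big merges
>>>         current = openNewShard(); // writer on a fresh directory
>>>         docsInCurrent = 0;
>>>       }
>>>       current.addDocument(doc);
>>>       docsInCurrent++;
>>>     }
>>>
>>>     abstract IndexWriter openNewShard() throws IOException;
>>>   }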
>>>
>>> -J
>>>
>>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com> wrote:
>>> > The current default Lucene MergePolicy does not handle frequent updates
>>> > well.
>>> >
>>> > We have done some performance analysis with that and a custom merge
>>> > policy:
>>> >
>>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>>> >
>>> > -John
>>> >
>>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>>> > jason.rutherg...@gmail.com> wrote:
>>> >
>>> >> I opened SOLR-1447 for this
>>> >>
>>> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com>:
>>> >> > We can use a simple reflection-based implementation to simplify
>>> >> > reading the large number of parameters.
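>>> >> >
>>> >> > For example, something along these lines (a rough sketch of the
>>> >> > idea only, not an actual patch; it only handles String, int,
>>> >> > boolean and double setters):
>>> >> >
>>> >> >   import java.lang.reflect.Method;
>>> >> >   import java.util.Map;
>>> >> >
>>> >> >   static void applyParams(Object target, Map<String, String> params)
>>> >> >       throws Exception {
>>> >> >     for (Map.Entry<String, String> e : params.entrySet()) {
>>> >> >       String name = e.getKey();
>>> >> >       String setter = "set" + Character.toUpperCase(name.charAt(0))
>>> >> >           + name.substring(1);
>>> >> >       for (Method m : target.getClass().getMethods()) {
>>> >> >         if (!m.getName().equals(setter)
>>> >> >             || m.getParameterTypes().length != 1) continue;
>>> >> >         Class<?> t = m.getParameterTypes()[0];
>>> >> >         Object v;
>>> >> >         if (t == boolean.class)     v = Boolean.valueOf(e.getValue());
>>> >> >         else if (t == int.class)    v = Integer.valueOf(e.getValue());
>>> >> >         else if (t == double.class) v = Double.valueOf(e.getValue());
>>> >> >         else                        v = e.getValue();
>>> >> >         m.invoke(target, v);
>>> >> >         break;
>>> >> >       }
>>> >> >     }
>>> >> >   }
>>> >> >
>>> >> >   // e.g. applyParams(mergePolicy, paramsFromConfig) would cover a
>>> >> >   // setting like calibrateSizeByDeletes without bespoke parsing.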
>>> >> >
>>> >> > What I wish to emphasize is that Solr should be agnostic of xml
>>> >> > altogether. It should only be aware of specific Objects and
>>> >> > interfaces. If users wish to plug in something else in some other
>>> >> > way, it should be fine.
>>> >> >
>>> >> >
>>> >> > There is a huge investment involved in learning the current
>>> >> > solrconfig.xml. Let us not make people throw that away.
>>> >> >
>>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>>> >> > <jason.rutherg...@gmail.com> wrote:
>>> >> >> Over the weekend I may write a patch to allow simple
>>> >> >> reflection-based injection from within solrconfig.
>>> >> >>
>>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>>> >> >> <yo...@lucidimagination.com> wrote:
>>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>>> >> >>> <shalinman...@gmail.com> wrote:
>>> >> >>>>> I was wondering if there is a way I can modify
>>> >> >>>>> calibrateSizeByDeletes just by configuration?
>>> >> >>>>>
>>> >> >>>>
>>> >> >>>> Alas, no. The only option that I see for you is to subclass
>>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
>>> >> >>>> the constructor. However, please open a Jira issue so we don't
>>> >> >>>> forget about it.
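>>> >> >>>>
>>> >> >>>> Something like this (assuming a Lucene version where
>>> >> >>>> LogByteSizeMergePolicy has a no-arg constructor and a
>>> >> >>>> setCalibrateSizeByDeletes setter):
>>> >> >>>>
>>> >> >>>>   public class DeleteAwarePolicy extends LogByteSizeMergePolicy {
>>> >> >>>>     public DeleteAwarePolicy() {
>>> >> >>>>       // weigh segment sizes by live (non-deleted) docs
>>> >> >>>>       setCalibrateSizeByDeletes(true);
>>> >> >>>>     }
>>> >> >>>>   }
>>> >> >>>>
>>> >> >>>>   // then point the mergePolicy setting in solrconfig.xml at
>>> >> >>>>   // this class instead of the default policy.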
>>> >> >>>
>>> >> >>> It's the continuing stuff like this that makes me feel like we
>>> >> >>> should be Spring (or equivalent) based someday... I'm just not
>>> >> >>> sure how we're going to get there.
>>> >> >>>
>>> >> >>> -Yonik
>>> >> >>> http://www.lucidimagination.com
>>> >> >>>
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > -----------------------------------------------------
>>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>>> >> >
>>> >>
>>> >
>>>
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
>
