Hi Ted: In our case it is profile updates. Each profile -> 1 document keyed on member id.
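Concretely, each update in our test is a keyed delete+add, along these lines (a sketch only; the "member_id" field name and the ProfileIndexer wrapper are illustrative, not our actual code):

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  class ProfileIndexer {
    private final IndexWriter writer;
    ProfileIndexer(IndexWriter writer) { this.writer = writer; }

    // one document per member: an update deletes the old document for
    // the key and adds the new one in a single atomic operation
    void updateProfile(String memberId, Document profileDoc) throws IOException {
      writer.updateDocument(new Term("member_id", memberId), profileDoc);
    }
  }

That delete+add is exactly why a high update rate turns into a high delete rate inside the existing segments.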
We do see members updating their profiles, and the assumption is that every member is likely to update their profile (that is a bit aggressive, I'd agree, but it is nevertheless a safe upper bound). In our scenario there are 2 types of realtime updates: 1) every document can be updated (within a shard), and 2) add-only, e.g. tweets etc. In our test, we aimed at 1).

-John

On Tue, Sep 22, 2009 at 8:28 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> John,
>
> I think that inherent in your test is a uniform distribution of updates.
>
> This seems unrealistic to me, not least because any distribution of
> updates caused by a population of objects interacting with each other
> should be translation invariant in time, which is something a uniform
> distribution just cannot be.
>
> The only plausible way I can see to cause a uniform distribution of
> updates is a global update to many entries. Such a global update problem
> usually indicates that the object set should be factored into objects and
> properties. Then what was a global update becomes an update to a single
> property. The cost of fetching an object with all updated properties is a
> secondary retrieval to elaborate the state implied by the properties. This
> can literally be done in a single additional Lucene query, since all
> property keys will be available from the object fetch. Moreover, you
> generally have far fewer unique properties than you have objects, so the
> property fetch is blindingly fast.
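>
> To make that concrete, the secondary retrieval is just one more query
> over the property keys returned with the object fetch. A sketch (the
> field names "propKey" and "propId" are invented here, against the
> 2.9-era API):
>
>   import java.io.IOException;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.search.*;
>
>   class PropertyFetch {
>     static Document fetchWithProperties(IndexSearcher searcher, int objDocId)
>         throws IOException {
>       // step 1: the object fetch already carries its property keys
>       Document obj = searcher.doc(objDocId);
>       String[] propKeys = obj.getValues("propKey");
>
>       // step 2: one additional query resolves all referenced properties
>       BooleanQuery propQuery = new BooleanQuery();
>       for (String key : propKeys) {
>         propQuery.add(new TermQuery(new Term("propId", key)),
>             BooleanClause.Occur.SHOULD);
>       }
>       TopDocs props = searcher.search(propQuery, Math.max(1, propKeys.length));
>       // ...elaborate the object's state from the property hits as needed...
>       return obj;
>     }
>   }
>
> Because there are few unique properties, that second query stays cheap.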
>
> My own experience is that natural update rates almost invariably decay
> over time and that the peak rate of updates varies dramatically between
> objects. Both of these factors mean that the objects being updated should
> be predominantly objects that were updated recently. Rather shortly, this
> kind of distribution should result in the rate of updates per item being
> much lower for the larger segments.
>
> Can you say more about what motivates your test model and where I am
> wrong about your situation?
>
> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.w...@gmail.com> wrote:
>
>> Jason:
>>
>> Before jumping to any conclusions, let me describe the test setup. It
>> is rather different from the Lucene benchmark, as we are testing high
>> update rates in a realtime environment:
>>
>> We took a public corpus (medline) and indexed it to approximately 3
>> million docs, then updated all the docs over and over again for a 10-hour
>> duration.
>>
>> The only difference in the code used was the MergePolicy setting
>> applied.
>>
>> Taking the variable of HW/OS out of the equation, let's ignore the
>> absolute numbers and compare the relative numbers between the two runs.
>>
>> The spike is due to the merging of a large segment as updates
>> accumulate. The graph/perf numbers fit our hypothesis: the default
>> MergePolicy chooses to merge small segments before large ones and does not
>> handle segments with a high number of deletes well.
>>
>> Merging is BOTH IO and CPU intensive, especially for large segments.
>>
>> I think the wiki explains it pretty well.
>>
>> What you are saying about the IO cache w.r.t. merging is true. Every
>> time new files are created, the old files in the IO cache are invalidated.
>> As the experiment shows, this is detrimental to query performance while
>> large segments are being merged.
>>
>> "As we move to a sharded model of indexes, large merges will naturally
>> not occur." Our test is on a 3 million document index, which is not very
>> large for a single shard. Some katta people have run it with a much, much
>> larger index per shard. Saying large merges will not occur on indexes of
>> this size is, IMHO, unfounded.
>>
>> -John
>>
>> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen <
>> jason.rutherg...@gmail.com> wrote:
>>
>>> John,
>>>
>>> It would be great if Lucene's benchmark were used, so everyone could
>>> execute the test in their own environment and verify it. It's not clear
>>> what settings or code were used to generate the results, so it's difficult
>>> to draw any reliable conclusions.
>>>
>>> The steep spike shows greater evidence for the IO cache being cleared
>>> during large merges, resulting in search performance degradation. See:
>>> http://www.lucidimagination.com/search/?q=madvise
>>>
>>> Merging is IO intensive and less CPU intensive; however, if the
>>> ConcurrentMergeScheduler is used, which defaults to 3 threads, the CPU
>>> could be maxed out. Using a single thread on synchronous spinning magnetic
>>> media seems more logical. Queries are usually the inverse: CPU intensive,
>>> not IO intensive, when the index is in the IO cache. After merging a large
>>> segment (or during the merge), queries would start hitting disk, and the
>>> results clearly show that. The queries suddenly become more time consuming
>>> as they seek on disk at a time when IO activity is at its peak from
>>> merging large segments. Using madvise would prevent usable index files
>>> from being swapped out during a merge, and query performance would
>>> continue unabated.
>>>
>>> As we move to a sharded model of indexes, large merges will naturally
>>> not occur. Shards will reach a specified size and new documents will be
>>> sent to new shards.
>>>
>>> -J
>>>
>>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com> wrote:
>>> > The current default Lucene MergePolicy does not handle frequent
>>> > updates well.
>>> >
>>> > We have done some performance analysis with that and a custom merge
>>> > policy:
>>> >
>>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>>> >
>>> > -John
>>> >
>>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>>> > jason.rutherg...@gmail.com> wrote:
>>> >
>>> >> I opened SOLR-1447 for this.
>>> >>
>>> >> 2009/9/18 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
>>> >> > We can use a simple reflection-based implementation to simplify
>>> >> > reading too many parameters.
>>> >> >
>>> >> > What I wish to emphasize is that Solr should be agnostic of XML
>>> >> > altogether. It should only be aware of specific objects and
>>> >> > interfaces. If users wish to plug in something else in some other
>>> >> > way, that should be fine.
>>> >> >
>>> >> > There is a huge investment involved in learning the current
>>> >> > solrconfig.xml. Let us not make people throw that away.
>>> >> >
>>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>>> >> > <jason.rutherg...@gmail.com> wrote:
>>> >> >> Over the weekend I may write a patch to allow simple
>>> >> >> reflection-based injection from within solrconfig.
>>> >> >>
>>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>>> >> >> <yo...@lucidimagination.com> wrote:
>>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>>> >> >>> <shalinman...@gmail.com> wrote:
>>> >> >>>>> I was wondering if there is a way I can modify
>>> >> >>>>> calibrateSizeByDeletes just by configuration?
>>> >> >>>>>
>>> >> >>>> Alas, no. The only option that I see for you is to sub-class
>>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
>>> >> >>>> the constructor.
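>>> >> >>>> Roughly like this (an untested sketch; the class name is made
>>> >> >>>> up, and the exact constructor signature varies across versions):
>>> >> >>>>
>>> >> >>>>   import org.apache.lucene.index.IndexWriter;
>>> >> >>>>   import org.apache.lucene.index.LogByteSizeMergePolicy;
>>> >> >>>>
>>> >> >>>>   public class CalibratedMergePolicy extends LogByteSizeMergePolicy {
>>> >> >>>>     public CalibratedMergePolicy(IndexWriter writer) {
>>> >> >>>>       super(writer); // 2.9-era ctor takes the writer
>>> >> >>>>       // count deletes when sizing segments, so mostly-deleted
>>> >> >>>>       // large segments become eligible for merging
>>> >> >>>>       setCalibrateSizeByDeletes(true);
>>> >> >>>>     }
>>> >> >>>>   }
>>> >> >>>>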
>>> >> >>>> However, please open a Jira issue so we don't forget about it.
>>> >> >>>
>>> >> >>> It's the continuing stuff like this that makes me feel like we
>>> >> >>> should be Spring (or equivalent) based someday... I'm just not
>>> >> >>> sure how we're going to get there.
>>> >> >>>
>>> >> >>> -Yonik
>>> >> >>> http://www.lucidimagination.com
>>> >> >
>>> >> > --
>>> >> > -----------------------------------------------------
>>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
> --
> Ted Dunning, CTO
> DeepDyve