John, are you using IndexWriter.setMergedSegmentWarmer, so that a newly merged segment is warmed before it's "put into production" (returned by getReader)?
Mike

On Mon, Sep 21, 2009 at 9:35 PM, John Wang <john.w...@gmail.com> wrote:
> Jason:
>
> You are missing the point.
>
> The idea is to avoid merging of large segments. The point of this
> MergePolicy is to balance segment merges across the index. The aim is not to
> have 1 large segment, it is to have n segments with balanced sizes.
>
> When the large segment is out of the IO cache, replacing it is very
> costly. What we have done is to split the cost over time by having more
> frequent but faster merges.
>
> I am not suggesting Lucene's default mergePolicy isn't good, it is just
> not suitable for our case where there are high updates introducing tons of
> deletes. The fact that the api is nice enough to allow MergePolicies to be
> plugged in is a good thing.
>
> Please DO read the wiki.
>
> -John
>
> On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>>
>> I'm not sure I communicated the idea properly. If CMS is set to
>> 1 thread, no matter how intensive the CPU for a merge, it's
>> limited to 1 core of what is in many cases a 4 or 8 core server.
>> That leaves the other 3 or 7 cores for queries, which if slow,
>> indicates that it isn't the merging that's slowing down queries,
>> but the dumping of the queried segments from the system IO cache.
>>
>> This holds true regardless of the merge policy used. So while a
>> new merge policy sounds great, unless the system IO cache
>> problem is solved, there will always be a lingering problem in
>> regards to large merges with a regularly updated index. Avoiding
>> large merges probably isn't the answer. And
>> LogByteSizeMergePolicy somewhat allows managing the size of the
>> segments merged already. I would personally prefer being able to
>> merge segments up to a given estimated size, which requires
>> LUCENE-1076 to do well.
>>
>> > is rather different from Lucene benchmark as we are testing
>> > high updates in a realtime environment
>>
>> Lucene's benchmark allows this.
>> NearRealtimeReaderTask is a good place to start.
>>
>> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.w...@gmail.com> wrote:
>> > Jason:
>> >
>> > Before jumping to any conclusions, let me describe the test setup. It
>> > is rather different from Lucene benchmark as we are testing high updates
>> > in a realtime environment:
>> >
>> > We took a public corpus: medline, indexed to approximately 3 million
>> > docs. And update all the docs over and over again for a 10 hour
>> > duration.
>> >
>> > The only differences in the code used were where the different
>> > MergePolicy settings were applied.
>> >
>> > Taking the variable of HW/OS out of the equation, let's ignore the
>> > absolute numbers and compare the relative numbers between the two runs.
>> >
>> > The spike is due to merging of a large segment when we accumulate. The
>> > graph/perf numbers fit our hypothesis that the default MergePolicy
>> > chooses to merge small segments before large ones and does not handle
>> > segments with a high number of deletes well.
>> >
>> > Merging is BOTH IO and CPU intensive. Especially large ones.
>> >
>> > I think the wiki explains it pretty well.
>> >
>> > What you are saying is true with IO cache w.r.t. merge. Every time new
>> > files are created, old files in the IO cache are invalidated. As the
>> > experiment shows, this is detrimental to query performance when large
>> > segments are being merged.
>> >
>> > "As we move to a sharded model of indexes, large merges will
>> > naturally not occur." Our test is on a 3 million document index, not
>> > very large for a single shard. Some katta people have run it on a much
>> > much larger index per shard. Saying large merges will not occur on
>> > indexes of this size IMHO is unfounded.
>> >
>> > -John
>> >
>> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
>> > <jason.rutherg...@gmail.com> wrote:
>> >>
>> >> John,
>> >>
>> >> It would be great if Lucene's benchmark were used so everyone
>> >> could execute the test in their own environment and verify. It's
>> >> not clear what settings or code were used to generate the results,
>> >> so it's difficult to draw any reliable conclusions.
>> >>
>> >> The steep spike shows greater evidence for the IO cache being
>> >> cleared during large merges resulting in search performance
>> >> degradation. See:
>> >> http://www.lucidimagination.com/search/?q=madvise
>> >>
>> >> Merging is IO intensive, less CPU intensive. If the
>> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> >> then the CPU could be maxed out. Using a single thread on
>> >> synchronous spinning magnetic media seems more logical. Queries
>> >> are usually the inverse, CPU intensive, not IO intensive when
>> >> the index is in the IO cache. After merging a large segment (or
>> >> during), queries would start hitting disk, and the results
>> >> clearly show that. The queries are suddenly more time consuming
>> >> as they seek on disk at a time when IO activity is at its peak
>> >> from merging large segments. Using madvise would prevent usable
>> >> indexes from being swapped to disk during a merge, and query
>> >> performance would continue unabated.
>> >>
>> >> As we move to a sharded model of indexes, large merges will
>> >> naturally not occur. Shards will reach a specified size and new
>> >> documents will be sent to new shards.
>> >>
>> >> -J
>> >>
>> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com>
>> >> wrote:
>> >> > The current default Lucene MergePolicy does not handle frequent
>> >> > updates well.
>> >> >
>> >> > We have done some performance analysis with that and a custom merge
>> >> > policy:
>> >> >
>> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >> >
>> >> > -John
>> >> >
>> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> >> > jason.rutherg...@gmail.com> wrote:
>> >> >
>> >> >> I opened SOLR-1447 for this
>> >> >>
>> >> >> 2009/9/18 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
>> >> >> > We can use a simple reflection based implementation to simplify
>> >> >> > reading too many parameters.
>> >> >> >
>> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> >> > altogether. It should only be aware of specific Objects and
>> >> >> > interfaces. If users wish to plug in something else in some other
>> >> >> > way, it should be fine.
>> >> >> >
>> >> >> >
>> >> >> > There is a huge amount of learning involved in learning the current
>> >> >> > solrconfig.xml. Let us not make people throw that away.
>> >> >> >
>> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> >> > <jason.rutherg...@gmail.com> wrote:
>> >> >> >> Over the weekend I may write a patch to allow simple reflection
>> >> >> >> based injection from within solrconfig.
>> >> >> >>
>> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> >> <yo...@lucidimagination.com> wrote:
>> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >> >>> <shalinman...@gmail.com> wrote:
>> >> >> >>>>> I was wondering if there is a way I can modify
>> >> >> >>>>> calibrateSizeByDeletes just by configuration ?
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
>> >> >> >>>> in the constructor. However, please open a Jira issue so we
>> >> >> >>>> don't forget about it.
>> >> >> >>>
>> >> >> >>> It's the continuing stuff like this that makes me feel like we
>> >> >> >>> should be Spring (or equivalent) based someday... I'm just not
>> >> >> >>> sure how we're going to get there.
>> >> >> >>>
>> >> >> >>> -Yonik
>> >> >> >>> http://www.lucidimagination.com
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > -----------------------------------------------------
>> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
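
For readers following the warming question at the top of the thread: the idea that a merged segment is warmed before it is "put into production" can be pictured with a toy model. This is only a sketch under stated assumptions. The class WarmBeforePublish, its Segment type, and the methods setWarmer, addSegment, mergeAll and getReader are hypothetical stand-ins written for illustration; they are not Lucene classes and do not reproduce IndexWriter's actual behaviour or signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model only: hypothetical stand-ins, NOT Lucene classes. */
final class WarmBeforePublish {

    /** Hypothetical stand-in for an index segment. */
    static final class Segment {
        final String name;
        boolean warmed = false;

        Segment(String name) {
            this.name = name;
        }
    }

    // Segments currently visible to searchers.
    private final List<Segment> live = new ArrayList<>();

    // Callback run on a merged segment before it becomes visible; no-op by default.
    private Consumer<Segment> warmer = s -> { };

    void setWarmer(Consumer<Segment> warmer) {
        this.warmer = warmer;
    }

    void addSegment(String name) {
        live.add(new Segment(name));
    }

    /** Replaces all live segments with one merged segment, warming it first. */
    Segment mergeAll(String mergedName) {
        Segment merged = new Segment(mergedName);
        warmer.accept(merged); // warm while the old segments still serve queries
        live.clear();          // only now swap the old segments out
        live.add(merged);      // and publish the merged one
        return merged;
    }

    /** Snapshot of what a searcher would see right now. */
    List<Segment> getReader() {
        return new ArrayList<>(live);
    }
}
```

The ordering inside mergeAll is the whole point of the sketch: the warmer runs while the old segments are still the ones returned by getReader, so queries never see a cold merged segment. In a real setup the warmer would typically run representative queries or load caches against the new segment.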