Jason: I am not sure which "parameters" you are referring to either. Are you responding to the right email?
Anyhoot, I used the defaults for both MergePolicies. LogMergePolicy.setCalibrateSizeByDeletes was a contribution from us, taken out of ZMP, to normalize segment size using deleted doc counts, so it was part of ZMP. The idea with ZMP is to have a set of balanced-sized segments instead of one large segment (as I have been repeatedly describing on this email thread). To get this balance, we represent every point before the merge as a state modeled in a Viterbi algorithm, with a cost function for each type of merge; this is used to select the desired segment to merge. I hate to hijack a Lucene thread to discuss Zoie; feel free to post questions on the Zoie group for details.

-John

On Wed, Sep 23, 2009 at 1:56 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> John,
>
> I have a few questions in order to better understand, as the
> wiki does not reflect the entirety of what you're trying to
> describe.
>
> > But it is required to set up several parameters carefully to
> > get desired behavior.
>
> Which parameters are you referring to?
>
> What were the ZMP parameters used for the test?
>
> What was the number of CMS threads?
>
> It would be helpful to see a time-based table of the data used
> to generate the chart at the bottom, with the segment infos at
> regular intervals.
>
> What is the difference between how ZMP and
> LogMergePolicy.setCalibrateSizeByDeletes handle deletes?
>
> Are the queries using Zoie or Lucene's index searcher?
>
> Can you explain why the Viterbi algorithm was used and how it
> works in this context?
>
> -J
>
> On Mon, Sep 21, 2009 at 6:35 PM, John Wang <john.w...@gmail.com> wrote:
> > Jason:
> >
> > You are missing the point.
> >
> > The idea is to avoid merging of large segments. The point of this
> > MergePolicy is to balance segment merges across the index. The aim is
> > not to have one large segment; it is to have n segments with balanced
> > sizes.
> >
> > When the large segment is out of the IO cache, replacing it is very
> > costly.
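The cost-function selection John describes can be sketched roughly as follows. This is a hypothetical illustration of cost-driven merge selection over segment sizes, not Zoie's actual implementation (which models merge states with a Viterbi search; see the Zoie wiki for details); the cost function and class names here are invented for the sketch.

```java
// Hypothetical sketch: score every contiguous window of segments with a
// cost function and merge the cheapest window. The cost charges the total
// bytes rewritten, penalized by imbalance within the window, so the
// selector prefers merging small, similarly sized segments and avoids
// rewriting one huge segment over and over.
public class CostBasedMergeSelector {

    static double mergeCost(long[] sizes, int from, int to) {
        long total = 0, max = 0;
        for (int i = from; i < to; i++) {
            total += sizes[i];
            max = Math.max(max, sizes[i]);
        }
        // imbalance is 1.0 for perfectly balanced windows, larger otherwise
        double imbalance = (double) max * (to - from) / total;
        return total * imbalance;
    }

    // Returns {from, to} of the cheapest window of `width` adjacent segments.
    static int[] selectMerge(long[] sizes, int width) {
        double best = Double.MAX_VALUE;
        int[] pick = null;
        for (int from = 0; from + width <= sizes.length; from++) {
            double c = mergeCost(sizes, from, from + width);
            if (c < best) {
                best = c;
                pick = new int[] { from, from + width };
            }
        }
        return pick;
    }

    public static void main(String[] args) {
        // One huge segment followed by small ones: the selector skips the
        // large segment and merges the small tail instead.
        long[] sizes = { 4_000_000_000L, 60_000_000L, 50_000_000L, 55_000_000L };
        int[] pick = selectMerge(sizes, 3);
        System.out.println(pick[0] + "," + pick[1]); // picks segments 1..3
    }
}
```

A real policy would recompute this after every flush, which is how the index converges toward n balanced segments rather than one large one.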
> > What we have done is to split the cost over time by having more
> > frequent but faster merges.
> >
> > I am not suggesting Lucene's default MergePolicy isn't good; it is just
> > not suitable for our case, where high update rates introduce tons of
> > deletes. The fact that the API is nice enough to allow MergePolicies to
> > be plugged in is a good thing.
> >
> > Please DO read the wiki.
> >
> > -John
> >
> > On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> > <jason.rutherg...@gmail.com> wrote:
> >>
> >> I'm not sure I communicated the idea properly. If CMS is set to
> >> 1 thread, then no matter how CPU intensive a merge is, it's
> >> limited to 1 core of what is in many cases a 4 or 8 core server.
> >> That leaves the other 3 or 7 cores for queries, which, if slow,
> >> indicates that it isn't the merging that's slowing down queries,
> >> but the dumping of the queried segments from the system IO cache.
> >>
> >> This holds true regardless of the merge policy used. So while a
> >> new merge policy sounds great, unless the system IO cache
> >> problem is solved, there will always be a lingering problem with
> >> large merges on a regularly updated index. Avoiding
> >> large merges probably isn't the answer, and
> >> LogByteSizeMergePolicy already somewhat allows managing the size
> >> of the segments merged. I would personally prefer being able to
> >> merge segments up to a given estimated size, which requires
> >> LUCENE-1076 to do well.
> >>
> >> > is rather different from Lucene benchmark as we are testing
> >> > high updates in a realtime environment
> >>
> >> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
> >> place to start.
> >>
> >> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.w...@gmail.com> wrote:
> >> > Jason:
> >> >
> >> > Before jumping to any conclusions, let me describe the test setup.
> >> > It is rather different from the Lucene benchmark, as we are testing
> >> > high updates in a realtime environment:
> >> >
> >> > We took a public corpus, Medline, and indexed approximately 3 million
> >> > docs. We then updated all the docs over and over again for a 10 hour
> >> > duration.
> >> >
> >> > The only differences in the code used were where the different
> >> > MergePolicy settings were applied.
> >> >
> >> > To take the variable of HW/OS out of the equation, let's ignore the
> >> > absolute numbers and compare the relative numbers between the two runs.
> >> >
> >> > The spike is due to merging of a large segment when we accumulate. The
> >> > graph/perf numbers fit our hypothesis that the default MergePolicy
> >> > chooses to merge small segments before large ones and does not handle
> >> > segments with a high number of deletes well.
> >> >
> >> > Merging is BOTH IO and CPU intensive, especially large merges.
> >> >
> >> > I think the wiki explains it pretty well.
> >> >
> >> > What you are saying is true about the IO cache w.r.t. merging. Every
> >> > time new files are created, old files in the IO cache are invalidated.
> >> > As the experiment shows, this is detrimental to query performance when
> >> > large segments are being merged.
> >> >
> >> > "As we move to a sharded model of indexes, large merges will
> >> > naturally not occur." Our test is on a 3 million document index, not
> >> > very large for a single shard. Some Katta people have run it on a much,
> >> > much larger index per shard. Saying large merges will not occur on
> >> > indexes of this size is, IMHO, unfounded.
> >> >
> >> > -John
> >> >
> >> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> >> > <jason.rutherg...@gmail.com> wrote:
> >> >>
> >> >> John,
> >> >>
> >> >> It would be great if Lucene's benchmark were used so everyone
> >> >> could execute the test in their own environment and verify.
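The delete-handling behavior under discussion can be illustrated with the size calibration that setCalibrateSizeByDeletes enables. This is a conceptual sketch of the idea (pro-rating a segment's size by its live-doc ratio), not Lucene's exact LogMergePolicy code:

```java
// Conceptual sketch of calibrating segment size by deletes: a segment's
// effective size is discounted by the fraction of its documents that are
// deleted. A large segment dominated by deletes then looks small to the
// merge policy, so it competes with genuinely small segments for merging
// and its deleted docs get reclaimed instead of lingering.
public class CalibratedSize {

    static long calibratedSize(long sizeInBytes, int docCount, int delCount) {
        if (docCount <= 0) {
            return sizeInBytes;
        }
        double delRatio = (double) delCount / docCount;
        return (long) (sizeInBytes * (1.0 - delRatio));
    }

    public static void main(String[] args) {
        // A 1 GB segment where 75% of the docs are deleted is treated as
        // roughly a 256 MB segment, making it eligible for merging.
        System.out.println(calibratedSize(1_073_741_824L, 3_000_000, 2_250_000));
    }
}
```

Without such calibration, a size-based policy keeps deferring the merge of a large, delete-heavy segment, which matches the spike behavior described above for high-update workloads.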
> >> >> It's not clear what settings or code were used to generate the
> >> >> results, so it's difficult to draw any reliable conclusions.
> >> >>
> >> >> The steep spike shows greater evidence for the IO cache being
> >> >> cleared during large merges, resulting in search performance
> >> >> degradation. See:
> >> >> http://www.lucidimagination.com/search/?q=madvise
> >> >>
> >> >> Merging is IO intensive and less CPU intensive, but if the
> >> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> >> >> then the CPU could be maxed out. Using a single thread on
> >> >> synchronous spinning magnetic media seems more logical. Queries
> >> >> are usually the inverse: CPU intensive, not IO intensive, when
> >> >> the index is in the IO cache. After merging a large segment (or
> >> >> during), queries would start hitting disk, and the results
> >> >> clearly show that. The queries are suddenly more time consuming
> >> >> as they seek on disk at a time when IO activity is at its peak
> >> >> from merging large segments. Using madvise would prevent usable
> >> >> indexes from being swapped to disk during a merge, and query
> >> >> performance would continue unabated.
> >> >>
> >> >> As we move to a sharded model of indexes, large merges will
> >> >> naturally not occur. Shards will reach a specified size and new
> >> >> documents will be sent to new shards.
> >> >>
> >> >> -J
> >> >>
> >> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.w...@gmail.com>
> >> >> wrote:
> >> >> > The current default Lucene MergePolicy does not handle frequent
> >> >> > updates well.
> >> >> >
> >> >> > We have done some performance analysis with that and a custom merge
> >> >> > policy:
> >> >> >
> >> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >> >> >
> >> >> > -John
> >> >> >
> >> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> >> >> > jason.rutherg...@gmail.com> wrote:
> >> >> >
> >> >> >> I opened SOLR-1447 for this
> >> >> >>
> >> >> >> 2009/9/18 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
> >> >> >> > We can use a simple reflection based implementation to simplify
> >> >> >> > reading too many parameters.
> >> >> >> >
> >> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> >> >> > altogether. It should only be aware of specific Objects and
> >> >> >> > interfaces. If users wish to plug in something else in some other
> >> >> >> > way, it should be fine.
> >> >> >> >
> >> >> >> > There is a huge learning curve involved in the current
> >> >> >> > solrconfig.xml. Let us not make people throw that away.
> >> >> >> >
> >> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> >> >> > <jason.rutherg...@gmail.com> wrote:
> >> >> >> >> Over the weekend I may write a patch to allow simple reflection
> >> >> >> >> based injection from within solrconfig.
> >> >> >> >>
> >> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> >> >> <yo...@lucidimagination.com> wrote:
> >> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >> >> >>> <shalinman...@gmail.com> wrote:
> >> >> >> >>>>> I was wondering if there is a way I can modify
> >> >> >> >>>>> calibrateSizeByDeletes just by configuration?
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
> >> >> >> >>>> in the constructor.
> >> >> >> >>>> However, please open a Jira issue so we don't forget about it.
> >> >> >> >>>
> >> >> >> >>> It's the continuing stuff like this that makes me feel like we
> >> >> >> >>> should be Spring (or equivalent) based someday... I'm just not
> >> >> >> >>> sure how we're going to get there.
> >> >> >> >>>
> >> >> >> >>> -Yonik
> >> >> >> >>> http://www.lucidimagination.com
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > -----------------------------------------------------
> >> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >
> >
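Shalin's suggested workaround can be sketched as below. The base class here is a stand-in so the example compiles standalone; the real subclass would extend Lucene's org.apache.lucene.index.LogByteSizeMergePolicy, and this only illustrates the constructor-flips-the-flag pattern:

```java
// Stand-in for Lucene's LogByteSizeMergePolicy, reduced to the one
// setting this workaround cares about (the real class lives in
// org.apache.lucene.index and has many more knobs).
class LogByteSizeMergePolicy {
    private boolean calibrateSizeByDeletes = false;

    public void setCalibrateSizeByDeletes(boolean calibrate) {
        this.calibrateSizeByDeletes = calibrate;
    }

    public boolean getCalibrateSizeByDeletes() {
        return calibrateSizeByDeletes;
    }
}

// The subclass Shalin describes: enable calibrateSizeByDeletes in the
// constructor so Solr can instantiate the policy by class name from
// solrconfig.xml without needing setter injection.
public class DeleteCalibratedMergePolicy extends LogByteSizeMergePolicy {
    public DeleteCalibratedMergePolicy() {
        setCalibrateSizeByDeletes(true);
    }
}
```

This kind of one-line subclass is exactly the boilerplate that the reflection-based injection discussed earlier in the thread would make unnecessary.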