Re: Solr Merge during off peak times
On 5/4/2012 8:10 PM, Lance Norskog wrote:
> Optimize takes a 'maxSegments' option. This tells it to stop when there
> are N segments instead of just one. If you use a very high mergeFactor
> and then call optimize with a sane number like 50, it only merges the
> little teeny segments.

When I optimize, I want only one segment. My main concern in doing occasional optimizes is removing deleted documents. Whatever speedup I get from having only one segment is just a nice bonus.

When it comes to only merging the small segments, I am concerned about that happening when regular indexing builds up enough segments to do a merge. If I start with one large optimized segment, then do indexing operations such that I reach segmentsPerTier, will it leave the large segment alone and just work on the little ones?

I am using Solr 3.5 with the following config: 35 35 105

Thanks,
Shawn
Re: Solr Merge during off peak times
Optimize takes a 'maxSegments' option. This tells it to stop when there are N segments instead of just one. If you use a very high mergeFactor and then call optimize with a sane number like 50, it only merges the little teeny segments.

On Thu, May 3, 2012 at 8:28 PM, Shawn Heisey wrote:
> On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:
>>
>> We have a fairly large scale system - about 200 million docs and fairly
>> high indexing activity - about 300k docs per day with peak ingestion rates
>> of about 20 docs per sec. I want to work out what a good mergeFactor setting
>> would be by testing with different mergeFactor settings. I think the default
>> of 10 might be high, I want to try with 5 and compare. Unless I know when a
>> merge starts and finishes, it would be quite difficult to work out the
>> impact of changing mergeFactor. I want to be able to measure how long merges
>> take, run queries during the merge activity and see what the response times
>> are etc..
>
> With a lot of indexing activity, if you are attempting to avoid large
> merges, I would think you would want a higher mergeFactor, not a lower one,
> and do occasional optimizes during non-peak hours. With a small
> mergeFactor, you will be merging a lot more often, and you are more likely
> to encounter merges of already-merged segments, which can be very slow.
>
> My index is nearing 70 million documents. I've got seven shards - six large
> indexes with about 11.5 million docs each, and a small index that I try to
> keep below half a million documents. The small index contains the newest
> documents, between 3.5 and 7 days worth. With this setup and the way I
> manage it, large merges pretty much never happen.
>
> Once a minute, I do an update cycle. This looks for and applies deletions,
> reinserts, and new document inserts. New document inserts happen only on
> the small index, and there are usually a few dozen documents to insert on
> each update cycle. Deletions and reinserts can happen on any of the seven
> shards, but there are not usually deletions and reinserts on every update
> cycle, and the number of reinserts is usually very very small. Once an
> hour, I optimize the small index, which takes about 30 seconds. Once a day,
> I optimize one of the large indexes during non-peak hours, so every large
> index gets optimized once every six days. This takes about 15 minutes,
> during which deletes and reinserts are not applied, but new document inserts
> continue to happen.
>
> My mergeFactor is set to 35. I wanted a large value here, and this
> particular number has a side effect -- uniformity in segment filenames on
> the disk during full rebuilds. Lucene uses a base-36 segment numbering
> scheme. I usually end up with less than 10 segments in the larger indexes,
> which means they don't do merges. The small index does do merges, but I
> have never had a problem with those merges going slowly.
>
> Because I do occasionally optimize, I am fairly sure that even when I do
> have merges, they happen with 35 very small segment files, and leave the
> large initial segment alone. I have not tested this theory, but it seems
> the most sensible way to do things, and I've found that Lucene/Solr usually
> does things in a sensible manner. If I am wrong here (using 3.5 and its
> improved merging), I would appreciate knowing.
>
> Thanks,
> Shawn

--
Lance Norskog
goks...@gmail.com
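A minimal sketch of the partial optimize described in this message, assuming a Solr update handler that accepts the optimize and maxSegments request parameters. The base URL and the helper name optimize_url are hypothetical, and the function only builds the URL; it does not contact a server.

```python
from urllib.parse import urlencode


def optimize_url(base_url, max_segments=None):
    """Build an update URL that asks Solr to optimize the index.

    base_url is a hypothetical core URL, e.g. http://localhost:8983/solr.
    When max_segments is given, the optimize is asked to stop once the
    index has at most that many segments instead of merging down to one.
    """
    params = {"optimize": "true"}
    if max_segments is not None:
        params["maxSegments"] = str(max_segments)
    return base_url.rstrip("/") + "/update?" + urlencode(params)


# Partial optimize that stops at 50 segments, as suggested above.
print(optimize_url("http://localhost:8983/solr", max_segments=50))
```

Leaving max_segments unset corresponds to the full optimize down to a single segment.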
Re: Solr Merge during off peak times
On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:
> We have a fairly large scale system - about 200 million docs and fairly
> high indexing activity - about 300k docs per day with peak ingestion rates
> of about 20 docs per sec. I want to work out what a good mergeFactor setting
> would be by testing with different mergeFactor settings. I think the default
> of 10 might be high, I want to try with 5 and compare. Unless I know when a
> merge starts and finishes, it would be quite difficult to work out the
> impact of changing mergeFactor. I want to be able to measure how long merges
> take, run queries during the merge activity and see what the response times
> are etc..

With a lot of indexing activity, if you are attempting to avoid large merges, I would think you would want a higher mergeFactor, not a lower one, and do occasional optimizes during non-peak hours. With a small mergeFactor, you will be merging a lot more often, and you are more likely to encounter merges of already-merged segments, which can be very slow.

My index is nearing 70 million documents. I've got seven shards - six large indexes with about 11.5 million docs each, and a small index that I try to keep below half a million documents. The small index contains the newest documents, between 3.5 and 7 days worth. With this setup and the way I manage it, large merges pretty much never happen.

Once a minute, I do an update cycle. This looks for and applies deletions, reinserts, and new document inserts. New document inserts happen only on the small index, and there are usually a few dozen documents to insert on each update cycle. Deletions and reinserts can happen on any of the seven shards, but there are not usually deletions and reinserts on every update cycle, and the number of reinserts is usually very very small. Once an hour, I optimize the small index, which takes about 30 seconds. Once a day, I optimize one of the large indexes during non-peak hours, so every large index gets optimized once every six days. This takes about 15 minutes, during which deletes and reinserts are not applied, but new document inserts continue to happen.

My mergeFactor is set to 35. I wanted a large value here, and this particular number has a side effect -- uniformity in segment filenames on the disk during full rebuilds. Lucene uses a base-36 segment numbering scheme. I usually end up with less than 10 segments in the larger indexes, which means they don't do merges. The small index does do merges, but I have never had a problem with those merges going slowly.

Because I do occasionally optimize, I am fairly sure that even when I do have merges, they happen with 35 very small segment files, and leave the large initial segment alone. I have not tested this theory, but it seems the most sensible way to do things, and I've found that Lucene/Solr usually does things in a sensible manner. If I am wrong here (using 3.5 and its improved merging), I would appreciate knowing.

Thanks,
Shawn
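To illustrate the base-36 segment numbering mentioned above: segment names are an underscore followed by a counter written in base 36 (digits 0-9, then a-z). The sketch below reproduces that naming. The helper segment_name is hypothetical and is not a Lucene API.

```python
import string

# 36 symbols: 0-9 followed by a-z.
DIGITS = string.digits + string.ascii_lowercase


def segment_name(counter):
    """Return the segment name for a non-negative counter, e.g. 35 -> '_z'."""
    if counter < 0:
        raise ValueError("counter must be non-negative")
    if counter == 0:
        return "_0"
    out = []
    while counter:
        counter, rem = divmod(counter, 36)
        out.append(DIGITS[rem])
    return "_" + "".join(reversed(out))


# First 35 names, the 36th name, and the first two-character name.
print(segment_name(0), segment_name(34), segment_name(35), segment_name(36))
```

Assuming every flush and every merge takes the next counter value, 35 flushed segments would be named _0 through _y and their merged segment _z, then _10 through _1y followed by _1z, and so on. That may be the filename uniformity described above, but it is an inference, not something the message spells out.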
Re: Solr Merge during off peak times
Ahhh, you're right. Shows what happens when I work from memory.

Thanks.
Erick

On Wed, May 2, 2012 at 4:26 PM, Jason Rutherglen wrote:
>> BTW, in 4.0, there's DocumentWriterPerThread that
>> merges in the background
>
> It flushes without pausing, but does not perform merges. Maybe you're
> thinking of ConcurrentMergeScheduler?
RE: Solr Merge during off peak times
Great, thanks Otis and Erick for your responses.
I will take a look at SPM.

Thanks
Prabhu

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: 03 May 2012 00:02
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Hello Prabhu,

Look at SPM for Solr (URL in sig below). It includes Index Statistics graphs, and from these graphs you can tell:

* how many docs are in your index
* how many docs are deleted
* size of index on disk
* number of index segments
* number of index files
* maybe something else I'm forgetting now

So from size, # of segments, and index files you will be able to tell when merges happened and before/after size, segment and index file count.

Otis
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
Re: Solr Merge during off peak times
Hello Prabhu,

Look at SPM for Solr (URL in sig below). It includes Index Statistics graphs, and from these graphs you can tell:

* how many docs are in your index
* how many docs are deleted
* size of index on disk
* number of index segments
* number of index files
* maybe something else I'm forgetting now

So from size, # of segments, and index files you will be able to tell when merges happened and before/after size, segment and index file count.

Otis
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

> From: "Prakashganesh, Prabhu"
> To: "solr-user@lucene.apache.org"; Otis Gospodnetic
> Sent: Wednesday, May 2, 2012 7:22 AM
> Subject: RE: Solr Merge during off peak times
>
> Ok, thanks Otis
> Another question on merging
> What is the best way to monitor merging?
> Is there something in the log file that I can look for?
> It seems like I have to monitor the system resources - read/write IOPS etc..
> and work out when a merge happened
> It would be great if I can do it by looking at log files or in the admin UI.
> Do you know if this can be done or if there is some tool for this?
>
> Thanks
> Prabhu
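The idea of reading merges off the segment-count graph can be sketched as follows. This assumes you already collect periodic (timestamp, segment count) samples from some monitoring source. The function name merge_events and the sample numbers are made up for illustration.

```python
def merge_events(samples):
    """Find intervals in which the segment count dropped.

    samples is a time-ordered list of (timestamp, segment_count) pairs.
    Returns a list of (t_before, t_after, count_before, count_after)
    tuples, one per interval where the count went down. A drop suggests
    that at least one merge completed in that interval.
    """
    events = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if c1 < c0:
            events.append((t0, t1, c0, c1))
    return events


# Hypothetical samples taken once a minute (timestamps in seconds).
samples = [(0, 8), (60, 9), (120, 10), (180, 3), (240, 4)]
for event in merge_events(samples):
    print(event)
```

A coarse sampling interval can hide a small merge if new segments are flushed in the same interval, so this gives only a lower bound on merge activity.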
Re: Solr Merge during off peak times
> BTW, in 4.0, there's DocumentWriterPerThread that
> merges in the background

It flushes without pausing, but does not perform merges. Maybe you're thinking of ConcurrentMergeScheduler?

On Wed, May 2, 2012 at 7:26 AM, Erick Erickson wrote:
> Optimizing is much less important query-speed wise
> than historically, essentially it's not recommended much
> any more.
>
> A significant effect of optimize _used_ to be purging
> obsolete data (i.e. that from deleted docs) from the
> index, but that is now done on merge.
>
> There's no harm in optimizing on off-peak hours, and
> combined with an appropriate merge policy that may make
> indexing a little better (I'm thinking of not doing
> as many massive merges here).
>
> BTW, in 4.0, there's DocumentWriterPerThread that
> merges in the background and pretty much removes
> even this as a motivation for optimizing.
>
> All that said, optimizing isn't _bad_, it's just often
> unnecessary.
>
> Best
> Erick
Re: Solr Merge during off peak times
Optimizing is much less important query-speed wise than historically, essentially it's not recommended much any more.

A significant effect of optimize _used_ to be purging obsolete data (i.e. that from deleted docs) from the index, but that is now done on merge.

There's no harm in optimizing on off-peak hours, and combined with an appropriate merge policy that may make indexing a little better (I'm thinking of not doing as many massive merges here).

BTW, in 4.0, there's DocumentWriterPerThread that merges in the background and pretty much removes even this as a motivation for optimizing.

All that said, optimizing isn't _bad_, it's just often unnecessary.

Best
Erick

On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu wrote:
> Actually we are not thinking of a M/S setup
> We are planning to have x number of shards on N number of servers, each of
> the shard handling both indexing and searching
> The expected query volume is not that high, so don't think we would need to
> replicate to slaves. We think each shard will be able to handle its share of
> the indexing and searching. If we need to scale query capacity in future,
> yeah probably need to do it by replicating each shard to its slaves
>
> I agree autoCommit settings would be good to set up appropriately
>
> Another question I had is pros/cons of optimising the index. We would be
> purging old content every week and am thinking whether to run an index
> optimise in the weekend after purging old data. Because we are going to be
> continuously indexing data which would be mix of adds, updates, deletes, not
> sure if the benefit of optimising would last long enough to be worth doing
> it. Maybe setting a low mergeFactor would be good enough. Optimising makes
> sense if the index is more static, perhaps? Thoughts?
>
> Thanks
> Prabhu
RE: Solr Merge during off peak times
Actually we are not thinking of a M/S setup.
We are planning to have x number of shards on N number of servers, each of the shards handling both indexing and searching.
The expected query volume is not that high, so I don't think we would need to replicate to slaves. We think each shard will be able to handle its share of the indexing and searching. If we need to scale query capacity in future, yeah, we would probably need to do it by replicating each shard to its slaves.

I agree autoCommit settings would be good to set up appropriately.

Another question I had is pros/cons of optimising the index. We would be purging old content every week and I am thinking whether to run an index optimise at the weekend after purging old data. Because we are going to be continuously indexing data which would be a mix of adds, updates, deletes, I am not sure if the benefit of optimising would last long enough to be worth doing it. Maybe setting a low mergeFactor would be good enough. Optimising makes sense if the index is more static, perhaps? Thoughts?

Thanks
Prabhu

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 02 May 2012 13:15
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

But again, with a master/slave setup merging should be relatively benign. And at 200M docs, having a M/S setup is probably indicated.

Here's a good writeup of mergepolicy
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

If you're indexing and searching on a single machine, merging is much less important than how often you commit. If a M/S situation, then your polling interval on the slave is important.

I'd look at commit frequency long before I worried about merging, that's usually where people shoot themselves in the foot - by committing too often.

Overall, your mergeFactor is probably less important than other parts of how you perform indexing/searching, but it does have some effect for sure...

Best
Erick
Re: Solr Merge during off peak times
But again, with a master/slave setup merging should be relatively benign. And at 200M docs, having a M/S setup is probably indicated.

Here's a good writeup of mergepolicy
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

If you're indexing and searching on a single machine, merging is much less important than how often you commit. If a M/S situation, then your polling interval on the slave is important.

I'd look at commit frequency long before I worried about merging, that's usually where people shoot themselves in the foot - by committing too often.

Overall, your mergeFactor is probably less important than other parts of how you perform indexing/searching, but it does have some effect for sure...

Best
Erick

On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu wrote:
> We have a fairly large scale system - about 200 million docs and fairly high
> indexing activity - about 300k docs per day with peak ingestion rates of
> about 20 docs per sec. I want to work out what a good mergeFactor setting
> would be by testing with different mergeFactor settings. I think the default
> of 10 might be high, I want to try with 5 and compare. Unless I know when a
> merge starts and finishes, it would be quite difficult to work out the impact
> of changing mergeFactor. I want to be able to measure how long merges take,
> run queries during the merge activity and see what the response times are
> etc..
>
> Thanks
> Prabhu
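Before measuring on a real index, the effect of mergeFactor can be explored with a toy model. The sketch below is a deliberately simplified log-style simulation, not Lucene's actual merge policy: every flush adds one unit-size segment, and whenever merge_factor segments sit at the same level they are merged into one segment at the next level. The function name simulate and the figure of 1000 flushes are assumptions chosen for illustration.

```python
def simulate(flushes, merge_factor):
    """Toy model of log-style merging.

    Returns (merges, rewritten, segments_left), where:
      merges        = number of merge operations performed
      rewritten     = flushed-segment units copied by those merges
      segments_left = segments remaining after the last flush
    """
    levels = [0]  # levels[i] = number of segments currently at level i
    merges = 0
    rewritten = 0
    for _ in range(flushes):
        levels[0] += 1
        level = 0
        # Cascade: a full level collapses into one segment one level up.
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            merges += 1
            rewritten += merge_factor ** (level + 1)
            level += 1
    return merges, rewritten, sum(levels)


for mf in (5, 10, 35):
    print(mf, simulate(1000, mf))
```

In this model a smaller merge_factor produces more merges and more re-merging of already merged segments, while a larger one leaves more segments in the index between optimizes. Real merge times depend on segment sizes, I/O, and the merge policy in use, so this only shows the direction of the trade-off.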
RE: Solr Merge during off peak times
We have a fairly large scale system - about 200 million docs and fairly high indexing activity - about 300k docs per day with peak ingestion rates of about 20 docs per sec. I want to work out what a good mergeFactor setting would be by testing with different mergeFactor settings. I think the default of 10 might be high, so I want to try with 5 and compare. Unless I know when a merge starts and finishes, it would be quite difficult to work out the impact of changing mergeFactor. I want to be able to measure how long merges take, run queries during the merge activity, and see what the response times are.

Thanks
Prabhu

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 02 May 2012 12:40
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Why do you care? Merging is generally a background process, or are you doing heavy indexing? In a master/slave setup, it's usually not really relevant except that (with 3.x), massive merges may temporarily stop indexing. Is that the problem?

Look at the merge policies, there are configurations that make this less painful.

In trunk, DocumentsWriterPerThread makes merges happen in the background, which helps the long-pause-while-indexing problem.

Best
Erick
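As a rough way to reason about the mergeFactor 5 versus 10 comparison before testing on a real index, here is a toy Python simulation. It assumes a much simplified log-style policy: every flush creates one equal-sized segment, and as soon as mergeFactor segments exist at a level they are merged into one segment at the next level. The function name and the model are illustrative only; this is not Lucene's merge policy code, and real merges also depend on segment sizes, deletions, and the policy in use.

```python
def simulate_merges(flushes, merge_factor):
    """Toy model of log-style merging (an assumption, not Lucene's code).

    Returns (merges, units_rewritten, final_segments), where one "unit"
    is the amount of data written by a single flush.
    """
    levels = [0]          # levels[k] = segments currently at level k
    merges = 0
    units_rewritten = 0
    for _ in range(flushes):
        levels[0] += 1    # each flush adds one small level-0 segment
        k = 0
        # Cascade: a full level is merged into one segment a level up.
        while levels[k] >= merge_factor:
            levels[k] -= merge_factor
            if k + 1 == len(levels):
                levels.append(0)
            levels[k + 1] += 1
            merges += 1
            # A merged segment at level k+1 holds merge_factor**(k+1) units.
            units_rewritten += merge_factor ** (k + 1)
            k += 1
    return merges, units_rewritten, sum(levels)


if __name__ == "__main__":
    for mf in (5, 10):
        print(mf, simulate_merges(1000, mf))
```

Under these assumptions, 1000 flushes produce 249 merges rewriting 3625 units at mergeFactor 5, against 111 merges rewriting 3000 units at mergeFactor 10. The lower setting merges more often and rewrites more data overall, in exchange for keeping fewer segments around between merges.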
Re: Solr Merge during off peak times
Why do you care? Merging is generally a background process, or are you doing heavy indexing? In a master/slave setup, it's usually not really relevant except that (with 3.x), massive merges may temporarily stop indexing. Is that the problem?

Look at the merge policies, there are configurations that make this less painful.

In trunk, DocumentsWriterPerThread makes merges happen in the background, which helps the long-pause-while-indexing problem.

Best
Erick

On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu wrote:
> Ok, thanks Otis
> Another question on merging
> What is the best way to monitor merging?
> Is there something in the log file that I can look for?
> It seems like I have to monitor the system resources - read/write IOPS etc.
> and work out when a merge happened
> It would be great if I can do it by looking at log files or in the admin UI.
> Do you know if this can be done or if there is some tool for this?
>
> Thanks
> Prabhu
RE: Solr Merge during off peak times
Ok, thanks Otis.

Another question on merging: what is the best way to monitor merging? Is there something in the log file that I can look for? It seems like I have to monitor the system resources (read/write IOPS etc.) and work out when a merge happened. It would be great if I could do it by looking at log files or in the admin UI. Do you know if this can be done, or if there is some tool for this?

Thanks
Prabhu

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: 01 May 2012 15:12
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Hi Prabhu,

I don't think such a merge policy exists, but it would be nice to have this option and I imagine it wouldn't be hard to write if you really just base the merge or no merge decision on the time of day (and maybe day of the week).

Note that this should go into Lucene, not Solr, so if you decide to contribute your work, please see http://wiki.apache.org/lucene-java/HowToContribute

Otis

Performance Monitoring for Solr - http://sematext.com/spm
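One crude way to spot merges without watching IOPS is to compare listings of the index directory over time. The Python sketch below assumes Lucene 3.x style file names, where per-segment files start with an underscore followed by the segment name (for example `_a3.fdt`, or `_a3_2.del` for deletions). The helper names and the "two or more segments gone, at least one new" rule are my own heuristic, not anything Solr provides, and old files can linger until open searchers are released, so treat the result as a hint rather than a measurement.

```python
import os


def snapshot(index_dir):
    """Return the set of file names currently in the index directory."""
    return set(os.listdir(index_dir))


def segment_names(files):
    """Reduce names like '_a3.fdt' or '_a3_2.del' to segment names like '_a3'.

    Assumes per-segment files begin with '_' (an assumption about the
    on-disk naming); files such as 'segments_5' are ignored.
    """
    names = set()
    for name in files:
        if name.startswith("_"):
            stem = name.split(".")[0]            # drop the extension
            names.add("_" + stem[1:].split("_")[0])  # drop any generation suffix
    return names


def diff_segments(before, after):
    """Return (added, removed) segment names between two snapshots."""
    old, new = segment_names(before), segment_names(after)
    return sorted(new - old), sorted(old - new)


def looks_like_merge(added, removed):
    """Heuristic: several segments disappeared and at least one appeared."""
    return len(removed) >= 2 and len(added) >= 1
```

Calling `snapshot()` on a timer and logging a timestamp whenever `looks_like_merge()` fires gives approximate merge end times that can be lined up against query response times. If Lucene's IndexWriter infoStream output is enabled for the index, that is likely a more direct source than directory polling.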
Re: Solr Merge during off peak times
Hi Prabhu,

I don't think such a merge policy exists, but it would be nice to have this option and I imagine it wouldn't be hard to write if you really just base the merge or no merge decision on the time of day (and maybe day of the week).

Note that this should go into Lucene, not Solr, so if you decide to contribute your work, please see http://wiki.apache.org/lucene-java/HowToContribute

Otis

Performance Monitoring for Solr - http://sematext.com/spm

> From: "Prakashganesh, Prabhu"
> To: "solr-user@lucene.apache.org"
> Sent: Tuesday, May 1, 2012 8:45 AM
> Subject: Solr Merge during off peak times
>
> Hi,
> I would like to know if there is a way to configure index merge policy in
> solr so that the merging happens during off peak hours. Can you please let me
> know if such a merge policy configuration exists?
>
> Thanks
> Prabhu
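To make the "merge or no merge decision on the time of day (and maybe day of the week)" idea concrete, here is a minimal Python sketch of just the decision function. The window (22:00 to 06:00) and the choice of Saturday and Sunday as off-peak days are invented for illustration, as is the function name. An actual implementation would live inside a Lucene merge policy written in Java, which this sketch does not attempt to show.

```python
from datetime import datetime, time

# Hypothetical off-peak window and days; adjust to the real traffic pattern.
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)
OFF_PEAK_DAYS = {5, 6}  # datetime.weekday(): 5 = Saturday, 6 = Sunday


def merges_allowed(now):
    """Return True if 'now' falls in the off-peak window or on an off-peak day."""
    if now.weekday() in OFF_PEAK_DAYS:
        return True
    t = now.time()
    if OFF_PEAK_START <= OFF_PEAK_END:
        # Window does not cross midnight, e.g. 01:00 to 05:00.
        return OFF_PEAK_START <= t < OFF_PEAK_END
    # Window crosses midnight, e.g. 22:00 to 06:00.
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


if __name__ == "__main__":
    print(merges_allowed(datetime.now()))
```

One design point worth keeping in mind: a policy that refuses all merges during peak hours lets the segment count grow without bound until the window opens, so a real version would probably still allow small merges at any time and defer only the large ones.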