Re: Solr Merge during off peak times

2012-05-05 Thread Shawn Heisey

On 5/4/2012 8:10 PM, Lance Norskog wrote:

> Optimize takes a 'maxSegments' option. This tells it to stop when
> there are N segments instead of just one.
>
> If you use a very high mergeFactor and then call optimize with a sane
> number like 50, it only merges the little teeny segments.


When I optimize, I want only one segment.  My main concern in doing 
occasional optimizes is removing deleted documents.  Whatever speedup I 
get from having only one segment is just a nice bonus.


When it comes to only merging the small segments, I am concerned about 
that happening when regular indexing builds up enough segments to do a 
merge.  If I start with one large optimized segment, then do indexing 
operations such that I reach segmentsPerTier, will it leave the large 
segment alone and just work on the little ones?  I am using Solr 3.5 
with the following config:



35
35
105


Thanks,
Shawn



Re: Solr Merge during off peak times

2012-05-04 Thread Lance Norskog
Optimize takes a 'maxSegments' option. This tells it to stop when
there are N segments instead of just one.

If you use a very high mergeFactor and then call optimize with a sane
number like 50, it only merges the little teeny segments.
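
A minimal sketch of how such a request could be built, assuming Solr's update handler accepts `optimize=true` and `maxSegments=N` as request parameters. The base URL below is hypothetical, and the snippet only constructs the URL; it does not contact a server.

```python
from urllib.parse import urlencode


def build_optimize_url(core_url, max_segments=None):
    """Build an update-handler URL that asks Solr to optimize.

    core_url is a hypothetical base such as http://localhost:8983/solr.
    When max_segments is given, the optimize stops once the index has
    at most that many segments instead of merging down to one.
    """
    params = {"optimize": "true"}
    if max_segments is not None:
        if max_segments < 1:
            raise ValueError("max_segments must be at least 1")
        params["maxSegments"] = str(max_segments)
    return core_url.rstrip("/") + "/update?" + urlencode(params)


if __name__ == "__main__":
    # Partial optimize down to 50 segments, as in the example above.
    print(build_optimize_url("http://localhost:8983/solr", max_segments=50))
```

Leaving `max_segments` unset corresponds to a full optimize down to a single segment.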

On Thu, May 3, 2012 at 8:28 PM, Shawn Heisey  wrote:
> On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:
>>
>> We have a fairly large scale system - about 200 million docs and fairly
>> high indexing activity - about 300k docs per day with peak ingestion rates
>> of about 20 docs per sec. I want to work out what a good mergeFactor setting
>> would be by testing with different mergeFactor settings. I think the default
>> of 10 might be high, I want to try with 5 and compare. Unless I know when a
>> merge starts and finishes, it would be quite difficult to work out the
>> impact of changing mergeFactor. I want to be able to measure how long merges
>> take, run queries during the merge activity and see what the response times
>> are etc..
>
>
> With a lot of indexing activity, if you are attempting to avoid large
> merges, I would think you would want a higher mergeFactor, not a lower one,
> and do occasional optimizes during non-peak hours.  With a small
> mergeFactor, you will be merging a lot more often, and you are more likely
> to encounter merges of already-merged segments, which can be very slow.
>
> My index is nearing 70 million documents.  I've got seven shards - six large
> indexes with about 11.5 million docs each, and a small index that I try to
> keep below half a million documents.  The small index contains the newest
> documents, between 3.5 and 7 days worth.  With this setup and the way I
> manage it, large merges pretty much never happen.
>
> Once a minute, I do an update cycle.  This looks for and applies deletions,
> reinserts, and new document inserts.  New document inserts happen only on
> the small index, and there are usually a few dozen documents to insert on
> each update cycle.  Deletions and reinserts can happen on any of the seven
> shards, but there are not usually deletions and reinserts on every update
> cycle, and the number of reinserts is usually very very small.  Once an
> hour, I optimize the small index, which takes about 30 seconds.  Once a day,
> I optimize one of the large indexes during non-peak hours, so every large
> index gets optimized once every six days.  This takes about 15 minutes,
> during which deletes and reinserts are not applied, but new document inserts
> continue to happen.
>
> My mergeFactor is set to 35.  I wanted a large value here, and this
> particular number has a side effect -- uniformity in segment filenames on
> the disk during full rebuilds.  Lucene uses a base-36 segment numbering
> scheme.  I usually end up with fewer than 10 segments in the larger indexes,
> which means they don't do merges.  The small index does do merges, but I
> have never had a problem with those merges going slowly.
>
> Because I do occasionally optimize, I am fairly sure that even when I do
> have merges, they happen with 35 very small segment files, and leave the
> large initial segment alone.  I have not tested this theory, but it seems
> the most sensible way to do things, and I've found that Lucene/Solr usually
> does things in a sensible manner.  If I am wrong here (using 3.5 and its
> improved merging), I would appreciate knowing.
>
> Thanks,
> Shawn
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Merge during off peak times

2012-05-03 Thread Shawn Heisey

On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:

> We have a fairly large scale system - about 200 million docs and fairly high
> indexing activity - about 300k docs per day with peak ingestion rates of about
> 20 docs per sec. I want to work out what a good mergeFactor setting would be by
> testing with different mergeFactor settings. I think the default of 10 might be
> high, I want to try with 5 and compare. Unless I know when a merge starts and
> finishes, it would be quite difficult to work out the impact of changing
> mergeFactor. I want to be able to measure how long merges take, run queries
> during the merge activity and see what the response times are etc..


With a lot of indexing activity, if you are attempting to avoid large 
merges, I would think you would want a higher mergeFactor, not a lower 
one, and do occasional optimizes during non-peak hours.  With a small 
mergeFactor, you will be merging a lot more often, and you are more 
likely to encounter merges of already-merged segments, which can be very 
slow.
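
The trade-off described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: every flush produces one segment of one "unit", and whenever mergeFactor segments sit at the same level they merge into one segment at the next level. It is not Lucene's LogMergePolicy or TieredMergePolicy, and the function name is made up for the illustration.

```python
def simulate_merges(merge_factor, flushes):
    """Toy level-based merge model.

    Returns (merges, remerges, units_rewritten) where:
      merges          - total number of merges performed
      remerges        - merges whose inputs were already-merged segments
      units_rewritten - total size of all segments written by merges
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    levels = {}  # level -> number of segments currently at that level
    merges = remerges = units_rewritten = 0
    for _ in range(flushes):
        levels[0] = levels.get(0, 0) + 1  # one new small segment per flush
        level = 0
        while levels.get(level, 0) >= merge_factor:
            levels[level] -= merge_factor
            levels[level + 1] = levels.get(level + 1, 0) + 1
            merges += 1
            # A level-n segment holds merge_factor ** n units in this model.
            units_rewritten += merge_factor * merge_factor ** level
            if level > 0:
                remerges += 1
            level += 1
    return merges, remerges, units_rewritten


if __name__ == "__main__":
    for factor in (5, 35):
        print(factor, simulate_merges(factor, flushes=1000))
```

In this model, a small mergeFactor produces many more merges for the same number of flushes, including merges of segments that were themselves merge outputs, while a large mergeFactor merges rarely at the cost of keeping more segments around between merges.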


My index is nearing 70 million documents.  I've got seven shards - six 
large indexes with about 11.5 million docs each, and a small index that 
I try to keep below half a million documents.  The small index contains 
the newest documents, between 3.5 and 7 days worth.  With this setup and 
the way I manage it, large merges pretty much never happen.


Once a minute, I do an update cycle.  This looks for and applies 
deletions, reinserts, and new document inserts.  New document inserts 
happen only on the small index, and there are usually a few dozen 
documents to insert on each update cycle.  Deletions and reinserts can 
happen on any of the seven shards, but there are not usually deletions 
and reinserts on every update cycle, and the number of reinserts is 
usually very very small.  Once an hour, I optimize the small index, 
which takes about 30 seconds.  Once a day, I optimize one of the large 
indexes during non-peak hours, so every large index gets optimized once 
every six days.  This takes about 15 minutes, during which deletes and 
reinserts are not applied, but new document inserts continue to happen.
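
A rotation like this one can be sketched in a few lines. The shard names are hypothetical and the thread does not say how the shard is actually chosen; the sketch simply picks one of six large shards by day number so that each comes up once every six days.

```python
from datetime import date, timedelta

# Hypothetical names for the six large shards.
LARGE_SHARDS = ["shard1", "shard2", "shard3", "shard4", "shard5", "shard6"]


def shard_to_optimize(day, shards=LARGE_SHARDS):
    """Pick the shard to optimize on the given date.

    Using the day ordinal modulo the shard count means consecutive days
    cycle through the shards in order.
    """
    return shards[day.toordinal() % len(shards)]


if __name__ == "__main__":
    start = date(2012, 5, 3)
    for offset in range(6):
        day = start + timedelta(days=offset)
        print(day.isoformat(), shard_to_optimize(day))
```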


My mergeFactor is set to 35.  I wanted a large value here, and this 
particular number has a side effect -- uniformity in segment filenames 
on the disk during full rebuilds.  Lucene uses a base-36 segment 
numbering scheme.  I usually end up with fewer than 10 segments in the 
larger indexes, which means they don't do merges.  The small index does 
do merges, but I have never had a problem with those merges going slowly.
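
One plausible reading of the filename side effect, stated as an assumption rather than something the thread spells out: if 35 flushed segments and the merged segment that follows each take one value from the segment counter, every cycle uses 36 names, which is exactly one base-36 digit rollover. The sketch below assumes segment names are an underscore followed by the counter in base 36.

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # 0-9 then a-z, 36 symbols


def segment_name(counter):
    """Return an underscore plus the counter written in base 36."""
    if counter < 0:
        raise ValueError("counter must be non-negative")
    if counter == 0:
        return "_0"
    out = []
    while counter:
        counter, rem = divmod(counter, 36)
        out.append(DIGITS[rem])
    return "_" + "".join(reversed(out))


if __name__ == "__main__":
    for cycle in range(2):
        base = cycle * 36
        flushed_first = segment_name(base)
        flushed_last = segment_name(base + 34)
        merged = segment_name(base + 35)
        print(f"cycle {cycle}: flushed {flushed_first}..{flushed_last}, merged {merged}")
```

Under that assumption the merged segments of successive cycles would be named `_z`, `_1z`, `_2z`, and so on.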


Because I do occasionally optimize, I am fairly sure that even when I do 
have merges, they happen with 35 very small segment files, and leave the 
large initial segment alone.  I have not tested this theory, but it 
seems the most sensible way to do things, and I've found that 
Lucene/Solr usually does things in a sensible manner.  If I am wrong 
here (using 3.5 and its improved merging), I would appreciate knowing.


Thanks,
Shawn



Re: Solr Merge during off peak times

2012-05-03 Thread Erick Erickson
Ahhh, you're right. Shows what happens when I work from memory.

Thanks.
Erick

On Wed, May 2, 2012 at 4:26 PM, Jason Rutherglen
 wrote:
>> BTW, in 4.0, there's DocumentsWriterPerThread that
>> merges in the background
>
> It flushes without pausing, but does not perform merges.  Maybe you're
> thinking of ConcurrentMergeScheduler?
>
> On Wed, May 2, 2012 at 7:26 AM, Erick Erickson  
> wrote:
>> Optimizing is much less important query-speed wise
>> than historically, essentially it's not recommended much
>> any more.
>>
>> A significant effect of optimize _used_ to be purging
>> obsolete data (i.e. that from deleted docs) from the
>> index, but that is now done on merge.
>>
>> There's no harm in optimizing on off-peak hours, and
>> combined with an appropriate merge policy that may make
>> indexing a little better (I'm thinking of not doing
>> as many massive merges here).
>>
>> BTW, in 4.0, there's DocumentsWriterPerThread that
>> merges in the background and pretty much removes
>> even this as a motivation for optimizing.
>>
>> All that said, optimizing isn't _bad_, it's just often
>> unnecessary.
>>
>> Best
>> Erick
>>
>> On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu
>>  wrote:
>>> Actually we are not thinking of a M/S setup
>>> We are planning to have x number of shards on N number of servers, each of 
>>> the shard handling both indexing and searching
>>> The expected query volume is not that high, so don't think we would need to 
>>> replicate to slaves. We think each shard will be able to handle its share 
>>> of the indexing and searching. If we need to scale query capacity in 
>>> future, yeah probably need to do it by replicating each shard to its slaves
>>>
>>> I agree autoCommit settings would be good to set up appropriately
>>>
>>> Another question I had is pros/cons of optimising the index. We would be 
>>> purging old content every week and am thinking whether to run an index 
>>> optimise in the weekend after purging old data. Because we are going to be 
>>> continuously indexing data which would be mix of adds, updates, deletes, 
>>> not sure if the benefit of optimising would last long enough to be worth 
>>> doing it. Maybe setting a low mergeFactor would be good enough. Optimising 
>>> makes sense if the index is more static, perhaps? Thoughts?
>>>
>>> Thanks
>>> Prabhu
>>>
>>>
>>> -Original Message-
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: 02 May 2012 13:15
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr Merge during off peak times
>>>
>>> But again, with a master/slave setup merging should
>>> be relatively benign. And at 200M docs, having a M/S
>>> setup is probably indicated.
>>>
>>> Here's a good writeup of mergepolicy
>>> http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/
>>>
>>> If you're indexing and searching on a single machine, merging
>>> is much less important than how often you commit. In an M/S
>>> situation, your polling interval on the slave is important.
>>>
>>> I'd look at commit frequency long before I worried about merging,
>>> that's usually where people shoot themselves in the foot - by
>>> committing too often.
>>>
>>> Overall, your mergeFactor is probably less important than other
>>> parts of how you perform indexing/searching, but it does have
>>> some effect for sure...
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
>>>  wrote:
>>>> We have a fairly large scale system - about 200 million docs and fairly 
>>>> high indexing activity - about 300k docs per day with peak ingestion rates 
>>>> of about 20 docs per sec. I want to work out what a good mergeFactor 
>>>> setting would be by testing with different mergeFactor settings. I think 
>>>> the default of 10 might be high, I want to try with 5 and compare. Unless 
>>>> I know when a merge starts and finishes, it would be quite difficult to 
>>>> work out the impact of changing mergeFactor. I want to be able to measure 
>>>> how long merges take, run queries during the merge activity and see what 
>>>> the response times are etc..
>>>>
>>>> Thanks

RE: Solr Merge during off peak times

2012-05-03 Thread Prakashganesh, Prabhu
Great, thanks Otis and Erick for your responses
I will take a look at SPM

Thanks
Prabhu

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 03 May 2012 00:02
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Hello Prabhu,

Look at SPM for Solr (URL in sig below).  It includes Index Statistics graphs, 
and from these graphs you can tell:

* how many docs are in your index
* how many docs are deleted
* size of index on disk
* number of index segments
* number of index files
* maybe something else I'm forgetting now

So from size, # of segments, and index files you will be able to tell when 
merges happened and before/after size, segment and index file count.

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




>
> From: "Prakashganesh, Prabhu" 
>To: "solr-user@lucene.apache.org" ; Otis 
>Gospodnetic  
>Sent: Wednesday, May 2, 2012 7:22 AM
>Subject: RE: Solr Merge during off peak times
> 
>Ok, thanks Otis
>Another question on merging
>What is the best way to monitor merging?
>Is there something in the log file that I can look for? 
>It seems like I have to monitor the system resources - read/write IOPS etc.. 
>and work out when a merge happened
>It would be great if I can do it by looking at log files or in the admin UI. 
>Do you know if this can be done or if there is some tool for this?
>
>Thanks
>Prabhu
>
>-Original Message-
>From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
>Sent: 01 May 2012 15:12
>To: solr-user@lucene.apache.org
>Subject: Re: Solr Merge during off peak times
>
>Hi Prabhu,
>
>I don't think such a merge policy exists, but it would be nice to have this 
>option and I imagine it wouldn't be hard to write if you really just base the 
>merge or no merge decision on the time of day (and maybe day of the week).
>
>Note that this should go into Lucene, not Solr, so if you decide to contribute 
>your work, please see http://wiki.apache.org/lucene-java/HowToContribute
>
>Otis
>
>Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org"  
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>> 
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>
>
>
>


Re: Solr Merge during off peak times

2012-05-02 Thread Otis Gospodnetic
Hello Prabhu,

Look at SPM for Solr (URL in sig below).  It includes Index Statistics graphs, 
and from these graphs you can tell:

* how many docs are in your index
* how many docs are deleted
* size of index on disk
* number of index segments
* number of index files
* maybe something else I'm forgetting now

So from size, # of segments, and index files you will be able to tell when 
merges happened and before/after size, segment and index file count.
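
As a rough illustration of reading merges off such statistics, the sketch below scans periodic samples of the segment count and flags every drop as a likely merge. The sample data and the function name are hypothetical; how the counts are collected (a monitoring tool, or listing the index directory) is left open.

```python
def find_merge_events(samples):
    """Return (timestamp, before, after) for each drop in segment count.

    samples is a list of (timestamp, segment_count) pairs in time order.
    A drop between two consecutive samples suggests a merge finished in
    between; a rise just means new segments were flushed.
    """
    events = []
    for (_, prev_count), (ts, count) in zip(samples, samples[1:]):
        if count < prev_count:
            events.append((ts, prev_count, count))
    return events


if __name__ == "__main__":
    # Hypothetical one-minute samples: (time, number of segments).
    samples = [
        ("10:00", 12), ("10:01", 13), ("10:02", 14),
        ("10:03", 5), ("10:04", 6), ("10:05", 6), ("10:06", 2),
    ]
    for ts, before, after in find_merge_events(samples):
        print(f"{ts}: segments went from {before} to {after}")
```

The sampling interval limits the precision: a merge is only known to have finished somewhere between two samples, so timing merge duration this way needs fairly frequent samples.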

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




>
> From: "Prakashganesh, Prabhu" 
>To: "solr-user@lucene.apache.org" ; Otis 
>Gospodnetic  
>Sent: Wednesday, May 2, 2012 7:22 AM
>Subject: RE: Solr Merge during off peak times
> 
>Ok, thanks Otis
>Another question on merging
>What is the best way to monitor merging?
>Is there something in the log file that I can look for? 
>It seems like I have to monitor the system resources - read/write IOPS etc.. 
>and work out when a merge happened
>It would be great if I can do it by looking at log files or in the admin UI. 
>Do you know if this can be done or if there is some tool for this?
>
>Thanks
>Prabhu
>
>-Original Message-
>From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
>Sent: 01 May 2012 15:12
>To: solr-user@lucene.apache.org
>Subject: Re: Solr Merge during off peak times
>
>Hi Prabhu,
>
>I don't think such a merge policy exists, but it would be nice to have this 
>option and I imagine it wouldn't be hard to write if you really just base the 
>merge or no merge decision on the time of day (and maybe day of the week).
>
>Note that this should go into Lucene, not Solr, so if you decide to contribute 
>your work, please see http://wiki.apache.org/lucene-java/HowToContribute
>
>Otis
>
>Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org"  
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>> 
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>
>
>
>
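
The time-of-day decision suggested in the quoted reply above could look roughly like this. It is only the decision logic, sketched in Python with made-up names and window times; an actual merge policy would have to be written in Java against Lucene's merge policy classes, which this sketch does not attempt.

```python
from datetime import time


def in_off_peak_window(now, start=time(1, 0), end=time(5, 0)):
    """Return True if `now` falls inside the off-peak window [start, end).

    The window may wrap around midnight, e.g. start=22:00, end=04:00.
    The default window of 01:00 to 05:00 is an arbitrary example.
    """
    if start <= end:
        return start <= now < end
    # Window crosses midnight: late evening or early morning both count.
    return now >= start or now < end


if __name__ == "__main__":
    for t in (time(2, 30), time(12, 0), time(23, 15)):
        allowed = in_off_peak_window(t, start=time(22, 0), end=time(4, 0))
        print(t.strftime("%H:%M"), "merge allowed" if allowed else "defer merge")
```

A day-of-week condition, which the reply also mentions, could be added as a second check alongside the time window.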

Re: Solr Merge during off peak times

2012-05-02 Thread Jason Rutherglen
> BTW, in 4.0, there's DocumentsWriterPerThread that
> merges in the background

It flushes without pausing, but does not perform merges.  Maybe you're
thinking of ConcurrentMergeScheduler?

On Wed, May 2, 2012 at 7:26 AM, Erick Erickson  wrote:
> Optimizing is much less important query-speed wise
> than historically, essentially it's not recommended much
> any more.
>
> A significant effect of optimize _used_ to be purging
> obsolete data (i.e. that from deleted docs) from the
> index, but that is now done on merge.
>
> There's no harm in optimizing on off-peak hours, and
> combined with an appropriate merge policy that may make
> indexing a little better (I'm thinking of not doing
> as many massive merges here).
>
> BTW, in 4.0, there's DocumentsWriterPerThread that
> merges in the background and pretty much removes
> even this as a motivation for optimizing.
>
> All that said, optimizing isn't _bad_, it's just often
> unnecessary.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu
>  wrote:
>> Actually we are not thinking of a M/S setup
>> We are planning to have x number of shards on N number of servers, each of 
>> the shard handling both indexing and searching
>> The expected query volume is not that high, so don't think we would need to 
>> replicate to slaves. We think each shard will be able to handle its share of 
>> the indexing and searching. If we need to scale query capacity in future, 
>> yeah probably need to do it by replicating each shard to its slaves
>>
>> I agree autoCommit settings would be good to set up appropriately
>>
>> Another question I had is pros/cons of optimising the index. We would be 
>> purging old content every week and am thinking whether to run an index 
>> optimise in the weekend after purging old data. Because we are going to be 
>> continuously indexing data which would be mix of adds, updates, deletes, not 
>> sure if the benefit of optimising would last long enough to be worth doing 
>> it. Maybe setting a low mergeFactor would be good enough. Optimising makes 
>> sense if the index is more static, perhaps? Thoughts?
>>
>> Thanks
>> Prabhu
>>
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: 02 May 2012 13:15
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> But again, with a master/slave setup merging should
>> be relatively benign. And at 200M docs, having a M/S
>> setup is probably indicated.
>>
>> Here's a good writeup of mergepolicy
>> http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/
>>
>> If you're indexing and searching on a single machine, merging
>> is much less important than how often you commit. In an M/S
>> situation, your polling interval on the slave is important.
>>
>> I'd look at commit frequency long before I worried about merging,
>> that's usually where people shoot themselves in the foot - by
>> committing too often.
>>
>> Overall, your mergeFactor is probably less important than other
>> parts of how you perform indexing/searching, but it does have
>> some effect for sure...
>>
>> Best
>> Erick
>>
>> On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
>>  wrote:
>>> We have a fairly large scale system - about 200 million docs and fairly 
>>> high indexing activity - about 300k docs per day with peak ingestion rates 
>>> of about 20 docs per sec. I want to work out what a good mergeFactor 
>>> setting would be by testing with different mergeFactor settings. I think 
>>> the default of 10 might be high, I want to try with 5 and compare. Unless I 
>>> know when a merge starts and finishes, it would be quite difficult to work 
>>> out the impact of changing mergeFactor. I want to be able to measure how 
>>> long merges take, run queries during the merge activity and see what the 
>>> response times are etc..
>>>
>>> Thanks
>>> Prabhu
>>>
>>> -Original Message-
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: 02 May 2012 12:40
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr Merge during off peak times
>>>
>>> Why do you care? Merging is generally a background process, or are
>>> you doing heavy indexing? In a master/slave setup,
>>> it's usually not really relevant except that (with 3.x), massive merges
>>> may

Re: Solr Merge during off peak times

2012-05-02 Thread Erick Erickson
Optimizing is much less important query-speed wise
than historically; essentially, it's not recommended much
any more.

A significant effect of optimize _used_ to be purging
obsolete data (i.e. that from deleted docs) from the
index, but that is now done on merge.

There's no harm in optimizing on off-peak hours, and
combined with an appropriate merge policy that may make
indexing a little better (I'm thinking of not doing
as many massive merges here).

BTW, in 4.0, there's DocumentsWriterPerThread that
merges in the background and pretty much removes
even this as a motivation for optimizing.

All that said, optimizing isn't _bad_, it's just often
unnecessary.

Best
Erick

On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu
 wrote:
> Actually we are not thinking of a M/S setup
> We are planning to have x number of shards on N number of servers, each of 
> the shard handling both indexing and searching
> The expected query volume is not that high, so don't think we would need to 
> replicate to slaves. We think each shard will be able to handle its share of 
> the indexing and searching. If we need to scale query capacity in future, 
> yeah probably need to do it by replicating each shard to its slaves
>
> I agree autoCommit settings would be good to set up appropriately
>
> Another question I had is pros/cons of optimising the index. We would be 
> purging old content every week and am thinking whether to run an index 
> optimise in the weekend after purging old data. Because we are going to be 
> continuously indexing data which would be mix of adds, updates, deletes, not 
> sure if the benefit of optimising would last long enough to be worth doing 
> it. Maybe setting a low mergeFactor would be good enough. Optimising makes 
> sense if the index is more static, perhaps? Thoughts?
>
> Thanks
> Prabhu
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 May 2012 13:15
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> But again, with a master/slave setup merging should
> be relatively benign. And at 200M docs, having a M/S
> setup is probably indicated.
>
> Here's a good writeup of mergepolicy
> http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/
>
> If you're indexing and searching on a single machine, merging
> is much less important than how often you commit. In an M/S
> situation, your polling interval on the slave is important.
>
> I'd look at commit frequency long before I worried about merging,
> that's usually where people shoot themselves in the foot - by
> committing too often.
>
> Overall, your mergeFactor is probably less important than other
> parts of how you perform indexing/searching, but it does have
> some effect for sure...
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
>  wrote:
>> We have a fairly large scale system - about 200 million docs and fairly high 
>> indexing activity - about 300k docs per day with peak ingestion rates of 
>> about 20 docs per sec. I want to work out what a good mergeFactor setting 
>> would be by testing with different mergeFactor settings. I think the default 
>> of 10 might be high, I want to try with 5 and compare. Unless I know when a 
>> merge starts and finishes, it would be quite difficult to work out the 
>> impact of changing mergeFactor. I want to be able to measure how long merges 
>> take, run queries during the merge activity and see what the response times 
>> are etc..
>>
>> Thanks
>> Prabhu
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: 02 May 2012 12:40
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> Why do you care? Merging is generally a background process, or are
>> you doing heavy indexing? In a master/slave setup,
>> it's usually not really relevant except that (with 3.x), massive merges
>> may temporarily stop indexing. Is that the problem?
>>
>> Look at the merge policies; there are configurations that make
>> this less painful.
>>
>> In trunk, DocumentsWriterPerThread makes merges happen in the
>> background, which helps the long-pause-while-indexing problem.
>>
>> Best
>> Erick
>>
>> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>>  wrote:
>>> Ok, thanks Otis
>>> Another question on merging
>>> What is the best way to monitor merging?
>>> Is there something in the log file that I can look for?
>>> It seems like I have to monitor

RE: Solr Merge during off peak times

2012-05-02 Thread Prakashganesh, Prabhu
Actually we are not thinking of a M/S setup
We are planning to have x number of shards on N number of servers, each of the 
shard handling both indexing and searching
The expected query volume is not that high, so don't think we would need to 
replicate to slaves. We think each shard will be able to handle its share of 
the indexing and searching. If we need to scale query capacity in future, yeah 
probably need to do it by replicating each shard to its slaves

I agree autoCommit settings would be good to set up appropriately

Another question I had is pros/cons of optimising the index. We would be 
purging old content every week and am thinking whether to run an index optimise 
in the weekend after purging old data. Because we are going to be continuously 
indexing data which would be mix of adds, updates, deletes, not sure if the 
benefit of optimising would last long enough to be worth doing it. Maybe 
setting a low mergeFactor would be good enough. Optimising makes sense if the 
index is more static, perhaps? Thoughts?

Thanks
Prabhu 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 May 2012 13:15
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

But again, with a master/slave setup merging should
be relatively benign. And at 200M docs, having a M/S
setup is probably indicated.

Here's a good writeup of mergepolicy
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

If you're indexing and searching on a single machine, merging
is much less important than how often you commit. In an M/S
situation, your polling interval on the slave is important.

I'd look at commit frequency long before I worried about merging,
that's usually where people shoot themselves in the foot - by
committing too often.

Overall, your mergeFactor is probably less important than other
parts of how you perform indexing/searching, but it does have
some effect for sure...

Best
Erick

On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
 wrote:
> We have a fairly large scale system - about 200 million docs and fairly high 
> indexing activity - about 300k docs per day with peak ingestion rates of 
> about 20 docs per sec. I want to work out what a good mergeFactor setting 
> would be by testing with different mergeFactor settings. I think the default 
> of 10 might be high, I want to try with 5 and compare. Unless I know when a 
> merge starts and finishes, it would be quite difficult to work out the impact 
> of changing mergeFactor. I want to be able to measure how long merges take, 
> run queries during the merge activity and see what the response times are 
> etc..
>
> Thanks
> Prabhu
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 May 2012 12:40
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Why do you care? Merging is generally a background process, or are
> you doing heavy indexing? In a master/slave setup,
> it's usually not really relevant except that (with 3.x), massive merges
> may temporarily stop indexing. Is that the problem?
>
> Look at the merge policies; there are configurations that make
> this less painful.
>
> In trunk, DocumentsWriterPerThread makes merges happen in the
> background, which helps the long-pause-while-indexing problem.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>  wrote:
>> Ok, thanks Otis
>> Another question on merging
>> What is the best way to monitor merging?
>> Is there something in the log file that I can look for?
>> It seems like I have to monitor the system resources - read/write IOPS etc.. 
>> and work out when a merge happened
>> It would be great if I can do it by looking at log files or in the admin UI. 
>> Do you know if this can be done or if there is some tool for this?
>>
>> Thanks
>> Prabhu
>>
>> -Original Message-
>> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> Sent: 01 May 2012 15:12
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> Hi Prabhu,
>>
>> I don't think such a merge policy exists, but it would be nice to have this 
>> option and I imagine it wouldn't be hard to write if you really just base 
>> the merge or no merge decision on the time of day (and maybe day of the 
>> week).
>>
>> Note that this should go into Lucene, not Solr, so if you decide to 
>> contribute your work, please 
>> see http://wiki.apache.org/lucene-java/HowToContribute
>>
>> Otis
>> 
>> Performance Monitoring for Solr - http://sematext.com/spm
>>
>>
>>
>>

Re: Solr Merge during off peak times

2012-05-02 Thread Erick Erickson
But again, with a master/slave setup merging should
be relatively benign. And at 200M docs, having a M/S
setup is probably indicated.

Here's a good writeup of mergepolicy
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

If you're indexing and searching on a single machine, merging
is much less important than how often you commit. In an M/S
situation, your polling interval on the slave is important.

I'd look at commit frequency long before I worried about merging,
that's usually where people shoot themselves in the foot - by
committing too often.

Overall, your mergeFactor is probably less important than other
parts of how you perform indexing/searching, but it does have
some effect for sure...

Best
Erick

On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
 wrote:
> We have a fairly large scale system - about 200 million docs and fairly high 
> indexing activity - about 300k docs per day with peak ingestion rates of 
> about 20 docs per sec. I want to work out what a good mergeFactor setting 
> would be by testing with different mergeFactor settings. I think the default 
> of 10 might be high, I want to try with 5 and compare. Unless I know when a 
> merge starts and finishes, it would be quite difficult to work out the impact 
> of changing mergeFactor. I want to be able to measure how long merges take, 
> run queries during the merge activity and see what the response times are 
> etc..
>
> Thanks
> Prabhu
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 02 May 2012 12:40
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Why do you care? Merging is generally a background process, or are
> you doing heavy indexing? In a master/slave setup,
> it's usually not really relevant except that (with 3.x), massive merges
> may temporarily stop indexing. Is that the problem?
>
> Look at the merge policies; there are configurations that make
> this less painful.
>
> In trunk, DocumentsWriterPerThread makes merges happen in the
> background, which helps the long-pause-while-indexing problem.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>  wrote:
>> Ok, thanks Otis
>> Another question on merging
>> What is the best way to monitor merging?
>> Is there something in the log file that I can look for?
>> It seems like I have to monitor the system resources - read/write IOPS etc.. 
>> and work out when a merge happened
>> It would be great if I can do it by looking at log files or in the admin UI. 
>> Do you know if this can be done or if there is some tool for this?
>>
>> Thanks
>> Prabhu
>>
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>> Sent: 01 May 2012 15:12
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> Hi Prabhu,
>>
>> I don't think such a merge policy exists, but it would be nice to have this 
>> option and I imagine it wouldn't be hard to write if you really just base 
>> the merge or no merge decision on the time of day (and maybe day of the 
>> week).
>>
>> Note that this should go into Lucene, not Solr, so if you decide to 
>> contribute your work, please 
>> see http://wiki.apache.org/lucene-java/HowToContribute
>>
>> Otis
>> 
>> Performance Monitoring for Solr - http://sematext.com/spm
>>
>>
>>
>>
>>>
>>> From: "Prakashganesh, Prabhu" 
>>>To: "solr-user@lucene.apache.org" 
>>>Sent: Tuesday, May 1, 2012 8:45 AM
>>>Subject: Solr Merge during off peak times
>>>
>>>Hi,
>>>  I would like to know if there is a way to configure index merge policy in 
>>>solr so that the merging happens during off peak hours. Can you please let 
>>>me know if such a merge policy configuration exists?
>>>
>>>Thanks
>>>Prabhu
>>>
>>>
>>>


RE: Solr Merge during off peak times

2012-05-02 Thread Prakashganesh, Prabhu
We have a fairly large scale system - about 200 million docs and fairly high 
indexing activity - about 300k docs per day with peak ingestion rates of about 
20 docs per sec. I want to work out what a good mergeFactor setting would be by 
testing with different mergeFactor settings. I think the default of 10 might be 
high, I want to try with 5 and compare. Unless I know when a merge starts and 
finishes, it would be quite difficult to work out the impact of changing 
mergeFactor. I want to be able to measure how long merges take, run queries 
during the merge activity, and see what the response times are, etc.
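
A rough sketch of how the response-time side of such a comparison could be measured, in Python. This is only an illustration under assumptions: `run_query` is a hypothetical callable that sends one query to Solr and waits for the response (nothing in this thread specifies a client), and `time_queries` is a made-up helper name.

```python
import statistics
import time


def time_queries(run_query, queries, repeats=3):
    """Run each query `repeats` times and return latency stats in milliseconds.

    `run_query` is a hypothetical callable that sends one query to Solr
    and blocks until the response arrives.
    """
    latencies = []
    for q in queries:
        for _ in range(repeats):
            started = time.perf_counter()
            run_query(q)
            latencies.append((time.perf_counter() - started) * 1000.0)
    latencies.sort()
    return {
        "count": len(latencies),
        "median_ms": statistics.median(latencies),
        "max_ms": latencies[-1],
    }
```

Running the same query set once while the index is idle and once while a merge is believed to be in progress, for each mergeFactor under test, would give numbers that can be compared directly.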

Thanks
Prabhu

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 May 2012 12:40
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Why do you care? Merging is generally a background process, or are
you doing heavy indexing? In a master/slave setup,
it's usually not really relevant except that (with 3.x), massive merges
may temporarily stop indexing. Is that the problem?

Look at the merge policies; there are configurations that make
this less painful.

In trunk, DocumentsWriterPerThread makes merges happen in the
background, which helps the long-pause-while-indexing problem.

Best
Erick

On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
 wrote:
> Ok, thanks Otis
> Another question on merging
> What is the best way to monitor merging?
> Is there something in the log file that I can look for?
> It seems like I have to monitor the system resources - read/write IOPS etc.. 
> and work out when a merge happened
> It would be great if I can do it by looking at log files or in the admin UI. 
> Do you know if this can be done or if there is some tool for this?
>
> Thanks
> Prabhu
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: 01 May 2012 15:12
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Hi Prabhu,
>
> I don't think such a merge policy exists, but it would be nice to have this 
> option and I imagine it wouldn't be hard to write if you really just base the 
> merge or no merge decision on the time of day (and maybe day of the week).
>
> Note that this should go into Lucene, not Solr, so if you decide to 
> contribute your work, please 
> see http://wiki.apache.org/lucene-java/HowToContribute
>
> Otis
> 
> Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org" 
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>>
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>


Re: Solr Merge during off peak times

2012-05-02 Thread Erick Erickson
Why do you care? Merging is generally a background process, or are
you doing heavy indexing? In a master/slave setup,
it's usually not really relevant except that (with 3.x), massive merges
may temporarily stop indexing. Is that the problem?

Look at the merge policies; there are configurations that make
this less painful.

In trunk, DocumentsWriterPerThread makes merges happen in the
background, which helps the long-pause-while-indexing problem.

Best
Erick

On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
 wrote:
> Ok, thanks Otis
> Another question on merging
> What is the best way to monitor merging?
> Is there something in the log file that I can look for?
> It seems like I have to monitor the system resources - read/write IOPS etc.. 
> and work out when a merge happened
> It would be great if I can do it by looking at log files or in the admin UI. 
> Do you know if this can be done or if there is some tool for this?
>
> Thanks
> Prabhu
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: 01 May 2012 15:12
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Merge during off peak times
>
> Hi Prabhu,
>
> I don't think such a merge policy exists, but it would be nice to have this 
> option and I imagine it wouldn't be hard to write if you really just base the 
> merge or no merge decision on the time of day (and maybe day of the week).
>
> Note that this should go into Lucene, not Solr, so if you decide to 
> contribute your work, please 
> see http://wiki.apache.org/lucene-java/HowToContribute
>
> Otis
> 
> Performance Monitoring for Solr - http://sematext.com/spm
>
>
>
>
>>
>> From: "Prakashganesh, Prabhu" 
>>To: "solr-user@lucene.apache.org" 
>>Sent: Tuesday, May 1, 2012 8:45 AM
>>Subject: Solr Merge during off peak times
>>
>>Hi,
>>  I would like to know if there is a way to configure index merge policy in 
>>solr so that the merging happens during off peak hours. Can you please let me 
>>know if such a merge policy configuration exists?
>>
>>Thanks
>>Prabhu
>>
>>
>>


RE: Solr Merge during off peak times

2012-05-02 Thread Prakashganesh, Prabhu
OK, thanks Otis.
Another question on merging:
What is the best way to monitor merging?
Is there something in the log file that I can look for?
It seems like I have to monitor the system resources (read/write IOPS, etc.)
and work out when a merge happened.
It would be great if I could do it by looking at log files or in the admin UI. Do
you know if this can be done, or if there is some tool for this?
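
One crude way to infer merges without watching IOPS would be to poll the index directory and count segments. The Python sketch below is only an illustration, not an existing Solr or Lucene tool. It assumes the usual Lucene file naming, where per-segment files start with an underscore followed by the segment name (for example `_a.fdt`, `_a_1.del`); the function names are made up for this example.

```python
def count_segments(filenames):
    """Count distinct Lucene segments in a listing of index file names."""
    segments = set()
    for name in filenames:
        # Per-segment files start with '_'; segments_N and segments.gen do not.
        if not name.startswith("_"):
            continue
        stem = name.split(".", 1)[0]               # "_a_1.del" -> "_a_1"
        segment = "_" + stem[1:].split("_", 1)[0]  # "_a_1" -> "_a"
        segments.add(segment)
    return len(segments)


def merge_suspected(previous_count, current_count):
    """A drop in segment count between two polls suggests a merge finished."""
    return current_count < previous_count
```

Calling `count_segments(os.listdir(index_dir))` every few seconds and logging a timestamp whenever `merge_suspected` returns True would give approximate merge completion times. Files from merged-away segments can linger until open searchers release them, so the timing is only approximate.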

Thanks
Prabhu

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 01 May 2012 15:12
To: solr-user@lucene.apache.org
Subject: Re: Solr Merge during off peak times

Hi Prabhu,

I don't think such a merge policy exists, but it would be nice to have this 
option and I imagine it wouldn't be hard to write if you really just base the 
merge or no merge decision on the time of day (and maybe day of the week).

Note that this should go into Lucene, not Solr, so if you decide to contribute 
your work, please see http://wiki.apache.org/lucene-java/HowToContribute

Otis

Performance Monitoring for Solr - http://sematext.com/spm




>
> From: "Prakashganesh, Prabhu" 
>To: "solr-user@lucene.apache.org"  
>Sent: Tuesday, May 1, 2012 8:45 AM
>Subject: Solr Merge during off peak times
> 
>Hi,
>  I would like to know if there is a way to configure index merge policy in 
>solr so that the merging happens during off peak hours. Can you please let me 
>know if such a merge policy configuration exists?
>
>Thanks
>Prabhu
>
>
>


Re: Solr Merge during off peak times

2012-05-01 Thread Otis Gospodnetic
Hi Prabhu,

I don't think such a merge policy exists, but it would be nice to have this 
option and I imagine it wouldn't be hard to write if you really just base the 
merge or no merge decision on the time of day (and maybe day of the week).
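
The decision itself could be as small as the Python sketch below. It only illustrates the time-of-day check, not a Lucene MergePolicy: a real implementation would be Java code inside Lucene. The window boundaries and the `is_off_peak` name are assumptions made for this example.

```python
import datetime

# Hypothetical off-peak window; it wraps past midnight (22:00 to 06:00).
OFF_PEAK_START = datetime.time(22, 0)
OFF_PEAK_END = datetime.time(6, 0)


def is_off_peak(now, start=OFF_PEAK_START, end=OFF_PEAK_END,
                weekend_off_peak=True):
    """Return True if merges should be allowed at the moment `now`."""
    # Optionally treat the whole weekend as off-peak (Monday is 0, Sunday is 6).
    if weekend_off_peak and now.weekday() >= 5:
        return True
    t = now.time()
    if start <= end:
        # Window inside a single day, e.g. 01:00 to 05:00.
        return start <= t < end
    # Window that wraps past midnight, e.g. 22:00 to 06:00.
    return t >= start or t < end
```

A merge policy built on this would decline to select merges whenever the check returns False, which also means segments pile up during peak hours and get merged in a burst once the window opens.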

Note that this should go into Lucene, not Solr, so if you decide to contribute 
your work, please see http://wiki.apache.org/lucene-java/HowToContribute

Otis

Performance Monitoring for Solr - http://sematext.com/spm




>
> From: "Prakashganesh, Prabhu" 
>To: "solr-user@lucene.apache.org"  
>Sent: Tuesday, May 1, 2012 8:45 AM
>Subject: Solr Merge during off peak times
> 
>Hi,
>  I would like to know if there is a way to configure index merge policy in 
>solr so that the merging happens during off peak hours. Can you please let me 
>know if such a merge policy configuration exists?
>
>Thanks
>Prabhu
>
>
>