Re: optimize and mergeFactor

2011-02-16 Thread Markus Jelsma
 In my own Solr 1.4, I am pretty sure that running an index optimize does
 give me significantly better performance. Perhaps because I use some
 largeish (not huge, maybe as large as 200k) stored fields.

200,000 stored fields? I assume that number includes your number of documents? 
Sounds crazy =)

 
 So I'm interested in always keeping my index optimized.
 
 Am I right that if I set mergeFactor to '1', essentially my index will
 always be optimized after every commit, and actually running 'optimize'
 will be redundant?

You can set mergeFactor to 2, not lower. 
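For reference, this is roughly where that knob lives in a stock Solr 1.4 
solrconfig.xml (a sketch; the element also appears under indexDefaults, and 
your file layout may differ):

    <mainIndex>
      <!-- merge as soon as 2 segments of the same level exist; 2 is the minimum -->
      <mergeFactor>2</mergeFactor>
    </mainIndex>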

 
 What are the possible negative repercussions of setting mergeFactor to
 1? Is this a really bad idea?  If not 1, what about some other
 lower-than-usually-recommended value like 2 or 3?  Anyone done this?
 I imagine it will slow down my commits, but if the alternative is
 running optimize a lot anyway, I wonder at what point I reach 'break
 even' (if I optimize after every single commit, clearly I might as well
 just set the mergeFactor low, right? But if I optimize after every X
 documents or Y commits, I don't know what the break-even X/Y are).

This depends on your commit rate and on whether there are a lot of updates and 
deletes instead of just adds. Setting it very low will indeed cause a lot of 
merging and slow commits. It will also make replication very slow, because 
merged files are copied over again and again, causing high I/O on your slaves.

There is always a `break-even` point, but it depends (as usual) on your 
scenario and business demands.
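If you go the periodic-optimize route instead, it's just a message to the 
update handler, e.g. (host, port and path are the stock defaults, adjust to 
your setup):

    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
         --data-binary '<optimize/>'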

 
 Jonathan


Re: optimize and mergeFactor

2011-02-16 Thread Jonathan Rochkind

Thanks for the answers, more questions below.

On 2/16/2011 3:37 PM, Markus Jelsma wrote:



200,000 stored fields? I assume that number includes your number of documents?
Sounds crazy =)


Nope, I wasn't clear. I have fewer than a dozen stored fields, but the 
value of a stored field can sometimes be as large as 200kb.




You can set mergeFactor to 2, not lower.


Am I right though that manually running an 'optimize' is the equivalent 
of a mergeFactor=1?  So there's no way to get Solr to keep the index in 
an 'always optimized' state, if I'm understanding correctly? Cool. Just 
want to understand what's going on.



This depends on your commit rate and on whether there are a lot of updates and
deletes instead of just adds. Setting it very low will indeed cause a lot of
merging and slow commits. It will also make replication very slow, because
merged files are copied over again and again, causing high I/O on your slaves.

There is always a `break-even` point, but it depends (as usual) on your
scenario and business demands.



There are indeed sadly lots of updates and deletes, which is why I need 
to run optimize periodically. I am aware that this will cause more work 
for replication -- I think this is true whether I manually issue an 
optimize before replication _or_ whether I just keep the mergeFactor 
very low, right? Same issue either way.


So... if I'm going to do lots of updates and deletes, and my other 
option is running an optimize before replication anyway... is there 
any reason it's going to be completely stupid to set the mergeFactor to 
2 on the master?  I realize it'll mean all index files are going to have 
to be replicated, but that would be the case if I ran a manual optimize 
in the same situation before replication too, I think.


Jonathan


Re: optimize and mergeFactor

2011-02-16 Thread Markus Jelsma

 Thanks for the answers, more questions below.
 
 On 2/16/2011 3:37 PM, Markus Jelsma wrote:
  200,000 stored fields? I assume that number includes your number of
  documents? Sounds crazy =)
 
 Nope, I wasn't clear. I have fewer than a dozen stored fields, but the
 value of a stored field can sometimes be as large as 200kb.
 
  You can set mergeFactor to 2, not lower.
 
 Am I right though that manually running an 'optimize' is the equivalent
 of a mergeFactor=1?  So there's no way to get Solr to keep the index in
 an 'always optimized' state, if I'm understanding correctly? Cool. Just
 want to understand what's going on.

That should be it. If I remember correctly, a second segment is always written; 
new updates aren't merged in immediately. 
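If I'm not mistaken there is also a middle ground: the optimize message takes 
a maxSegments attribute, so you can merge down to a handful of segments 
without forcing everything into one, e.g.:

    <!-- posted to the update handler; merges down to at most 2 segments -->
    <optimize maxSegments="2"/>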

 
  This depends on your commit rate and on whether there are a lot of updates
  and deletes instead of just adds. Setting it very low will indeed cause a
  lot of merging and slow commits. It will also make replication very slow,
  because merged files are copied over again and again, causing high I/O on
  your slaves.

  There is always a `break-even` point, but it depends (as usual) on your
  scenario and business demands.
 
 There are indeed sadly lots of updates and deletes, which is why I need
 to run optimize periodically. I am aware that this will cause more work
 for replication -- I think this is true whether I manually issue an
 optimize before replication _or_ whether I just keep the mergeFactor
 very low, right? Same issue either way.

Yes. But having several segments shouldn't make that much of a difference. If 
the search latency is only a few additional milliseconds, then I'd rather have 
a few more segments that get copied over more quickly.

 
 So... if I'm going to do lots of updates and deletes, and my other
 option is running an optimize before replication anyway... is there
 any reason it's going to be completely stupid to set the mergeFactor to
 2 on the master?  I realize it'll mean all index files are going to have
 to be replicated, but that would be the case if I ran a manual optimize
 in the same situation before replication too, I think.

No, it's not stupid if you can accept slower indexing and slower copying of 
files in exchange for very quick searches.
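And since you're optimizing on the master anyway, you can tie replication to 
it so the slaves only ever pull the merged index. A sketch of the master side 
of the Solr 1.4 ReplicationHandler config (names are from the stock example, 
adjust to taste):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <!-- snapshot and replicate only after an optimize, not after every commit -->
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>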

 
 Jonathan