Re: optimize and mergeFactor
> In my own Solr 1.4, I am pretty sure that running an index optimize does give me significantly better performance. Perhaps because I use some largeish (not huge, maybe as large as 200k) stored fields.

200,000 stored fields? I assume that number includes your number of documents? Sounds crazy =)

> So I'm interested in always keeping my index optimized. Am I right that if I set mergeFactor to '1', essentially my index will always be optimized after every commit, and actually running 'optimize' will be redundant?

You can set mergeFactor to 2, not lower.

> What are the possible negative repercussions of setting mergeFactor to 1? Is this a really bad idea? If not 1, what about some other lower-than-usually-recommended value like 2 or 3? Anyone done this? I imagine it will slow down my commits, but if the alternative is running optimize a lot anyway, I wonder at what point I reach 'break even' (if I optimize after every single commit, clearly I might as well just set the mergeFactor low, right? But if I optimize after every X documents or Y commits, I don't know what values of X/Y are break-even).

That depends on your commit rate and on whether there are a lot of updates and deletes rather than plain adds. Setting it very low will indeed cause a lot of merging and slow commits. It will also be very slow in replication, because merged files are copied over again and again, causing high I/O on your slaves. There is always a 'break even' point, but it depends (as usual) on your scenario and business demands.

> Jonathan
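(For readers following along: in a stock Solr 1.4 solrconfig.xml the knob being discussed lives in the indexDefaults/mainIndex sections, and an explicit optimize is just an XML message POSTed to /update. A minimal sketch; the values shown are illustrative, not recommendations:)

<!-- solrconfig.xml (Solr 1.4): mergeFactor controls how many segments
     may accumulate before they are merged; 2 is the lowest legal value,
     10 the shipped default. -->
<indexDefaults>
  <mergeFactor>2</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>

<!-- An explicit optimize, sent as the body of a POST to /solr/update -->
<optimize waitSearcher="true"/>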
Re: optimize and mergeFactor
Thanks for the answers, more questions below.

On 2/16/2011 3:37 PM, Markus Jelsma wrote:
> 200,000 stored fields? I assume that number includes your number of documents? Sounds crazy =)

Nope, I wasn't clear. I have fewer than a dozen stored fields, but the value of a stored field can sometimes be as large as 200kb.

> You can set mergeFactor to 2, not lower.

Am I right though that manually running an 'optimize' is the equivalent of a mergeFactor of 1? So there's no way to get Solr to keep the index in an 'always optimized' state, if I'm understanding correctly? Cool. Just want to understand what's going on.

> That depends on your commit rate and on whether there are a lot of updates and deletes rather than plain adds. Setting it very low will indeed cause a lot of merging and slow commits. It will also be very slow in replication, because merged files are copied over again and again, causing high I/O on your slaves. There is always a 'break even' point, but it depends (as usual) on your scenario and business demands.

There are indeed sadly lots of updates and deletes, which is why I need to run optimize periodically. I am aware that this will cause more work for replication -- I think this is true whether I manually issue an optimize before replication _or_ whether I just keep the mergeFactor very low, right? Same issue either way.

So... if I'm going to do lots of updates and deletes, and my other option is running an optimize before replication anyway, is there any reason it's going to be completely stupid to set the mergeFactor to 2 on the master? I realize it'll mean all index files are going to have to be replicated, but that would be the case if I ran a manual optimize in the same situation before replication too, I think.

Jonathan
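(For context, this is roughly the master-side replication setup being discussed, using the Solr 1.4 ReplicationHandler; the confFiles list is illustrative:)

<!-- solrconfig.xml on the master: slaves poll this handler and pull
     whichever index files changed since their last replicated version,
     so a full merge/optimize means a full copy. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>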
Re: optimize and mergeFactor
> Thanks for the answers, more questions below. On 2/16/2011 3:37 PM, Markus Jelsma wrote:

>> 200,000 stored fields? I assume that number includes your number of documents? Sounds crazy =)

> Nope, I wasn't clear. I have fewer than a dozen stored fields, but the value of a stored field can sometimes be as large as 200kb.

>> You can set mergeFactor to 2, not lower.

> Am I right though that manually running an 'optimize' is the equivalent of a mergeFactor of 1? So there's no way to get Solr to keep the index in an 'always optimized' state, if I'm understanding correctly? Cool. Just want to understand what's going on.

That should be it. If I remember correctly, a second segment is always written; new updates aren't merged in immediately.

>> That depends on your commit rate and on whether there are a lot of updates and deletes rather than plain adds. Setting it very low will indeed cause a lot of merging and slow commits. It will also be very slow in replication, because merged files are copied over again and again, causing high I/O on your slaves. There is always a 'break even' point, but it depends (as usual) on your scenario and business demands.

> There are indeed sadly lots of updates and deletes, which is why I need to run optimize periodically. I am aware that this will cause more work for replication -- I think this is true whether I manually issue an optimize before replication _or_ whether I just keep the mergeFactor very low, right? Same issue either way.

Yes. But having several segments shouldn't make that much of a difference. If the cost is just a few additional milliseconds of search latency, then I'd rather have a few more, smaller segments that are copied over more quickly.

> So... if I'm going to do lots of updates and deletes, and my other option is running an optimize before replication anyway, is there any reason it's going to be completely stupid to set the mergeFactor to 2 on the master? I realize it'll mean all index files are going to have to be replicated, but that would be the case if I ran a manual optimize in the same situation before replication too, I think.

No, it's not stupid, as long as you accept slow indexing and slow copying of files in exchange for very quick searches.

> Jonathan
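(One way to watch this trade-off in practice, assuming the stock LukeRequestHandler is mapped at /admin/luke: its index section reports how many deleted documents the index is still carrying, which is exactly what periodic optimizes reclaim. The values below are made up:)

<!-- Excerpt of a response to GET /solr/admin/luke?numTerms=0 -->
<lst name="index">
  <int name="numDocs">1000000</int>
  <int name="maxDoc">1250000</int>  <!-- maxDoc - numDocs = deleted docs not yet merged away -->
  <bool name="optimized">false</bool>
  <bool name="hasDeletions">true</bool>
</lst>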