On 6/3/2013 1:06 AM, Yoni Amir wrote:
> Solrconfig.xml -> http://apaste.info/dsbv
>
> Schema.xml -> http://apaste.info/67PI
>
> This solrconfig.xml file has optimization enabled. I had another file
> which I can't locate at the moment, in which I defined a custom merge
> scheduler in order to disable optimization.
>
> When I say 1000 segments, I mean that's the number I saw in Solr UI.
> I assume there were much more files than that.
I think we have a terminology problem here. There's nothing you can put in a solrconfig.xml file to enable optimization. Solr will only optimize when you explicitly send an optimize command to it (there's an example in the P.S. below). There is segment merging, but that's not the same thing. Segment merging is completely normal. It usually happens in the background, and indexing will continue while it's occurring, but if you get too many merges happening at once, that can stop indexing. I have a solution for that:

At the following URL is my indexConfig section, geared towards heavy indexing. The TieredMergePolicy settings are the equivalent of a legacy mergeFactor of 35. I've gone with a lower-than-default ramBufferSizeMB here, to reduce memory usage. The default value for this setting as of version 4.1 is 100: http://apaste.info/4gaD (it's also sketched in the P.S., in case the paste expires)

One thing this configuration does that might directly impact your setup is increase maxMergeCount. I believe the default value for this is 3. This means that if you get more than three "levels" of merging happening at the same time, indexing will stop until the number of levels drops. Because Solr always does the biggest merge first, this can take a really long time. The combination of a large mergeFactor and a larger-than-normal maxMergeCount will ensure that this situation never happens.

If you are not using SSD, don't increase maxThreadCount beyond one. The random-access characteristics of regular hard disks will make things go slower with more threads, not faster. With SSD, increasing the threads can make things go faster.

There are a few high-memory-use things going on in your config/schema. The first thing that jumped out at me is facets. They use a lot of memory. You can greatly reduce the memory use by adding &facet.method=enum to the query, or by making it a handler default (see the P.S.). The default for the method is fc, which means fieldcache. The size of the Lucene fieldcache cannot be directly controlled by Solr, unlike Solr's own caches. It gets as big as it needs to be, and facets using the fc method will put all the facet data for the entire index into the fieldcache.

The second thing that jumped out at me is the fact that all_text is being stored. Apparently this is for highlighting. I will admit that I do not know anything about highlighting, so you might need separate help there. You are using edismax for your query parser, which is perfectly capable of searching all the fields that make up all_text, so in my mind, all_text doesn't need to exist at all.

If you wrote a custom merge scheduler that disables merging, that's probably why you're getting over 1000 segments. Having a really huge number of segments can also cause memory issues, because each one needs its own memory structures. If you've got a few dozen of them, that's no big deal, but 1000+ is.

One side issue: I noticed that you edited and duplicated the /browse handler. The velocity templates are not meant for production use. They are made to illustrate Solr's capabilities without a lot of coding. In order for your audience to use the /browse handler, you have to open your Solr instance up to the entire audience, which might be the entire Internet. Anyone who can reach the Solr interface is capable of erasing or changing your index. Even if you take steps to prevent that, they can send denial-of-service query attacks.

Thanks,
Shawn
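P.S. A few sketches for the points above. First, what I mean by an explicit optimize command: Solr never optimizes on its own, only when a client posts something like this to the /update handler:

    <!-- sent in the body of a POST to /update (Content-Type: text/xml);
         without a command like this, Solr only merges, never optimizes -->
    <optimize/>

The URL parameter form, /update?optimize=true, does the same thing.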
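Since apaste links expire, here's a rough sketch of the indexConfig section I mentioned. The structure is what matters; the exact maxMergeCount and ramBufferSizeMB values shown here are from memory, so treat them as illustrative:

    <indexConfig>
      <!-- TieredMergePolicy settings equivalent to a legacy mergeFactor
           of 35 -->
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnce">35</int>
        <int name="segmentsPerTier">35</int>
      </mergePolicy>
      <!-- a higher maxMergeCount keeps a big merge from stalling indexing;
           leave maxThreadCount at 1 unless the index is on SSD -->
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">6</int>
        <int name="maxThreadCount">1</int>
      </mergeScheduler>
      <!-- lower than the 4.1 default of 100, to reduce memory usage -->
      <ramBufferSizeMB>48</ramBufferSizeMB>
    </indexConfig>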
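For the facet change, you can add &facet.method=enum to every facet query, or set it once as a default on your search handler so clients pick it up automatically. The handler name here is just an example:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <!-- enum walks the term index per facet value instead of loading
             the entire field into the Lucene fieldcache -->
        <str name="facet.method">enum</str>
      </lst>
    </requestHandler>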
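And for dropping all_text: edismax can search the underlying fields directly through the qf parameter. These field names are made up for illustration; list whatever fields currently feed your all_text copyField:

    <requestHandler name="/query" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <!-- hypothetical field names; replace with the real sources of
             the all_text copyField -->
        <str name="qf">title description body</str>
      </lst>
    </requestHandler>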