Thanks Shawn,

This was very helpful. I did indeed have a terminology problem regarding segment merging. In any case, I tweaked the parameters that you recommended and it helped a lot.
I was wondering about your recommendation to use facet.method=enum. Can you explain what the trade-off is here? I understand that I gain a benefit by using less memory, but what will I lose? Is it speed?

Also, do you know if there is an answer to my original question in this thread? Solr has a queue of incoming requests, which, in my case, kept on growing. I looked at the code but couldn't find it; I think maybe it is an implicit queue in the form of Java's concurrent thread pool or something like that. Is it possible to limit the size of this queue, or to determine its size at runtime? This is the last issue that I am trying to figure out right now.

Also, to answer your question about the field all_text: all the fields are stored in order to support partial update of documents. Most of the fields are used for highlighting; all_text is used for searching. I'll gladly stop storing all_text, but then partial update won't work.

The reason I didn't use edismax to search all the fields is that the list of all fields is very long. Can edismax handle several hundred fields in the list? What about dynamic fields? Edismax requires the list to be fixed in the configuration file, so I can't include dynamic fields there. I can pass along the full list in the 'qf' parameter in every search request, but this seems like a waste. Also, what about performance? I was told that the best practice in this case (you have lots of fields and want to search everything) is to copy everything to a catch-all field.

Thanks again,
Yoni

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Monday, June 03, 2013 17:08
To: solr-user@lucene.apache.org
Subject: Re: out of memory during indexing do to large incoming queue

On 6/3/2013 1:06 AM, Yoni Amir wrote:
> Solrconfig.xml -> http://apaste.info/dsbv
>
> Schema.xml -> http://apaste.info/67PI
>
> This solrconfig.xml file has optimization enabled.
> I had another file which I can't locate at the moment, in which I defined a custom merge scheduler in order to disable optimization.
>
> When I say 1000 segments, I mean that's the number I saw in Solr UI. I assume there were many more files than that.

I think we have a terminology problem happening here. There's nothing you can put in a solrconfig.xml file to enable optimization. Solr will only optimize when you explicitly send an optimize command to it. There is segment merging, but that's not the same thing.

Segment merging is completely normal. Normally it happens in the background and indexing continues while it's occurring, but if you get too many merges happening at once, that can stop indexing. I have a solution for that:

At the following URL is my indexConfig section, geared towards heavy indexing. The TieredMergePolicy settings are the equivalent of a legacy mergeFactor of 35. I've gone with a lower-than-default ramBufferSizeMB here, to reduce memory usage. The default value for this setting as of version 4.1 is 100:

http://apaste.info/4gaD

One thing that this configuration does which might directly impact your setup is increase the maxMergeCount. I believe the default value for this is 3. This means that if you get more than three "levels" of merging happening at the same time, indexing will stop until the number of levels drops. Because Solr always does the biggest merge first, this can really take a long time. The combination of a large mergeFactor and a larger-than-normal maxMergeCount will ensure that this situation never happens.

If you are not using SSD, don't increase maxThreadCount beyond one. The random-access characteristics of regular hard disks will make things go slower with more threads, not faster. With SSD, increasing the threads can make things go faster.

There are a few high-memory-use things going on in your config/schema. The first thing that jumped out at me is facets. They use a lot of memory.
You can greatly reduce the memory use by adding &facet.method=enum to the query. The default for the method is fc, which means fieldcache. The size of the Lucene fieldcache cannot be directly controlled by Solr, unlike Solr's own caches. It gets as big as it needs to be, and facets using the fc method will put all the facet data for the entire index in the fieldcache.

The second thing that jumped out at me is the fact that all_text is being stored. Apparently this is for highlighting. I will admit that I do not know anything about highlighting, so you might need separate help there. You are using edismax for your query parser, which is perfectly capable of searching all the fields that make up all_text, so in my mind, all_text doesn't need to exist at all.

If you wrote a custom merge scheduler that disables merging, that's probably why you're getting over 1000 segments. Having a really huge number of segments can also cause memory issues, because each one needs its own memory structures. If you've got a few dozen of them, that's no big deal, but 1000+ is.

One side issue: I noticed that you edited and duplicated the /browse handler. The velocity templates are not meant for production use. They are made to illustrate Solr's capability without a lot of coding. In order for your audience to use the /browse handler, you have to open your Solr instance up to the entire audience, which might be the entire Internet. Anyone who can reach the Solr interface is capable of erasing or changing your index. Even if you take steps to prevent that, they can still send denial-of-service queries.

Thanks,
Shawn
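
For anyone following the thread, the facet.method=enum suggestion comes down to one extra request parameter. Below is a minimal Python sketch of how a client might build such a request URL. The base URL, the core name (collection1) and the facet field names (category, author) are made up for illustration and do not come from the configuration discussed above; only the parameter names q, facet, facet.field and facet.method are standard Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_facet_url(base_url, query, facet_fields, method="enum"):
    """Build a Solr /select URL that asks for facets using the given facet.method.

    base_url, query and facet_fields are supplied by the caller; nothing here
    contacts a server, it only assembles the query string.
    """
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.method", method),  # "enum" instead of the default "fc"
    ]
    # facet.field is repeated once per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, core and field names, for illustration only.
url = build_facet_url(
    "http://localhost:8983/solr/collection1",
    "all_text:memory",
    ["category", "author"],
)
print(url)

# Parse the query string back to confirm which parameters were sent.
sent = parse_qs(urlsplit(url).query)
print(sent["facet.method"], sent["facet.field"])
```

Setting the parameter per request like this makes it easy to compare memory use and response time between enum and fc on the same index before changing any handler defaults.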