RE: out of memory during indexing due to large incoming queue
Thanks Shawn, this was very helpful. Indeed I had some terminology confusion regarding segment merging. In any case, I tweaked the parameters you recommended and it helped a lot.

I was wondering about your recommendation to use facet.method=enum. Can you explain the trade-off here? I understand that I gain the benefit of using less memory, but what do I lose? Is it speed?

Also, do you know if there is an answer to my original question in this thread? Solr has a queue of incoming requests which, in my case, kept growing. I looked at the code but couldn't find it; I think maybe it is an implicit queue in the form of Java's concurrent thread pool or something like that. Is it possible to limit the size of this queue, or to determine its size at runtime? This is the last issue I am trying to figure out right now.

Also, to answer your question about the field all_text: all the fields are stored in order to support partial update of documents. Most of the fields are used for highlighting; all_text is used for searching. I'll gladly omit all_text from being stored, but then partial update won't work.

The reason I didn't use edismax to search all the fields is that the list of all fields is very long. Can edismax handle several hundred fields in the list? What about dynamic fields? Edismax requires the list to be fixed in the configuration file, so I can't include dynamic fields there. I could pass the full list in the 'qf' parameter with every search request, but that seems wasteful. Also, what about performance? I was told that the best practice in this case (you have lots of fields and want to search everything) is to copy everything to a catch-all field.
Thanks again, Yoni

-Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Monday, June 03, 2013 17:08 To: solr-user@lucene.apache.org Subject: Re: out of memory during indexing due to large incoming queue

On 6/3/2013 1:06 AM, Yoni Amir wrote: Solrconfig.xml - http://apaste.info/dsbv Schema.xml - http://apaste.info/67PI This solrconfig.xml file has optimization enabled. I had another file, which I can't locate at the moment, in which I defined a custom merge scheduler in order to disable optimization. When I say 1000 segments, I mean that's the number I saw in the Solr UI. I assume there were many more files than that.

I think we have a terminology problem happening here. There's nothing you can put in a solrconfig.xml file to enable optimization. Solr will only optimize when you explicitly send an optimize command to it. There is segment merging, but that's not the same thing. Segment merging is completely normal. Normally it happens in the background and indexing will continue while it's occurring, but if you get too many merges happening at once, that can stop indexing. I have a solution for that: at the following URL is my indexConfig section, geared towards heavy indexing. The TieredMergePolicy settings are the equivalent of a legacy mergeFactor of 35. I've gone with a lower-than-default ramBufferSizeMB here to reduce memory usage; the default value for this setting as of version 4.1 is 100: http://apaste.info/4gaD

One thing this configuration does that might directly impact your setup is increase maxMergeCount. I believe the default value for this is 3, which means that if you get more than three levels of merging happening at the same time, indexing will stop until the number of levels drops. Because Solr always does the biggest merge first, this can really take a long time. The combination of a large mergeFactor and a larger-than-normal maxMergeCount will ensure that this situation never happens.
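For readers following along after the paste links have expired: an indexConfig along the lines Shawn describes would look roughly like this in a Solr 4.x solrconfig.xml. The numbers below are illustrative stand-ins, not a copy of his actual paste:

```xml
<indexConfig>
  <!-- lower-than-default indexing buffer to save heap (4.1+ default is 100) -->
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <!-- roughly equivalent to a legacy mergeFactor of 35 -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- allow more simultaneous merge levels before indexing stalls (default 3) -->
    <int name="maxMergeCount">6</int>
    <!-- keep at 1 on spinning disks; raise only on SSD -->
    <int name="maxThreadCount">1</int>
  </mergeScheduler>
</indexConfig>
```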
If you are not using SSD, don't increase maxThreadCount beyond one. The random-access characteristics of regular hard disks will make things go slower with more threads, not faster. With SSD, increasing the threads can make things go faster.

There are a few high-memory-use things going on in your config/schema. The first thing that jumped out at me is facets. They use a lot of memory. You can greatly reduce the memory use by adding facet.method=enum to the query. The default for the method is fc, which means fieldcache. The size of the Lucene fieldcache cannot be directly controlled by Solr, unlike Solr's own caches. It gets as big as it needs to be, and facets using the fc method will put all the facet data for the entire index into the fieldcache.

The second thing that jumped out at me is the fact that all_text is being stored, apparently for highlighting. I will admit that I do not know anything about highlighting, so you might need separate help there. You are using edismax for your query parser, which is perfectly capable of searching all the fields that make up all_text, so in my mind all_text doesn't need to exist at all. If you wrote a custom merge scheduler that disables
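The facet.method=enum suggestion can be applied per request (as a query-string parameter) or baked into a handler's defaults in solrconfig.xml. A sketch only; the /select handler shape is standard, but the field name `category` is a made-up example to illustrate Solr's per-field `f.<field>.facet.method` override:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- walk the term index instead of loading facet data into the FieldCache -->
    <str name="facet.method">enum</str>
    <!-- hypothetical per-field override: keep fc for one low-cardinality field -->
    <str name="f.category.facet.method">fc</str>
  </lst>
</requestHandler>
```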
Re: out of memory during indexing due to large incoming queue
On 6/17/2013 4:32 AM, Yoni Amir wrote: I was wondering about your recommendation to use facet.method=enum? Can you explain what is the trade-off here? I understand that I gain a benefit by using less memory, but what do I lose? Is it speed?

The problem with facet.method=fc (the default) and memory is that every field and query that you use for faceting ends up separately cached in the FieldCache, and the memory required grows as your index grows. If you only use facets on one or two fields, then the normal method is fine, and subsequent facets will be faster. It does eat a lot of Java heap memory, though ... and the bigger your Java heap is, the more problems you'll have with garbage collection. With enum, Solr must gather the data out of the index on every facet run. If you have plenty of extra memory for the OS disk cache, this is not normally a major issue, because the data will be pulled out of RAM, similar to what happens with fc, except that it's not Java heap memory. The OS is a lot more efficient with how it uses memory than Java is.

Also, do you know if there is an answer to my original question in this thread? Solr has a queue of incoming requests, which, in my case, kept on growing. I looked at the code but couldn't find it, I think maybe it is an implicit queue in the form of Java's concurrent thread pool or something like that. Is it possible to limit the size of this queue, or to determine its size during runtime? This is the last issue that I am trying to figure out right now.

I do not know the answer to this.

Also, to answer your question about the field all_text: all the fields are stored in order to support partial-update of documents. Most of the fields are used for highlighting, all_text is used for searching. I'll gladly omit all_text from being stored, but then partial-update won't work.

Your copyFields will still work just fine with atomic updates even if they are not stored.
Behind the scenes, an atomic update is a delete and an add with the stored data plus the changes... if all your source fields are stored, then the copyField should be generated correctly from all the source fields. The wiki page on the subject actually says that copyField destinations *MUST* be set to stored=false. http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations The reason I didn't use edismax to search all the fields, is because the list of all fields is very long. Can edismax handle several hundred fields in the list? What about dynamic fields? Edismax requires the list to be fixed in the configuration file, so I can't include dynamic fields there. I can pass along the full list in the 'qf' parameter in every search request, but this seems like a waste? Also, what about performance? I was told that the best practice in this case (you have lots of fields and want to search everything) is to copy everything to a catch-all field. If there is ever any situation where you can come up with some searches that only need to search against some of the fields and other searches that need to search against different fields, then you might consider creating different search handlers with different qf lists. If you always want to search against all the fields, then it's probably more efficient to keep your current method. Thanks, Shawn
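In schema.xml terms, the arrangement Shawn describes - every source field stored so atomic updates can rebuild the document, with the catch-all destination unstored - would look something like this. The field name `title`, the `text_general` type, and the wildcard copyField source are assumptions for illustration; the actual schema linked earlier in the thread may differ:

```xml
<!-- source fields stay stored so atomic updates can reconstruct the document -->
<field name="title" type="text_general" indexed="true" stored="true"/>
<!-- catch-all search field: indexed for queries, NOT stored (per the wiki caveat) -->
<field name="all_text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="*" dest="all_text"/>
```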
RE: out of memory during indexing due to large incoming queue
Solrconfig.xml - http://apaste.info/dsbv Schema.xml - http://apaste.info/67PI This solrconfig.xml file has optimization enabled. I had another file, which I can't locate at the moment, in which I defined a custom merge scheduler in order to disable optimization. When I say 1000 segments, I mean that's the number I saw in the Solr UI. I assume there were many more files than that.

Thanks, Yoni

-Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Sunday, June 02, 2013 22:53 To: solr-user@lucene.apache.org Subject: Re: out of memory during indexing due to large incoming queue

On 6/2/2013 12:25 PM, Yoni Amir wrote: Hi Shawn and Shreejay, thanks for the response. Here is some more information:
1) The machine is a virtual machine on an ESX server. It has 4 CPUs and 8GB of RAM. I don't remember which CPU, but something modern enough. It is running Java 7 without any special parameters, with 4GB allocated for Java (-Xmx).
2) After successful indexing, I have 2.5 million documents and a 117GB index. This is the size after it was optimized.
3) I plan to upgrade to 4.3, I just didn't have time. 4.0 beta is what was available at the time that we had a release deadline.
4) The setup is master-slave replication, not SolrCloud. The server I am discussing is the indexing server, and in these tests there were actually no slaves involved and virtually zero searches performed.
5) Attached is my configuration. I tried to disable the warm-up and opening of searchers; it didn't change anything. The commits are done by Solr, using autocommit. The client sends the updates without a commit command.
6) I want to disable optimization, but when I disabled it, the OOME occurred even faster. The number of segments reached around a thousand within an hour or so. I don't know if that's normal or not, but at that point if I restarted Solr, it immediately took about 1GB of heap space just on start-up, instead of the usual 50MB or so.
If I commit less frequently, don't I increase the risk of losing data, e.g., if the power goes down? If I disable optimization, is it necessary to avoid such a large number of segments? Is it possible?

Last part first: losing data is much less of a risk with Solr 4.x if you have enabled the updateLog. We'll need some more info; see the end of the message for specifics.

Right off the bat, I can tell you that with an index that's 117GB, you're going to need a LOT of RAM. Each of my 4.2.1 servers has 42GB of index and about 37 million documents between all the index shards. The web application never uses facets, which tend to use a lot of memory. My index is a lot smaller than yours, and I need a 6GB heap, seeing OOM errors if it's only 4GB. You probably need at least an 8GB heap, and possibly larger. Beyond the amount of memory that Solr itself uses, for good performance you will also need a large amount of memory for OS disk caching. Unless the server is using SSD, you need to allocate at least 64GB of real memory to the virtual machine. If you've got your index on SSD, 32GB might be enough. I've got 64GB total on my servers. http://wiki.apache.org/solr/SolrPerformanceProblems

When you say that there are over 1000 segments, are you seeing 1000 files, or are there literally 1000 segments, giving you between 12000 and 15000 files? Even if your mergeFactor were higher than the default 10, that just shouldn't happen. Can you share your solrconfig.xml and schema.xml? Use a paste website like http://apaste.info and share the URLs.

Thanks, Shawn

Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize.
If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately. Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.
Re: out of memory during indexing due to large incoming queue
A couple of things:
1) Can you give some more details about your setup? Like whether it's cloud or a single instance, how many nodes if it's cloud, the hardware - memory per machine, JVM options, etc.
2) Any specific reason for using 4.0 beta? The latest version is 4.3. I used 4.0 for a few weeks and there were a lot of bugs related to memory and communication between nodes (ZooKeeper).
3) If you haven't seen it already, please go through this wiki page. It's an excellent starting point for troubleshooting memory and indexing issues, especially sections 3 to 7: http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

-- Shreejay

On Sunday, June 2, 2013 at 7:16, Yoni Amir wrote: Hello, I am receiving an OutOfMemoryError during indexing, and after investigating the heap dump I am still missing some information, and I thought this might be a good place for help. I am using Solr 4.0 beta, and I have 5 threads that send update requests to Solr. Each request is a bulk of 100 SolrInputDocuments (using solrj), and my goal is to index around 2.5 million documents. Solr is configured to do a hard commit every 10 seconds, so initially I thought that it could only accumulate 10 seconds' worth of updates in memory, but that's not the case. I can see in a profiler how it accumulates memory over time, even with 4 to 6 GB of memory. It is also configured to optimize with mergeFactor=10.

At first I thought that optimization is a blocking, synchronous operation. It is, in the sense that the index can't be updated during optimization. However, it is not synchronous in the sense that the update request coming from my code is not blocked - Solr just returns an OK response, even while the index is optimizing. This indicates that Solr has an internal queue of inbound requests, and that the OK response just means the request is in the queue. I got confirmation of this from a friend who is a Solr expert (or so I hope).
My main question is: how can I put a bound on this internal queue, and make update requests synchronous in case the queue is full? Put another way, I need to know when Solr is really ready to receive more requests, so I don't overload it and cause an OOME. I performed several tests, with slow and fast disks, and on the really fast disks the problem didn't occur. However, I can't demand such fast disks from all our clients, and even with a fast disk the problem will occur eventually when I try to index 10 million documents. I also tried to perform indexing with optimization disabled, but it didn't help.

Thanks, Yoni
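For what it's worth, Solr itself doesn't expose such a queue; the queueing happens in the servlet container. With the Jetty that ships with Solr 4.x, the nearest knobs are the thread pool size and the connector's accept backlog, configured in etc/jetty.xml. This is a sketch only - class and element names vary across Jetty versions, so check it against the jetty.xml bundled with your Solr:

```xml
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <!-- cap the number of threads handling requests concurrently -->
  <Set name="ThreadPool">
    <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <Set name="maxThreads">50</Set>
    </New>
  </Set>
  <Call name="addConnector">
    <Arg>
      <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
        <Set name="port">8983</Set>
        <!-- TCP accept backlog: connections beyond this are refused, not queued -->
        <Set name="acceptQueueSize">100</Set>
      </New>
    </Arg>
  </Call>
</Configure>
```

Refusing connections makes the back-pressure visible to the indexing client, which can then retry with a delay instead of silently piling work onto the server.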
Re: out of memory during indexing due to large incoming queue
On 6/2/2013 8:16 AM, Yoni Amir wrote: Hello, I am receiving OutOfMemoryError during indexing, and after investigating the heap dump, I am still missing some information, and I thought this might be a good place for help. I am using Solr 4.0 beta, and I have 5 threads that send update requests to Solr. Each request is a bulk of 100 SolrInputDocuments (using solrj), and my goal is to index around 2.5 million documents. Solr is configured to do a hard-commit every 10 seconds, so initially I thought that it can only accumulate in memory 10 seconds worth of updates, but that's not the case. I can see in a profiler how it accumulates memory over time, even with 4 to 6 GB of memory. It is also configured to optimize with mergeFactor=10.

4.0-BETA came out several months ago. Even at the time, support for the alpha and beta releases was limited. Now it has been superseded by 4.0.0, 4.1.0, 4.2.0, 4.2.1, and 4.3.0, all of which are full releases. There is a 4.3.1 release currently in the works. Please upgrade.

Ten seconds is a very short interval for hard commits, even if you have openSearcher=false. Frequent hard commits can cause a whole host of problems. It's better to have an interval of several minutes, and I wouldn't go less than a minute. Soft commits can be much more frequent, but if you are frequently opening new searchers, you'll probably want to disable cache warming.

On optimization: don't do it unless you absolutely must. Most of the time, optimization is only needed if you delete a lot of documents and you need to get them removed from your index. If you must optimize to get rid of deleted documents, do it on a very long interval (once a day, once a week) and pause indexing during optimization.

You haven't said anything about your index size, java heap size, total RAM, etc.
With those numbers I could offer some guesses about what you need, but I'll warn you that they would only be guesses - watching a system with real data under load is the only way to get concrete information. Here are some basic guidelines on performance problems and RAM information: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
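Shawn's commit and durability advice, rendered as an updateHandler sketch for solrconfig.xml; the intervals below are illustrative, not prescribed values:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- transaction log: makes infrequent hard commits safe against power loss -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <!-- hard commit: flush to disk every few minutes, without opening a searcher -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: controls how quickly changes become visible to searches -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```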
RE: out of memory during indexing due to large incoming queue
Hi Shawn and Shreejay, thanks for the response. Here is some more information:
1) The machine is a virtual machine on an ESX server. It has 4 CPUs and 8GB of RAM. I don't remember which CPU, but something modern enough. It is running Java 7 without any special parameters, with 4GB allocated for Java (-Xmx).
2) After successful indexing, I have 2.5 million documents and a 117GB index. This is the size after it was optimized.
3) I plan to upgrade to 4.3, I just didn't have time. 4.0 beta is what was available at the time that we had a release deadline.
4) The setup is master-slave replication, not SolrCloud. The server I am discussing is the indexing server, and in these tests there were actually no slaves involved and virtually zero searches performed.
5) Attached is my configuration. I tried to disable the warm-up and opening of searchers; it didn't change anything. The commits are done by Solr, using autocommit. The client sends the updates without a commit command.
6) I want to disable optimization, but when I disabled it, the OOME occurred even faster. The number of segments reached around a thousand within an hour or so. I don't know if that's normal or not, but at that point if I restarted Solr, it immediately took about 1GB of heap space just on start-up, instead of the usual 50MB or so.

If I commit less frequently, don't I increase the risk of losing data, e.g., if the power goes down? If I disable optimization, is it necessary to avoid such a large number of segments? Is it possible?

Thanks again, Yoni

-Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Sunday, June 02, 2013 18:05 To: solr-user@lucene.apache.org Subject: Re: out of memory during indexing due to large incoming queue

On 6/2/2013 8:16 AM, Yoni Amir wrote: Hello, I am receiving OutOfMemoryError during indexing, and after investigating the heap dump, I am still missing some information, and I thought this might be a good place for help.
I am using Solr 4.0 beta, and I have 5 threads that send update requests to Solr. Each request is a bulk of 100 SolrInputDocuments (using solrj), and my goal is to index around 2.5 million documents. Solr is configured to do a hard-commit every 10 seconds, so initially I thought that it can only accumulate in memory 10 seconds worth of updates, but that's not the case. I can see in a profiler how it accumulates memory over time, even with 4 to 6 GB of memory. It is also configured to optimize with mergeFactor=10. 4.0-BETA came out several months ago. Even at the time, support for the alpha and beta releases was limited. Now it has been superseded by 4.0.0, 4.1.0, 4.2.0, 4.2.1, and 4.3.0, all of which are full releases. There is a 4.3.1 release currently in the works. Please upgrade. Ten seconds is a very short interval for hard commits, even if you have openSearcher=false. Frequent hard commits can cause a whole host of problems. It's better to have an interval of several minutes, and I wouldn't go less than a minute. Soft commits can be much more frequent, but if you are frequently opening new searchers, you'll probably want to disable cache warming. On optimization: don't do it unless you absolutely must. Most of the time, optimization is only needed if you delete a lot of documents and you need to get them removed from your index. If you must optimize to get rid of deleted documents, do it on a very long interval (once a day, once a week) and pause indexing during optimization. You haven't said anything about your index size, java heap size, total RAM, etc. With those numbers I could offer some guesses about what you need, but I'll warn you that they would only be guesses - watching a system with real data under load is the only way to get concrete information. 
Here are some basic guidelines on performance problems and RAM information: http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks, Shawn
Re: out of memory during indexing due to large incoming queue
On 6/2/2013 12:25 PM, Yoni Amir wrote: Hi Shawn and Shreejay, thanks for the response. Here is some more information: 1) The machine is a virtual machine on an ESX server. It has 4 CPUs and 8GB of RAM. I don't remember which CPU, but something modern enough. It is running Java 7 without any special parameters, with 4GB allocated for Java (-Xmx). 2) After successful indexing, I have 2.5 million documents and a 117GB index. This is the size after it was optimized. 3) I plan to upgrade to 4.3, I just didn't have time. 4.0 beta is what was available at the time that we had a release deadline. 4) The setup is master-slave replication, not SolrCloud. The server I am discussing is the indexing server, and in these tests there were actually no slaves involved and virtually zero searches performed. 5) Attached is my configuration. I tried to disable the warm-up and opening of searchers; it didn't change anything. The commits are done by Solr, using autocommit. The client sends the updates without a commit command. 6) I want to disable optimization, but when I disabled it, the OOME occurred even faster. The number of segments reached around a thousand within an hour or so. I don't know if that's normal or not, but at that point if I restarted Solr, it immediately took about 1GB of heap space just on start-up, instead of the usual 50MB or so. If I commit less frequently, don't I increase the risk of losing data, e.g., if the power goes down? If I disable optimization, is it necessary to avoid such a large number of segments? Is it possible?

Last part first: Losing data is much less of a risk with Solr 4.x, if you have enabled the updateLog. We'll need some more info. See the end of the message for specifics. Right off the bat, I can tell you that with an index that's 117GB, you're going to need a LOT of RAM. Each of my 4.2.1 servers has 42GB of index and about 37 million documents between all the index shards.
The web application never uses facets, which tend to use a lot of memory. My index is a lot smaller than yours, and I need a 6GB heap, seeing OOM errors if it's only 4GB. You probably need at least an 8GB heap, and possibly larger. Beyond the amount of memory that Solr itself uses, for good performance you will also need a large amount of memory for OS disk caching. Unless the server is using SSD, you need to allocate at least 64GB of real memory to the virtual machine. If you've got your index on SSD, 32GB might be enough. I've got 64GB total on my servers. http://wiki.apache.org/solr/SolrPerformanceProblems When you say that there are over 1000 segments, are you seeing 1000 files, or are there literally 1000 segments, giving you between 12000 and 15000 files? Even if your mergeFactor were higher than the default 10, that just shouldn't happen. Can you share your solrconfig.xml and schema.xml? Use a paste website like http://apaste.info and share the URLs. Thanks, Shawn