Re: Solr caching the index file makes the server refuse to serve
10 billion documents on 12 cores is over 800M documents/shard at best. This is _very_ aggressive for a shard. Could you give more information about your setup? I've seen 250M docs fit in 12G of memory. I've also seen 10M documents strain 32G of memory. Details matter a lot. The only way I've been able to determine what a reasonable number of docs is, for my queries on my data, is to do "the sizing exercise", which I've outlined here: https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ While this was written over 5 years ago, it's still accurate.

Best,
Erick

On Thu, Aug 24, 2017 at 6:10 PM, 陈永龙 wrote:
> Hello,
>
> ENV: solrcloud 6.3
> 3 * Dell servers
> 128G RAM, 12 cores, 4.3T disk per server
> 3 Solr nodes per server
> 20G heap per node (with parameter -m 20G)
> 10 billion documents total
>
> Problem: When we start solrcloud, the cached index will use 98% or more of memory. And if we continue to index documents (batch commits of 10,000 documents), one or more servers will refuse to serve. We cannot log in via ssh; the machine even refuses the monitor.
>
> So, how can I limit Solr's index-caching behavior?
>
> Any help appreciated!
Solr caching the index file makes the server refuse to serve
Hello,

ENV: solrcloud 6.3
3 * Dell servers
128G RAM, 12 cores, 4.3T disk per server
3 Solr nodes per server
20G heap per node (with parameter -m 20G)
10 billion documents total

Problem: When we start solrcloud, the cached index will use 98% or more of memory. And if we continue to index documents (batch commits of 10,000 documents), one or more servers will refuse to serve. We cannot log in via ssh; the machine even refuses the monitor.

So, how can I limit Solr's index-caching behavior?

Any help appreciated!
Re: Solr Caching (documentCache) not working
I think this is expected. As Shawn mentioned, your hard commits have openSearcher=false, so they flush changes to disk but don't force a re-open of the active searcher. By contrast, softCommit sets openSearcher=true; the point of softCommit is to make the changes visible, and to do that you have to re-open a searcher. Currently, most of the Solr caches are searcher-based, so opening a new searcher means creating (and optionally warming) a new cache. I know there is work in progress to make these caches more segment-based (which would lead to more re-use between searchers), but currently each commit will create a new set of caches. You can warm those caches, but that's where the trade-off comes in: warming a cache with records takes time and processing, so if you are committing frequently, the warming doesn't really have time to take effect. https://wiki.apache.org/solr/SolrCaching explains this (probably better than me), and from that page, the documentCache can't be auto-warmed since doc IDs can change between searchers.

On 18 August 2015 at 06:19, Maulin Rathod mrat...@asite.com wrote:

Hi Shawn, Thanks for your feedback. In our scenario documents are added frequently (approx. 10 documents per minute) and we want to make them available for search in near real time (within 5 seconds). Even if we set autoSoftCommit to 5 seconds (so that documents become searchable after 5 seconds), it flushes all documents from the documentCache. Just wanted to understand if we are doing something wrong or if this is Solr's expected behavior.

<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

Regards, Maulin

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: 17 August 2015 19:02
To: solr-user@lucene.apache.org
Subject: Re: Solr Caching (documentCache) not working

On 8/17/2015 7:04 AM, Maulin Rathod wrote: We have observed that intermittently querying becomes slower when the documentCache becomes empty.
The documentCache is getting flushed whenever a new document is added to the collection. Is there any way by which we can ensure that newly added documents are visible without losing data in the documentCache? We are trying to use soft commit but it also flushes all documents in the documentCache.

[snip]

<autoSoftCommit>
  <maxTime>50</maxTime>
</autoSoftCommit>

You are doing a soft commit within 50 milliseconds of adding a new document. Solr can have severe performance problems when autoSoftCommit is set to 1000 -- one second. 50 milliseconds is one twentieth of a very low value that is known to cause problems; it can make the problem much more than 20 times worse.

Please read this article: http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Note one particular section, which says the following: Don’t listen to your product manager who says we need no more than 1 second latency. You need to set your commit interval as long as you possibly can. I personally wouldn't go longer than 60 seconds, or 30 seconds if the commits complete particularly fast. It should be several minutes if that will meet your needs. When your commit interval is very low, Solr's caches can become useless, as you've noticed.

TL;DR info: Your autoCommit settings have openSearcher set to false, so they do not matter for the problem you have described. I would probably increase that to 5 minutes rather than 15 seconds, but that is not very important here; 15 seconds for hard commits that don't open a new searcher is known to have a low impact on performance. Low impact isn't the same as NO impact, so I keep this interval long as well.

Thanks, Shawn
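Shawn's advice maps onto solrconfig.xml roughly as follows. This is a sketch: the 5-minute hard-commit interval comes from his suggestion above, while the 60-second soft commit is purely illustrative; it should be as long as your visibility requirement allows.

```xml
<!-- Hard commits every 5 minutes: flush to disk and truncate the
     transaction log, but do NOT open a new searcher. -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commits control visibility; each one opens a new searcher
     and therefore discards the searcher-based caches, so keep this
     as high as the application can tolerate (illustrative: 60s). -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>
```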
Re: Solr Caching (documentCache) not working
On 8/18/2015 2:30 AM, Daniel Collins wrote: I think this is expected. As Shawn mentioned, your hard commits have openSearcher=false, so they flush changes to disk but don't force a re-open of the active searcher. By contrast, softCommit sets openSearcher=true; the point of softCommit is to make the changes visible, and to do that you have to re-open a searcher. Currently, most of the Solr caches are searcher-based, so opening a new searcher means creating (and optionally warming) a new cache. I know there is work in progress to make these caches more segment-based (which would lead to more re-use between searchers), but currently each commit will create a new set of caches. You can warm those caches, but that's where the trade-off comes in: warming a cache with records takes time and processing, so if you are committing frequently, the warming doesn't really have time to take effect. https://wiki.apache.org/solr/SolrCaching explains this (probably better than me), and from that page, the documentCache can't be auto-warmed since doc IDs can change between searchers.

On 18 August 2015 at 06:19, Maulin Rathod mrat...@asite.com wrote: Hi Shawn, Thanks for your feedback. In our scenario documents are added frequently (approx. 10 documents per minute) and we want to make them available for search in near real time (within 5 seconds). Even if we set autoSoftCommit to 5 seconds (so that documents become searchable after 5 seconds), it flushes all documents from the documentCache. Just wanted to understand if we are doing something wrong or if this is Solr's expected behavior.

10 documents per minute won't even make Solr breathe hard. Under the right conditions, Solr can index thousands of documents per second. Unless there are a very large number of new documents, the length of time that a commit takes usually has no correlation to the number of documents that were added.
It has more to do with the number of documents in the entire index, the total index size, and how you have configured your cache warming. Daniel has provided good information regarding how searchers and caches interact. Thanks, Shawn
Re: Solr Caching (documentCache) not working
On Mon, Aug 17, 2015 at 4:36 PM, Daniel Collins danwcoll...@gmail.com wrote: we had to turn off ALL the Solr caches (warming is useless at that kind of frequency Warming and caching are related, but different. Caching still normally makes sense without warming, and Solr is generally written with the assumption that caches are present. -Yonik
Re: Solr Caching (documentCache) not working
On 8/17/2015 7:04 AM, Maulin Rathod wrote: We have observed that intermittently querying becomes slower when the documentCache becomes empty. The documentCache is getting flushed whenever a new document is added to the collection. Is there any way by which we can ensure that newly added documents are visible without losing data in the documentCache? We are trying to use soft commit but it also flushes all documents in the documentCache.

[snip]

<autoSoftCommit>
  <maxTime>50</maxTime>
</autoSoftCommit>

You are doing a soft commit within 50 milliseconds of adding a new document. Solr can have severe performance problems when autoSoftCommit is set to 1000 -- one second. 50 milliseconds is one twentieth of a very low value that is known to cause problems; it can make the problem much more than 20 times worse.

Please read this article: http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Note one particular section, which says the following: Don’t listen to your product manager who says we need no more than 1 second latency. You need to set your commit interval as long as you possibly can. I personally wouldn't go longer than 60 seconds, or 30 seconds if the commits complete particularly fast. It should be several minutes if that will meet your needs. When your commit interval is very low, Solr's caches can become useless, as you've noticed.

TL;DR info: Your autoCommit settings have openSearcher set to false, so they do not matter for the problem you have described. I would probably increase that to 5 minutes rather than 15 seconds, but that is not very important here; 15 seconds for hard commits that don't open a new searcher is known to have a low impact on performance. Low impact isn't the same as NO impact, so I keep this interval long as well.

Thanks, Shawn
Solr Caching (documentCache) not working
Hi,

We are using Solr Cloud version 5.2. We have observed that intermittently querying becomes slower when the documentCache becomes empty. The documentCache is getting flushed whenever a new document is added to the collection. Is there any way by which we can ensure that newly added documents are visible without losing data in the documentCache? We are trying to use soft commit but it also flushes all documents in the documentCache. We have the following setting in solrconfig.xml:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>50</maxTime>
</autoSoftCommit>

Regards, Maulin
RE: Solr Caching (documentCache) not working
Hi Shawn,

Thanks for your feedback. In our scenario documents are added frequently (approx. 10 documents per minute) and we want to make them available for search in near real time (within 5 seconds). Even if we set autoSoftCommit to 5 seconds (so that documents become searchable after 5 seconds), it flushes all documents from the documentCache. Just wanted to understand if we are doing something wrong or if this is Solr's expected behavior.

<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

Regards, Maulin

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: 17 August 2015 19:02
To: solr-user@lucene.apache.org
Subject: Re: Solr Caching (documentCache) not working

On 8/17/2015 7:04 AM, Maulin Rathod wrote: We have observed that intermittently querying becomes slower when the documentCache becomes empty. The documentCache is getting flushed whenever a new document is added to the collection. Is there any way by which we can ensure that newly added documents are visible without losing data in the documentCache? We are trying to use soft commit but it also flushes all documents in the documentCache.

[snip]

<autoSoftCommit>
  <maxTime>50</maxTime>
</autoSoftCommit>

You are doing a soft commit within 50 milliseconds of adding a new document. Solr can have severe performance problems when autoSoftCommit is set to 1000 -- one second. 50 milliseconds is one twentieth of a very low value that is known to cause problems; it can make the problem much more than 20 times worse. Please read this article: http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Note one particular section, which says the following: Don’t listen to your product manager who says we need no more than 1 second latency. You need to set your commit interval as long as you possibly can. I personally wouldn't go longer than 60 seconds, or 30 seconds if the commits complete particularly fast. It should be several minutes if that will meet your needs.
When your commit interval is very low, Solr's caches can become useless, as you've noticed. TL;DR info: Your autoCommit settings have openSearcher set to false, so they do not matter for the problem you have described. I would probably increase that to 5 minutes rather than 15 seconds, but that is not very important here; 15 seconds for hard commits that don't open a new searcher is known to have a low impact on performance. Low impact isn't the same as NO impact, so I keep this interval long as well. Thanks, Shawn
Re: Solr Caching (documentCache) not working
On Mon, Aug 17, 2015 at 11:36 PM, Daniel Collins danwcoll...@gmail.com wrote: Just to open the can of worms, it *can* be possible to have very low commit times; we have 250ms currently and are in production with that. But it does come with pain (no such thing as a free lunch!): we had to turn off ALL the Solr caches

Gentlemen, excuse me for hijacking; here is a small segmentation lunchbox:
- segmented filters, which are much cheaper on commit: http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
- docValues facets on steroids: https://issues.apache.org/jira/browse/SOLR-7730 (since 5.3)

But the documentCache hasn't been sliced per segment yet. Thus, I'd prefer that https://issues.apache.org/jira/browse/SOLR-7937 hang around for a while and be pursued later.

(warming is useless at that kind of frequency; it will take longer to warm the cache than the time before the next commit), and throw a lot of RAM and expensive SSDs at the problem. That said, Shawn's advice is correct: anything less than a 1s commit shouldn't be needed for most users, and I would concur with staying away from it unless you absolutely decide you have to have it. You only go that route if you are prepared to commit (no pun intended!) a fair amount of time, money and resources to investigating and dealing with issues. We will have a talk at Revolution this year about some of the scale and latency issues we have to deal with (blatant plug for my team lead who's giving the talk!)

On 17 August 2015 at 14:31, Shawn Heisey apa...@elyograg.org wrote: On 8/17/2015 7:04 AM, Maulin Rathod wrote: We have observed that intermittently querying becomes slower when the documentCache becomes empty. The documentCache is getting flushed whenever a new document is added to the collection. Is there any way by which we can ensure that newly added documents are visible without losing data in the documentCache? We are trying to use soft commit but it also flushes all documents in the documentCache.
[snip]

<autoSoftCommit>
  <maxTime>50</maxTime>
</autoSoftCommit>

You are doing a soft commit within 50 milliseconds of adding a new document. Solr can have severe performance problems when autoSoftCommit is set to 1000 -- one second. 50 milliseconds is one twentieth of a very low value that is known to cause problems; it can make the problem much more than 20 times worse. Please read this article: http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Note one particular section, which says the following: Don’t listen to your product manager who says we need no more than 1 second latency. You need to set your commit interval as long as you possibly can. I personally wouldn't go longer than 60 seconds, or 30 seconds if the commits complete particularly fast. It should be several minutes if that will meet your needs. When your commit interval is very low, Solr's caches can become useless, as you've noticed. TL;DR info: Your autoCommit settings have openSearcher set to false, so they do not matter for the problem you have described. I would probably increase that to 5 minutes rather than 15 seconds, but that is not very important here; 15 seconds for hard commits that don't open a new searcher is known to have a low impact on performance. Low impact isn't the same as NO impact, so I keep this interval long as well. Thanks, Shawn

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Solr Caching (documentCache) not working
Just to open the can of worms, it *can* be possible to have very low commit times; we have 250ms currently and are in production with that. But it does come with pain (no such thing as a free lunch!): we had to turn off ALL the Solr caches (warming is useless at that kind of frequency; it will take longer to warm the cache than the time before the next commit), and throw a lot of RAM and expensive SSDs at the problem. That said, Shawn's advice is correct: anything less than a 1s commit shouldn't be needed for most users, and I would concur with staying away from it unless you absolutely decide you have to have it. You only go that route if you are prepared to commit (no pun intended!) a fair amount of time, money and resources to investigating and dealing with issues. We will have a talk at Revolution this year about some of the scale and latency issues we have to deal with (blatant plug for my team lead who's giving the talk!)

On 17 August 2015 at 14:31, Shawn Heisey apa...@elyograg.org wrote: On 8/17/2015 7:04 AM, Maulin Rathod wrote: We have observed that intermittently querying becomes slower when the documentCache becomes empty. The documentCache is getting flushed whenever a new document is added to the collection. Is there any way by which we can ensure that newly added documents are visible without losing data in the documentCache? We are trying to use soft commit but it also flushes all documents in the documentCache.

[snip]

<autoSoftCommit>
  <maxTime>50</maxTime>
</autoSoftCommit>

You are doing a soft commit within 50 milliseconds of adding a new document. Solr can have severe performance problems when autoSoftCommit is set to 1000 -- one second. 50 milliseconds is one twentieth of a very low value that is known to cause problems; it can make the problem much more than 20 times worse.
Please read this article: http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Note one particular section, which says the following: Don’t listen to your product manager who says we need no more than 1 second latency. You need to set your commit interval as long as you possibly can. I personally wouldn't go longer than 60 seconds, or 30 seconds if the commits complete particularly fast. It should be several minutes if that will meet your needs. When your commit interval is very low, Solr's caches can become useless, as you've noticed. TL;DR info: Your autoCommit settings have openSearcher set to false, so they do not matter for the problem you have described. I would probably increase that to 5 minutes rather than 15 seconds, but that is not very important here; 15 seconds for hard commits that don't open a new searcher is known to have a low impact on performance. Low impact isn't the same as NO impact, so I keep this interval long as well. Thanks, Shawn
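The "turn off ALL the Solr caches" step Daniel describes corresponds to solrconfig.xml entries along these lines. This is a sketch: setting the sizes to zero (or simply commenting the cache elements out) effectively disables them; the class names shown are the standard ones from the example configs.

```xml
<!-- Sketch: disabling the searcher-based caches for very low commit
     intervals. size="0" effectively turns a cache off;
     autowarmCount="0" disables warming as well. -->
<filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
<!-- documentCache cannot be autowarmed, so only the size matters. -->
<documentCache class="solr.LRUCache" size="0" initialSize="0"/>
```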
Re: Question on Solr Caching
Thanks Shawn,

Can you please redirect me to any wiki which describes (in detail) the differences between MMapDirectoryFactory and NRTCachingDirectoryFactory? I found this blog http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html very helpful; it describes MMapDirectory. I want to know about NRTCachingDirectoryFactory in the same detail.

Also, when I ran the REST request solr/admin/cores?action=STATUS, I got the result below (partial result only). I have set the directoryFactory to NRTCachingDirectoryFactory in solrconfig.xml, but the element below also shows MMapDirectory. Does this mean NRTCachingDirectory is using MMapDirectory internally?

<str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/instance/solr/collection1_shard2_replica1/data/index lockFactory=NativeFSLockFactory@/instance/solr/collection1_shard2_replica1/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>

What do maxCacheMB and maxMergeSizeMB indicate? How can I control them?

Thanks, Manohar

On Fri, Dec 5, 2014 at 11:04 AM, Shawn Heisey apa...@elyograg.org wrote: On 12/4/2014 10:06 PM, Manohar Sripada wrote:
> If you use MMapDirectory, Lucene will map the files into memory off-heap and the OS's disk cache will cache the files in memory for you. Don't use RAMDirectory; it's not better than MMapDirectory for any use I'm aware of.
>
> Will that mean the inverted index is cached in the OS's disk cache as well? The reason I am asking is that Solr searches this inverted index first to get the data. What if we could keep it in memory?

If you have enough memory, the operating system will cache *everything*. It does so by simply loading the data that's on the disk into RAM ... it is not aware that certain parts are the inverted index; it simply caches whatever data gets read. A subsequent read will come out of memory; the disk heads will never even move. If certain data in the index is never accessed, then it will not get cached.
http://en.wikipedia.org/wiki/Page_cache Thanks, Shawn
Re: Question on Solr Caching
On 12/8/2014 2:42 AM, Manohar Sripada wrote:
> Can you please redirect me to any wiki which describes (in detail) the differences between MMapDirectoryFactory and NRTCachingDirectoryFactory? I found this blog http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html very helpful; it describes MMapDirectory. I want to know about NRTCachingDirectoryFactory in the same detail.
>
> Also, when I ran the REST request solr/admin/cores?action=STATUS, I got the result below (partial result only). I have set the directoryFactory to NRTCachingDirectoryFactory in solrconfig.xml, but the element below also shows MMapDirectory. Does this mean NRTCachingDirectory is using MMapDirectory internally?
>
> <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/instance/solr/collection1_shard2_replica1/data/index lockFactory=NativeFSLockFactory@/instance/solr/collection1_shard2_replica1/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>
> What do maxCacheMB and maxMergeSizeMB indicate? How can I control them?

NRTCachingDirectoryFactory creates instances of NRTCachingDirectory. This is a wrapper on top of another Directory implementation; normally it wraps MMapDirectory, so you get all the MMap advantages. The javadoc for NRTCachingDirectory says that it "wraps a RAMDirectory around any provided delegate directory, to be used during NRT search." http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/store/NRTCachingDirectory.html

Further down in that javadoc, the constructor documentation has this to say: "We will cache a newly created output if 1) it's a flush or a merge and the estimated size of the merged segment is <= maxMergeSizeMB, and 2) the total cached bytes is <= maxCachedMB."

Basically, if a newly created or merged segment is small enough, it won't be written to disk right away; it will be saved into RAM until another cacheable segment won't fit in available RAM and the oldest cached segment must be flushed to disk.
Near Real Time search becomes easier. This DirectoryFactory implementation is the default in 4.x, so as I understand it, it's critically important for Solr to have a replayable transaction log ... without it, any data that is cached in RAM will be lost if the program crashes or exits. The main Solr example *does* have the transaction log enabled.

Thanks, Shawn
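The wrapping Shawn describes is visible in how the factory is declared. This is the stock declaration from the Solr 4.x example solrconfig.xml; NRTCachingDirectoryFactory then picks a delegate directory (usually MMapDirectory on 64-bit platforms) on its own.

```xml
<!-- Default in Solr 4.x example configs: NRTCachingDirectory wraps a
     delegate directory (normally MMapDirectory) and keeps small newly
     flushed or merged segments in RAM for cheap NRT reopens. -->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
```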
Question on Solr Caching
Hi,

I am working on implementing Solr in my product. I have a few questions on caching.

1. Do the posting list and term list of the index reside in memory? If not, how do I load them into memory? I don't want to load the entire data, as with the documentCache, and I don't want to use RAMDirectoryFactory either, as the data will be lost on restart.
2. For the filterCache, there is a way to specify in the query whether the filter should be cached or not. Similarly, is there a way I can specify the list of stored fields to be loaded into the documentCache? I know the documentCache is not associated with a query; just curious to know.
3. Similarly, is there a way I can specify the list of fields to be cached in the FieldCache?

Thanks, Manohar
Re: Question on Solr Caching
Hi, Manohar,

> 1. Do the posting list and term list of the index reside in memory? If not, how do I load them into memory? I don't want to load the entire data, as with the documentCache, and I don't want to use RAMDirectoryFactory either, as the data will be lost on restart.

If you use MMapDirectory, Lucene will map the files into memory off-heap and the OS's disk cache will cache the files in memory for you. Don't use RAMDirectory; it's not better than MMapDirectory for any use I'm aware of.

> 2. For the filterCache, there is a way to specify whether the filter should be cached or not in the query.

If you add {!cache=false} to your filter query, it will bypass the cache. I'm fairly certain it will not subsequently be cached.

> Similarly, is there a way I can specify the list of stored fields to be loaded into the documentCache?

If you have lazy loading enabled, the documentCache will only have the fields you asked for in it.

> 3. Similarly, is there a way I can specify the list of fields to be cached for the FieldCache?

You basically don't have much control over the FieldCache in Solr other than warming it with queries. You should check out this wiki page; it will probably answer some questions: https://wiki.apache.org/solr/SolrCaching

I hope that helps! Michael
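Michael's {!cache=false} tip, spelled out as request parameters (a sketch; the field name and range are invented for illustration):

```
q=*:*&fq={!cache=false}modified_date:[NOW/DAY-7DAYS TO NOW/DAY]
```

The local-params block goes at the very start of the fq value; that particular filter is then evaluated without being stored in the filterCache.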
Re: Question on Solr Caching
Thanks Michael for the response.

> If you use MMapDirectory, Lucene will map the files into memory off-heap and the OS's disk cache will cache the files in memory for you. Don't use RAMDirectory; it's not better than MMapDirectory for any use I'm aware of.

Will that mean the inverted index is cached in the OS's disk cache as well? The reason I am asking is that Solr searches this inverted index first to get the data. What if we could keep it in memory?

Thanks, Manohar

On Thu, Dec 4, 2014 at 10:54 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Hi, Manohar,

> 1. Do the posting list and term list of the index reside in memory? If not, how do I load them into memory? I don't want to load the entire data, as with the documentCache, and I don't want to use RAMDirectoryFactory either, as the data will be lost on restart.

If you use MMapDirectory, Lucene will map the files into memory off-heap and the OS's disk cache will cache the files in memory for you. Don't use RAMDirectory; it's not better than MMapDirectory for any use I'm aware of.

> 2. For the filterCache, there is a way to specify whether the filter should be cached or not in the query.

If you add {!cache=false} to your filter query, it will bypass the cache. I'm fairly certain it will not subsequently be cached.

> Similarly, is there a way I can specify the list of stored fields to be loaded into the documentCache?

If you have lazy loading enabled, the documentCache will only have the fields you asked for in it.

> 3. Similarly, is there a way I can specify the list of fields to be cached for the FieldCache?

You basically don't have much control over the FieldCache in Solr other than warming it with queries. You should check out this wiki page; it will probably answer some questions: https://wiki.apache.org/solr/SolrCaching

I hope that helps! Michael
Re: Question on Solr Caching
On 12/4/2014 10:06 PM, Manohar Sripada wrote:
> If you use MMapDirectory, Lucene will map the files into memory off-heap and the OS's disk cache will cache the files in memory for you. Don't use RAMDirectory; it's not better than MMapDirectory for any use I'm aware of.
>
> Will that mean the inverted index is cached in the OS's disk cache as well? The reason I am asking is that Solr searches this inverted index first to get the data. What if we could keep it in memory?

If you have enough memory, the operating system will cache *everything*. It does so by simply loading the data that's on the disk into RAM ... it is not aware that certain parts are the inverted index; it simply caches whatever data gets read. A subsequent read will come out of memory; the disk heads will never even move. If certain data in the index is never accessed, then it will not get cached. http://en.wikipedia.org/wiki/Page_cache

Thanks, Shawn
Re: Solr caching clarifications
Manuel:

First off, anything that Mike McCandless says about low-level details should override anything I say. The memory savings he's talking about there are actually something he tutored me in once in a chat. The savings there, as I understand it, aren't huge. For large sets I think it's a 25% savings (if I calculated right). But consider that even without those savings, 8 filterCache entries will be more than the entire structure that JIRA talks about.

As to your fq question: absolutely! Any yes/no clause that, as you say, doesn't contribute to the score is a candidate to be moved to an fq clause. There are a couple of things to be aware of though:
1> Be a little careful of using NOW. If you don't use it correctly, fq clauses will not be re-used. See: http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
2> How you usually do this is through the UI, not by the users entering a query. For instance, if you have a date-range picker, your app constructs the fq clause from that. Or you append fq clauses to the links you create when you display facets or

No, there's no automatic tool for this, and there's not likely to be one since there's no way to infer the intent. Say you put in a clause like q=a AND b. That scores things. It would give the same result set as q=*:*&fq=a&fq=b, which would compute no scores. How could a tool infer when this was or wasn't OK?

Best,
Erick

On Sun, Jul 14, 2013 at 6:10 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Alright, thanks Erick. For the question about memory usage of merges, taken from Mike McCandless' blog: "The big thing that stays in RAM is a logical int[] mapping old docIDs to new docIDs, but in more recent versions of Lucene (4.x) we use a much more efficient structure than a simple int[] ... see https://issues.apache.org/jira/browse/LUCENE-2357 How much RAM is required is mostly a function of how many documents (lots of tiny docs use more RAM than fewer huge docs)."
A related clarification: as my users are not aware of the fq possibility, I was wondering how to make the best use of the filter cache. Would it be efficient to implicitly transform their queries into filter queries on fields that are boolean searches (date ranges etc. that do not affect the score of a document)? Is this a good practice? Is there any plugin for a query parser that does it?

Inline:

On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hello, as a result of frequent Java OOM exceptions, I'm trying to investigate the Solr JVM heap usage more deeply. Please correct me if I am mistaken; this is my understanding of the uses of the heap (per replica on a Solr instance):
1. Buffers for indexing - bounded by ramBufferSize
2. Solr caches
3. Segment merges
4. Miscellaneous - buffers for tlogs, servlet overhead, etc.

Particularly I'm concerned by Solr caches and segment merges.

1. How memory-consuming (bytes per doc) are filterCache entries (BitDocSet) and queryResultCache entries (DocList)? I understand it is related to the skip spaces between the doc IDs that match (so it's not saved as a bitmap). But basically, is every ID saved as a Java int?

Different beasts. The filterCache consumes, essentially, maxDoc/8 bytes per entry (you can get the maxDoc number from your Solr admin page), plus some overhead for storing the fq text, but that's usually not much. This is for each entry, up to the configured size. The queryResultCache is usually trivial unless you've configured it extravagantly; it's the query string length + queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml).

2. queryResultMaxDocsCached - (for example = 100) means that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache?

It's just a limit on the queryResultCache entry size as far as I can tell. But again, this cache is relatively small; I'd be surprised if it used significant resources.

3.
DocumentCache - written on the wiki it should be greater than max_results*concurrent_queries. Max result is just the num of rows displayed (rows-start) param, right? Not the queryResultWindow. Yes. This is a cache (I think) for the _contents_ of the documents you'll be returning, to be manipulated by various components during the life of the query. 4. LazyFieldLoading=true - when querying for id's only (fl=id) will this cache be used? (at the expense of eviction of docs that were already loaded with stored fields) Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one is usually working with 20 or so docs this is usually a small amount of memory. 5. How large is the heap used by mergings? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos *.doc etc, half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge?
Re: Solr caching clarifications
Great explanation and article. Yes, this buffer for merges seems very small, and still optimized. That's impressive.
Re: Solr caching clarifications
Alright, thanks Erick. For the question about memory usage of merges, taken from Mike McCandless' blog: "The big thing that stays in RAM is a logical int[] mapping old docIDs to new docIDs, but in more recent versions of Lucene (4.x) we use a much more efficient structure than a simple int[] ... see https://issues.apache.org/jira/browse/LUCENE-2357 How much RAM is required is mostly a function of how many documents (lots of tiny docs use more RAM than fewer huge docs)." A related clarification: As my users are not aware of the fq possibility, I was wondering how I can make the best of this filter cache. Would it be efficient to implicitly transform their query into a filter query on fields that are boolean searches (date range etc. that do not affect the score of a document)? Is this a good practice? Is there any plugin for a query parser that does this? Inline On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, As a result of frequent java OOM exceptions, I am trying to investigate more into the solr jvm memory heap usage. Please correct me if I am mistaken; this is my understanding of usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize 2. Solr caches 3. Segment merge 4. Miscellaneous - buffers for Tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How memory-consuming (bytes per doc) are FilterCaches (bitDocSet) and queryResultCaches (DocList)? I understand it is related to the skip spaces between doc id's that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you can get the maxDoc number from your Solr admin page). Plus some overhead for storing the fq text, but that's usually not much. This is for each entry, up to the configured size. queryResultCache is usually trivial unless you've configured it extravagantly.
It's the query string length + queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml). 2. QueryResultMaxDocsCached - (for example = 100) means that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? It's just a limit on the queryResultCache entry size as far as I can tell. But again this cache is relatively small, I'd be surprised if it used significant resources. 3. DocumentCache - written on the wiki it should be greater than max_results*concurrent_queries. Max result is just the num of rows displayed (rows-start) param, right? Not the queryResultWindow. Yes. This is a cache (I think) for the _contents_ of the documents you'll be returning, to be manipulated by various components during the life of the query. 4. LazyFieldLoading=true - when querying for id's only (fl=id) will this cache be used? (at the expense of eviction of docs that were already loaded with stored fields) Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one is usually working with 20 or so docs this is usually a small amount of memory. 5. How large is the heap used by mergings? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos *.doc etc, half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Again, I don't think this is much of a memory consumer, although I confess I don't know the internals. Merging is mostly about I/O. Thanks in advance, Manu But take a look at the admin page, you can see how much memory various caches are using by looking at the plugins/stats section. Best, Erick
Re: Solr caching clarifications
Inline On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, As a result of frequent java OOM exceptions, I am trying to investigate more into the solr jvm memory heap usage. Please correct me if I am mistaken; this is my understanding of usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize 2. Solr caches 3. Segment merge 4. Miscellaneous - buffers for Tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How memory-consuming (bytes per doc) are FilterCaches (bitDocSet) and queryResultCaches (DocList)? I understand it is related to the skip spaces between doc id's that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you can get the maxDoc number from your Solr admin page). Plus some overhead for storing the fq text, but that's usually not much. This is for each entry, up to the configured size. queryResultCache is usually trivial unless you've configured it extravagantly. It's the query string length + queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml). 2. QueryResultMaxDocsCached - (for example = 100) means that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? It's just a limit on the queryResultCache entry size as far as I can tell. But again this cache is relatively small, I'd be surprised if it used significant resources. 3. DocumentCache - written on the wiki it should be greater than max_results*concurrent_queries. Max result is just the num of rows displayed (rows-start) param, right? Not the queryResultWindow. Yes. This is a cache (I think) for the _contents_ of the documents you'll be returning, to be manipulated by various components during the life of the query. 4.
LazyFieldLoading=true - when querying for id's only (fl=id) will this cache be used? (at the expense of eviction of docs that were already loaded with stored fields) Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one is usually working with 20 or so docs this is usually a small amount of memory. 5. How large is the heap used by mergings? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos *.doc etc, half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Again, I don't think this is much of a memory consumer, although I confess I don't know the internals. Merging is mostly about I/O. Thanks in advance, Manu But take a look at the admin page, you can see how much memory various caches are using by looking at the plugins/stats section. Best, Erick
Solr caching clarifications
Hello, As a result of frequent java OOM exceptions, I am trying to investigate more into the solr jvm memory heap usage. Please correct me if I am mistaken; this is my understanding of usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize 2. Solr caches 3. Segment merge 4. Miscellaneous - buffers for Tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How memory-consuming (bytes per doc) are FilterCaches (bitDocSet) and queryResultCaches (DocList)? I understand it is related to the skip spaces between doc id's that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? 2. QueryResultMaxDocsCached - (for example = 100) means that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? 3. DocumentCache - written on the wiki it should be greater than max_results*concurrent_queries. Max result is just the num of rows displayed (rows-start) param, right? Not the queryResultWindow. 4. LazyFieldLoading=true - when querying for id's only (fl=id) will this cache be used? (at the expense of eviction of docs that were already loaded with stored fields) 5. How large is the heap used by mergings? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos *.doc etc, half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Thanks in advance, Manu
Solr Caching
I've just started to read about Solr caching. I want to learn one thing. Let's assume that I have given 4 GB of RAM to my Solr application and I have 10 GB of RAM in total. When Solr's caching mechanism starts to work, does it use memory from that 4 GB part, or does it let the operating system do the caching from the 6 GB of RAM that remains outside the Solr application?
Re: Solr Caching
On Apr 17, 2013, at 3:09 PM, Furkan KAMACI wrote: I've just started to read about Solr caching. I want to learn one thing. Let's assume that I have given 4 GB of RAM to my Solr application and I have 10 GB of RAM in total. When Solr's caching mechanism starts to work, does it use memory from that 4 GB part, or does it let the operating system do the caching from the 6 GB of RAM that remains outside the Solr application? Both. Solr manages caches of Java objects. These are stored in the Java heap. The OS manages caches of files. These are stored in file buffers managed by the OS. All are in RAM. wunder -- Walter Underwood wun...@wunderwood.org
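Walter's split can be made concrete with a little arithmetic, using the hypothetical 10 GB machine and 4 GB heap from the question (the helper is illustrative only):

```python
def memory_split(total_ram_gb, jvm_heap_gb):
    """The two layers described above: Solr's object caches live inside
    the JVM heap; the remaining RAM is what the OS can use for its own
    file-buffer cache of the on-disk index files."""
    assert jvm_heap_gb <= total_ram_gb
    return {
        "jvm_heap_gb": jvm_heap_gb,                      # Solr object caches
        "os_file_cache_gb": total_ram_gb - jvm_heap_gb,  # OS index file cache
    }

print(memory_split(10, 4))  # the 10 GB machine / 4 GB heap case
```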
Re: Solr Caching - how to tune, how much to increase, and any tips on using Solr with JDK7 and G1 GC?
4.0 is significantly more efficient memory-wise, both in the usage and number of objects allocated. See: http://searchhub.org/dev/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/ Erick On Sun, Sep 30, 2012 at 12:25 AM, varun srivastava varunmail...@gmail.com wrote: Hi Erick, You mentioned that for 4.0 the memory pattern is much different than 3.X. Can you elaborate whether it's worse or better? Does 4.0 tend to use more memory for a similar index size as compared to 3.X? Thanks, Varun On Sat, Sep 29, 2012 at 1:58 PM, Erick Erickson erickerick...@gmail.com wrote: Well, I haven't had experience with JDK7, so I'll skip that part... But about caches. First, as far as memory is concerned, be sure to read Uwe's blog about MMapDirectory here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html As to the caches. Be a little careful here. Getting high hit rates on _all_ your caches is a waste. filterCache. This is the exception, you want as high a hit ratio as you can get for this one, it's where the results of all the fq= clauses go and is a major factor in speeding up QPS. queryResultCache. Hmmm, given the lack of updates to your index, this one may actually get more hits than I'd expect. But it's a very cheap cache memory-wise. Think of it as a map where the key is the query and the value is an array of queryResultWindowSize longs (document IDs). It's really intended for paging mostly. It's also often the case that the chances of the exact same query (except for start and rows) being issued is actually relatively small. As always YMMV. I usually see hit rates on this cache < 10%. Evictions merely mean it's been around a long time, bumping the size of this cache probably won't affect the hit rate unless your app somehow submits just a few queries. documentCache. Again, this often doesn't have a great hit ratio. Its main use as I understand it is to keep various parts of a query component chain from having to re-access the disk.
Each element in a query component is completely separate from the others, so if two or more components want values from the doc, having them cached is useful. The usual recommendation is (#docs returned to user) * (expected simultaneous queries), where # docs returned to user is really the rows value. One of the consequences of having huge amounts of memory allocated to the JVM can be really long garbage collections. They happen less frequently but have more work to do when they happen. Oh, and when you start using 4.0, the memory patterns are much different... Finally, here's a great post on solr memory tuning; too bad the image links are broken... http://searchhub.org/dev/2011/03/27/garbage-collection-bootcamp-1-0/ Best, Erick On Sat, Sep 29, 2012 at 3:08 PM, Aaron Daubman daub...@gmail.com wrote: Greetings, I've recently moved to running some of our Solr (3.6.1) instances using JDK 7u7 with the G1 GC (playing with max pauses in the 20 to 100ms range). By and large, it has been working well (or, perhaps I should say that without requiring much tuning it works much better in general than my haphazard attempts to tune CMS). I have two instances in particular, one with a heap size of 14G and one with a heap size of 60G. I'm attempting to squeeze out additional performance by increasing Solr's cache sizes (I am still seeing the hit ratio go up as I increase max size and decrease the number of evictions), and am guessing this is the cause of some recent situations where the 14G instance especially eventually (12-24 hrs later under 100s of queries per minute) makes it to 80%-90% of the heap and then spirals into major GC with long-pause territory. I am wondering: 1) if anybody has experience tuning the G1 GC, especially for use with Solr (what are decent max-pause times to use?) 2) how to better tune Solr's cache sizes - e.g.
how to even tell the actual amount of memory used by each cache (not # entries as the stats show, but # bits) 3) if there are any guidelines on when increasing a cache's size (even if it does continue to increase the hit ratio) runs into the law of diminishing returns or even starts to hurt - e.g. if the document cache has a current maxSize of 65536 and has seen 4409275 evictions, and currently has a hit ratio of 0.74, should the max be increased further? If so, how much ram needs to be added to the heap, and how much larger should its max size be made? I should mention that these solr instances are read-only (so cache is probably more valuable than in other scenarios - we only invalidate the searcher every 20-24hrs or so) and are also backed with indexes (6G and 70G for the 14G and 60G heap sizes) on IODrives, so I'm not as concerned about leaving RAM for linux to cache the index files (I'd much rather actually cache the post-transformed values). Thanks as always, Aaron
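The documentCache rule of thumb quoted in the reply above, (# docs returned to the user) * (expected simultaneous queries), is simple enough to sketch; the helper name and the example numbers are made up for illustration:

```python
def document_cache_size(rows, concurrent_queries):
    """Rule-of-thumb documentCache sizing: (# docs returned to the user,
    i.e. the rows value) times the expected number of simultaneous
    queries, so in-flight queries don't evict each other's documents."""
    return rows * concurrent_queries

# e.g. pages of 20 results and ~50 queries in flight at once:
print(document_cache_size(20, 50))  # -> 1000
```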
Solr Caching - how to tune, how much to increase, and any tips on using Solr with JDK7 and G1 GC?
Greetings, I've recently moved to running some of our Solr (3.6.1) instances using JDK 7u7 with the G1 GC (playing with max pauses in the 20 to 100ms range). By and large, it has been working well (or, perhaps I should say that without requiring much tuning it works much better in general than my haphazard attempts to tune CMS). I have two instances in particular, one with a heap size of 14G and one with a heap size of 60G. I'm attempting to squeeze out additional performance by increasing Solr's cache sizes (I am still seeing the hit ratio go up as I increase max size and decrease the number of evictions), and am guessing this is the cause of some recent situations where the 14G instance especially eventually (12-24 hrs later under 100s of queries per minute) makes it to 80%-90% of the heap and then spirals into major GC with long-pause territory. I am wondering: 1) if anybody has experience tuning the G1 GC, especially for use with Solr (what are decent max-pause times to use?) 2) how to better tune Solr's cache sizes - e.g. how to even tell the actual amount of memory used by each cache (not # entries as the stats show, but # bits) 3) if there are any guidelines on when increasing a cache's size (even if it does continue to increase the hit ratio) runs into the law of diminishing returns or even starts to hurt - e.g. if the document cache has a current maxSize of 65536 and has seen 4409275 evictions, and currently has a hit ratio of 0.74, should the max be increased further? If so, how much ram needs to be added to the heap, and how much larger should its max size be made?
I should mention that these solr instances are read-only (so cache is probably more valuable than in other scenarios - we only invalidate the searcher every 20-24hrs or so) and are also backed with indexes (6G and 70G for the 14G and 60G heap sizes) on IODrives, so I'm not as concerned about leaving RAM for linux to cache the index files (I'd much rather actually cache the post-transformed values). Thanks as always, Aaron
Re: Solr Caching - how to tune, how much to increase, and any tips on using Solr with JDK7 and G1 GC?
Well, I haven't had experience with JDK7, so I'll skip that part... But about caches. First, as far as memory is concerned, be sure to read Uwe's blog about MMapDirectory here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html As to the caches. Be a little careful here. Getting high hit rates on _all_ your caches is a waste. filterCache. This is the exception, you want as high a hit ratio as you can get for this one, it's where the results of all the fq= clauses go and is a major factor in speeding up QPS. queryResultCache. Hmmm, given the lack of updates to your index, this one may actually get more hits than I'd expect. But it's a very cheap cache memory-wise. Think of it as a map where the key is the query and the value is an array of queryResultWindowSize longs (document IDs). It's really intended for paging mostly. It's also often the case that the chances of the exact same query (except for start and rows) being issued is actually relatively small. As always YMMV. I usually see hit rates on this cache < 10%. Evictions merely mean it's been around a long time, bumping the size of this cache probably won't affect the hit rate unless your app somehow submits just a few queries. documentCache. Again, this often doesn't have a great hit ratio. Its main use as I understand it is to keep various parts of a query component chain from having to re-access the disk. Each element in a query component is completely separate from the others, so if two or more components want values from the doc, having them cached is useful. The usual recommendation is (#docs returned to user) * (expected simultaneous queries), where # docs returned to user is really the rows value. One of the consequences of having huge amounts of memory allocated to the JVM can be really long garbage collections. They happen less frequently but have more work to do when they happen. Oh, and when you start using 4.0, the memory patterns are much different...
Finally, here's a great post on solr memory tuning; too bad the image links are broken... http://searchhub.org/dev/2011/03/27/garbage-collection-bootcamp-1-0/ Best, Erick On Sat, Sep 29, 2012 at 3:08 PM, Aaron Daubman daub...@gmail.com wrote: Greetings, I've recently moved to running some of our Solr (3.6.1) instances using JDK 7u7 with the G1 GC (playing with max pauses in the 20 to 100ms range). By and large, it has been working well (or, perhaps I should say that without requiring much tuning it works much better in general than my haphazard attempts to tune CMS). I have two instances in particular, one with a heap size of 14G and one with a heap size of 60G. I'm attempting to squeeze out additional performance by increasing Solr's cache sizes (I am still seeing the hit ratio go up as I increase max size and decrease the number of evictions), and am guessing this is the cause of some recent situations where the 14G instance especially eventually (12-24 hrs later under 100s of queries per minute) makes it to 80%-90% of the heap and then spirals into major GC with long-pause territory. I am wondering: 1) if anybody has experience tuning the G1 GC, especially for use with Solr (what are decent max-pause times to use?) 2) how to better tune Solr's cache sizes - e.g. how to even tell the actual amount of memory used by each cache (not # entries as the stats show, but # bits) 3) if there are any guidelines on when increasing a cache's size (even if it does continue to increase the hit ratio) runs into the law of diminishing returns or even starts to hurt - e.g. if the document cache has a current maxSize of 65536 and has seen 4409275 evictions, and currently has a hit ratio of 0.74, should the max be increased further? If so, how much ram needs to be added to the heap, and how much larger should its max size be made?
I should mention that these solr instances are read-only (so cache is probably more valuable than in other scenarios - we only invalidate the searcher every 20-24hrs or so) and are also backed with indexes (6G and 70G for the 14G and 60G heap sizes) on IODrives, so I'm not as concerned about leaving RAM for linux to cache the index files (I'd much rather actually cache the post-transformed values). Thanks as always, Aaron
Re: Solr Caching - how to tune, how much to increase, and any tips on using Solr with JDK7 and G1 GC?
Hi Erick, You mentioned that for 4.0 the memory pattern is much different than 3.X. Can you elaborate whether it's worse or better? Does 4.0 tend to use more memory for a similar index size as compared to 3.X? Thanks, Varun On Sat, Sep 29, 2012 at 1:58 PM, Erick Erickson erickerick...@gmail.com wrote: Well, I haven't had experience with JDK7, so I'll skip that part... But about caches. First, as far as memory is concerned, be sure to read Uwe's blog about MMapDirectory here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html As to the caches. Be a little careful here. Getting high hit rates on _all_ your caches is a waste. filterCache. This is the exception, you want as high a hit ratio as you can get for this one, it's where the results of all the fq= clauses go and is a major factor in speeding up QPS. queryResultCache. Hmmm, given the lack of updates to your index, this one may actually get more hits than I'd expect. But it's a very cheap cache memory-wise. Think of it as a map where the key is the query and the value is an array of queryResultWindowSize longs (document IDs). It's really intended for paging mostly. It's also often the case that the chances of the exact same query (except for start and rows) being issued is actually relatively small. As always YMMV. I usually see hit rates on this cache < 10%. Evictions merely mean it's been around a long time, bumping the size of this cache probably won't affect the hit rate unless your app somehow submits just a few queries. documentCache. Again, this often doesn't have a great hit ratio. Its main use as I understand it is to keep various parts of a query component chain from having to re-access the disk. Each element in a query component is completely separate from the others, so if two or more components want values from the doc, having them cached is useful.
The usual recommendation is (#docs returned to user) * (expected simultaneous queries), where # docs returned to user is really the rows value. One of the consequences of having huge amounts of memory allocated to the JVM can be really long garbage collections. They happen less frequently but have more work to do when they happen. Oh, and when you start using 4.0, the memory patterns are much different... Finally, here's a great post on solr memory tuning; too bad the image links are broken... http://searchhub.org/dev/2011/03/27/garbage-collection-bootcamp-1-0/ Best, Erick On Sat, Sep 29, 2012 at 3:08 PM, Aaron Daubman daub...@gmail.com wrote: Greetings, I've recently moved to running some of our Solr (3.6.1) instances using JDK 7u7 with the G1 GC (playing with max pauses in the 20 to 100ms range). By and large, it has been working well (or, perhaps I should say that without requiring much tuning it works much better in general than my haphazard attempts to tune CMS). I have two instances in particular, one with a heap size of 14G and one with a heap size of 60G. I'm attempting to squeeze out additional performance by increasing Solr's cache sizes (I am still seeing the hit ratio go up as I increase max size and decrease the number of evictions), and am guessing this is the cause of some recent situations where the 14G instance especially eventually (12-24 hrs later under 100s of queries per minute) makes it to 80%-90% of the heap and then spirals into major GC with long-pause territory. I am wondering: 1) if anybody has experience tuning the G1 GC, especially for use with Solr (what are decent max-pause times to use?) 2) how to better tune Solr's cache sizes - e.g. how to even tell the actual amount of memory used by each cache (not # entries as the stats show, but # bits) 3) if there are any guidelines on when increasing a cache's size (even if it does continue to increase the hit ratio) runs into the law of diminishing returns or even starts to hurt - e.g.
if the document cache has a current maxSize of 65536 and has seen 4409275 evictions, and currently has a hit ratio of 0.74, should the max be increased further? If so, how much ram needs to be added to the heap, and how much larger should its max size be made? I should mention that these solr instances are read-only (so cache is probably more valuable than in other scenarios - we only invalidate the searcher every 20-24hrs or so) and are also backed with indexes (6G and 70G for the 14G and 60G heap sizes) on IODrives, so I'm not as concerned about leaving RAM for linux to cache the index files (I'd much rather actually cache the post-transformed values). Thanks as always, Aaron
Re: Solr caching memory consumption Problem
Hello friends, I am using DIH for Solr indexing. I have 60 million records in SQL which need to be uploaded to Solr. When I start caching it works smoothly and memory consumption is normal, but after some time memory consumption climbs steadily and the process reaches more than 6 GB. That is the reason I am not able to cache my data. Please advise me if anything needs to be done in the Solr or Tomcat configuration. This will be very helpful for me. - Regards, Suneel Pandey Sr. Software Developer -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-caching-memory-consumption-Problem-tp3873158p3877081.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr caching memory consumption Problem
On 3/31/2012 4:30 AM, Suneel wrote: Hello friends, I am using DIH for Solr indexing. I have 60 million records in SQL which need to be uploaded to Solr. When I start caching it works smoothly and memory consumption is normal, but after some time memory consumption climbs steadily and the process reaches more than 6 GB. That is the reason I am not able to cache my data. Please advise me if anything needs to be done in the Solr or Tomcat configuration. I saw your later message about virtual memory and the directoryFactory - most of the time it is best to go with the default (solr.StandardDirectoryFactory), which you can do by specifying it explicitly or by leaving that configuration out. When you talk about caching, are you talking about Solr's caches or OS/process memory and disk cache? If you are talking about the caches that you can configure in solrconfig.xml (filterCache, queryResultCache, and documentCache), you should not be trying to cache large portions of your index there. I have over 11 million documents in each of my index shards (68 million for the whole index) and my numbers for those three caches are 64, 512, and 16384, with autoWarm counts of 4 and 32, since the documentCache doesn't directly support warming. If you are talking about how much memory Windows says the Java process is taking up, take a look at the replies you have already gotten on your Virtual Memory message. As Erick and Michael told you, if you are using the latest version (3.5) with the standard directoryFactory config, most of the memory that you are seeing there is because the OS is memory mapping your entire on-disk index, taking advantage of the OS disk cache to speed up disk access without actually allocating the memory involved. This is a good thing, even though the process numbers look bad. JConsole or another java memory tool can show you the true picture.
With 60 million records, even if those records are small, your Solr index will probably grow to several gigabytes. For the best performance, your server must have enough memory so that the entire index can fit into RAM, after discounting memory usage for the OS itself and the java process that contains Solr. If you can get MOST of the index into RAM, performance will likely still be acceptable. Your message implies that 6GB worries you very much, so I am guessing that your server has somewhere in the range of 4GB to 8GB of RAM, but your index is very much larger than this. You don't actually say whether you lose performance. Do you, or are you just worried about the memory usage? If Solr's query times start increasing, that is usually a good indicator that it is not healthy. Thanks, Shawn
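Shawn's sizing argument (the index should fit in the RAM left over after the OS and the Solr JVM heap are discounted) can be sketched as back-of-the-envelope arithmetic. The functions and the default overhead numbers below are illustrative assumptions, not measurements from any real deployment:

```python
def os_cache_headroom_gb(total_ram_gb, os_gb, jvm_heap_gb):
    """RAM left for the OS disk cache after the OS itself and the
    Solr JVM heap are accounted for."""
    return total_ram_gb - os_gb - jvm_heap_gb

def index_fits_in_ram(index_gb, total_ram_gb, os_gb=1, jvm_heap_gb=6):
    """True if the whole on-disk index can live in the OS file cache."""
    return index_gb <= os_cache_headroom_gb(total_ram_gb, os_gb, jvm_heap_gb)

# A 20 GB index on an 8 GB box clearly does not fit; a 32 GB box does:
print(index_fits_in_ram(20, 8))   # False: only 1 GB of headroom
print(index_fits_in_ram(20, 32))  # True: 25 GB of headroom
```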
Solr caching memory consumption Problem
Hello friends, I am using DIH for Solr indexing. I have 60 million records in SQL which need to be uploaded to Solr. When I start caching it works smoothly and memory consumption is normal, but after some time memory consumption climbs steadily and the process reaches more than 6 GB. That is the reason I am not able to cache my data. Please advise me if anything needs to be done in the Solr or Tomcat configuration. This will be very helpful for me. - Regards, Suneel Pandey Sr. Software Developer -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-caching-memory-consumption-Problem-tp3873158p3873158.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr caching problem
There are now two excellent books, Lucene In Action 2 and Solr 1.4 Enterprise Search Server, that describe the inner workings of these technologies and how they fit together. Otherwise, Solr and Lucene knowledge is only available in a fragmented form across many wiki pages, bug reports and email discussions. But the direct answer is: before you commit your changes you will not see them in queries. When you commit them, all caches are thrown away and rebuilt when you do the same queries you did before. This rebuilding process has various tools to control it in solrconfig.xml. On Wed, Sep 23, 2009 at 8:27 PM, satya tosatyaj...@gmail.com wrote: Is there any way to analyze or see which documents are getting cached by documentCache - <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> On Wed, Sep 23, 2009 at 8:10 AM, satya tosatyaj...@gmail.com wrote: First of all, thanks a lot for the clarification. Is there any way to see how this cache is working internally, what objects are being stored, and how much memory it consumes, so that we can get a clear picture in mind? And how to test the performance through the cache? On Tue, Sep 22, 2009 at 11:19 PM, Fuad Efendi f...@efendi.ca wrote: 1) Then do you mean, if we delete a particular doc, that it is going to be deleted from the cache also? When you delete a document, and then COMMIT your changes, new caches will be warmed up (and prepopulated by some key-value pairs from old instances), etc: <!-- documentCache caches Lucene Document objects (the stored fields for each document). Since Lucene internal document ids are transient, this cache will not be autowarmed. --> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> - this one won't be 'prepopulated'. 2) In solr, is the cache storing the entire document in memory, or only references to documents in memory?
There are many different cache instances, DocumentCache should store ID, Document pairs, etc -- Lance Norskog goks...@gmail.com
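[Editorial note: the discussion above — LRU eviction by `size`, caches discarded on commit, `autowarmCount` repopulating a fresh cache, and documentCache never autowarming because internal doc ids change — can be sketched with a toy model. This is an illustration of the concepts, not Solr's actual implementation.]

```python
from collections import OrderedDict

class LRUCache:
    """Toy sketch of solr.LRUCache: evicts least-recently-used entries
    once `size` is exceeded."""
    def __init__(self, size=512):
        self.size = size
        self.data = OrderedDict()

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)          # mark as most recently used
        while len(self.data) > self.size:
            self.data.popitem(last=False)   # drop least recently used

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

def commit(old_cache, autowarm_count):
    """On commit, Solr discards searcher caches and builds fresh ones,
    optionally replaying the most recent keys (autowarming). In Solr the
    values are regenerated against the new searcher; here we just copy
    them. documentCache uses autowarmCount=0 because Lucene internal doc
    ids are transient across searchers."""
    new_cache = LRUCache(old_cache.size)
    recent = list(old_cache.data.keys())[-autowarm_count:] if autowarm_count else []
    for key in recent:
        new_cache.put(key, old_cache.get(key))
    return new_cache

cache = LRUCache(size=4)
for doc_id in range(6):
    cache.put(doc_id, {"id": doc_id})
# only the 4 most recently used ids survive eviction
assert list(cache.data) == [2, 3, 4, 5]

warmed = commit(cache, autowarm_count=2)   # filterCache-style warming
assert list(warmed.data) == [4, 5]

cold = commit(cache, autowarm_count=0)     # documentCache behavior
assert len(cold.data) == 0
```

The sketch shows why frequent commits hurt: each one throws the caches away, and only caches with a nonzero autowarmCount get any head start on the next searcher.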
Re: solr caching problem
Is there any way to analyze or see which documents are getting cached by documentCache - <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> On Wed, Sep 23, 2009 at 8:10 AM, satya tosatyaj...@gmail.com wrote: First of all, thanks a lot for the clarification. Is there any way to see how this cache works internally, what objects are being stored, and how much memory it consumes, so that we can get a clear picture in mind? And how do we test performance through the cache? On Tue, Sep 22, 2009 at 11:19 PM, Fuad Efendi f...@efendi.ca wrote: 1) Then do you mean, if we delete a particular doc, it is going to be deleted from the cache also? When you delete a document, and then COMMIT your changes, new caches will be warmed up (and prepopulated with some key-value pairs from the old instances), etc: <!-- documentCache caches Lucene Document objects (the stored fields for each document). Since Lucene internal document ids are transient, this cache will not be autowarmed. --> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> - this one won't be 'prepopulated'. 2) In Solr, is the cache storing the entire document in memory, or only references to documents in memory?
solr caching problem
I configured the filter cache in solrconfig.xml as follows: <filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="4096"/> <useFilterForSortedQuery>true</useFilterForSortedQuery> as per http://wiki.apache.org/solr/SolrCaching#head-b6a7d51521d55fa0c89f2b576b2659f297f9 And executed a query as: http://localhost:8080/solr/select/?q=*:*&fq=id:(172704 TO 2079813)&sort=id asc But when I deleted the doc id:172704 and executed the query again, I did not find the same doc (172704) in my result.
Re: solr caching problem
Solr's caches should be transparent - they should only speed up queries, not change the result of queries. -Yonik http://www.lucidimagination.com On Tue, Sep 22, 2009 at 9:45 AM, satyasundar jena tosatyaj...@gmail.com wrote: I configured the filter cache in solrconfig.xml as follows: <filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="4096"/> <useFilterForSortedQuery>true</useFilterForSortedQuery> as per http://wiki.apache.org/solr/SolrCaching#head-b6a7d51521d55fa0c89f2b576b2659f297f9 And executed a query as: http://localhost:8080/solr/select/?q=*:*&fq=id:(172704 TO 2079813)&sort=id asc But when I deleted the doc id:172704 and executed the query again, I did not find the same doc (172704) in my result.
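[Editorial note: "transparent" here can be sketched with plain dicts — this is a toy model, not Solr internals. The filterCache conceptually maps each fq to the set of matching internal doc ids (a DocSet), so a repeated fq is answered from the cache; a delete followed by a commit discards the searcher's caches, so a deleted doc can never leak back into results.]

```python
# Toy index: doc id -> stored document (ids chosen to echo the thread's fq)
index = {i: {"id": i} for i in range(172704, 172710)}
filter_cache = {}  # fq -> frozenset of matching doc ids ("DocSet")

def docs_matching(fq):
    """Answer a range fq from the cache, computing it once per fq."""
    lo, hi = fq
    if fq not in filter_cache:
        filter_cache[fq] = {i for i in index if lo <= i <= hi}
    return filter_cache[fq]

assert 172704 in docs_matching((172704, 2079813))   # first query: cache miss
assert 172704 in docs_matching((172704, 2079813))   # repeat: served from cache

# Delete + commit: opening a new searcher discards the old caches,
# so the next query re-evaluates the fq and cannot return the deleted doc.
del index[172704]
filter_cache.clear()  # effectively what opening a new searcher does
assert 172704 not in docs_matching((172704, 2079813))
assert 172705 in docs_matching((172704, 2079813))
```

The observed behavior in the thread (the deleted doc disappearing after deletion) is exactly this transparency working as intended.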
Re: solr caching problem
1) Then do you mean, if we delete a particular doc, is it going to be deleted from the cache also? 2) In Solr, is the cache storing the entire document in memory, or only references to documents in memory? And how do we test this caching, after all? I'll be thankful for an elaboration. On Tue, Sep 22, 2009 at 8:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: Solr's caches should be transparent - they should only speed up queries, not change the result of queries. -Yonik http://www.lucidimagination.com On Tue, Sep 22, 2009 at 9:45 AM, satyasundar jena tosatyaj...@gmail.com wrote: I configured the filter cache in solrconfig.xml as follows: <filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="4096"/> <useFilterForSortedQuery>true</useFilterForSortedQuery> as per http://wiki.apache.org/solr/SolrCaching#head-b6a7d51521d55fa0c89f2b576b2659f297f9 And executed a query as: http://localhost:8080/solr/select/?q=*:*&fq=id:(172704 TO 2079813)&sort=id asc But when I deleted the doc id:172704 and executed the query again, I did not find the same doc (172704) in my result.
RE: solr caching problem
1) Then do you mean, if we delete a particular doc, it is going to be deleted from the cache also? When you delete a document, and then COMMIT your changes, new caches will be warmed up (and prepopulated with some key-value pairs from the old instances), etc: <!-- documentCache caches Lucene Document objects (the stored fields for each document). Since Lucene internal document ids are transient, this cache will not be autowarmed. --> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> - this one won't be 'prepopulated'. 2) In Solr, is the cache storing the entire document in memory, or only references to documents in memory? There are many different cache instances; DocumentCache should store ID, Document pairs, etc.
Re: solr caching problem
First of all, thanks a lot for the clarification. Is there any way to see how this cache works internally, what objects are being stored, and how much memory it consumes, so that we can get a clear picture in mind? And how do we test performance through the cache? On Tue, Sep 22, 2009 at 11:19 PM, Fuad Efendi f...@efendi.ca wrote: 1) Then do you mean, if we delete a particular doc, it is going to be deleted from the cache also? When you delete a document, and then COMMIT your changes, new caches will be warmed up (and prepopulated with some key-value pairs from the old instances), etc: <!-- documentCache caches Lucene Document objects (the stored fields for each document). Since Lucene internal document ids are transient, this cache will not be autowarmed. --> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> - this one won't be 'prepopulated'. 2) In Solr, is the cache storing the entire document in memory, or only references to documents in memory? There are many different cache instances; DocumentCache should store ID, Document pairs, etc.
Contributions Needed: Faceting Performance, SOLR Caching
Users, Developers, Possible Contributors, Hi, Recently I did some code hacks and I am using frequency calculations over TermVectors instead of the default out-of-the-box DocSet intersections. It improves performance hundreds of times at the shopping engine http://www.tokenizer.org - please check http://issues.apache.org/jira/browse/SOLR-711 - I feel the term faceting (and the related architectural decision made for CNET several years ago) is completely wrong. Default SOLR response times: 30-180 seconds; with TermVector: 0.2 seconds (25 million documents, tokenized field). For a non-tokenized field it also looks natural to use frequency calculations, but I have not done it yet. Sorry... too busy with Liferay Portal contract assignments, http://www.linkedin.com/in/liferay Other possible performance improvements: create a safe concurrent cache for SOLR; you may check LingPipe, and also http://issues.apache.org/jira/browse/SOLR-665 and http://issues.apache.org/jira/browse/SOLR-667. Lucene developers are doing a great job removing synchronization in several places too, such as the isDeleted() method call... it would be nice to have an unsynchronized API version for read-only indexes. Thanks! -- View this message in context: http://www.nabble.com/Contributions-Needed%3A-Faceting-Performance%2C-SOLR-Caching-tp20058987p20058987.html Sent from the Solr - User mailing list archive at Nabble.com.
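[Editorial note: the performance claim above can be illustrated with a toy comparison — the data and document counts here are invented, not from the SOLR-711 patch. The default approach does one DocSet intersection per distinct facet term; a term-frequency pass instead walks each matching document's terms once, which scales with the result set rather than the vocabulary.]

```python
from collections import Counter

# Toy corpus: doc id -> tokenized field values
docs = {
    1: ["red", "shoe"],
    2: ["red", "hat"],
    3: ["blue", "shoe"],
    4: ["red", "shoe"],
}
matching = {1, 3, 4}  # doc ids matched by the main query

# Default-style faceting: one set intersection per distinct term
terms = {t for toks in docs.values() for t in toks}
facets_intersect = {
    t: len(matching & {d for d, toks in docs.items() if t in toks})
    for t in terms
}

# TermVector-style faceting: a single frequency-counting pass over
# only the matching documents
facets_freq = Counter(t for d in matching for t in docs[d])

assert facets_intersect["shoe"] == 3 and facets_intersect["hat"] == 0
# Nonzero counts agree between the two methods
assert {t: c for t, c in facets_intersect.items() if c} == dict(facets_freq)
```

With millions of documents and a large tokenized vocabulary, the per-term intersections dominate, which is consistent with the 30-180 s vs 0.2 s numbers reported above.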