Re: solr multicore vs sharding vs 1 big collection
On 8/4/2015 3:30 PM, Jay Potharaju wrote:
> For the last few days I have been trying to correlate the timeouts with GC. I noticed in the GC logs that full GC takes a long time once in a while. Does this mean that the jvm memory is too high or is it set too low?
> 1973953.560: [GC 4474277K->3300411K(4641280K), 0.0423129 secs]
> 1973960.674: [GC 4536894K->3371225K(4630016K), 0.0560341 secs]
> 1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]
> 1973990.516: [GC 4548268K->3405111K(5096448K), 0.0657788 secs]
> 1973998.191: [GC 4613934K->3527257K(5086208K), 0.1304232 secs]

Based on what I can see there, it looks like 6GB might be enough heap. Your low points are all in the 3GB range, which is only half of that. A 6GB heap is not very big in the Solr world.

Based on that GC log and my own experiences, I'm guessing that your GC isn't tuned. The default collector that Java chooses is *terrible* for Solr. Even just switching collectors to CMS or G1 will not improve the situation by itself; Solr requires extensive GC tuning for good performance.

The SolrPerformanceProblems wiki page that I pointed you to previously contains a little bit of info on GC tuning, and it also links to the following page, which is my personal page on the wiki and documents some of my garbage collection journey with Solr:

https://wiki.apache.org/solr/ShawnHeisey

Thanks,
Shawn
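As a concrete illustration of the kind of GC tuning Shawn describes: a CMS-based flag set along these lines would replace the default parallel collector. These are not his wiki's exact settings; every value here is an assumption to be validated against your own GC logs, not a recommendation for this specific system.

```
# Illustrative CMS tuning for a Solr JVM (values are starting points, not prescriptions)
-Xms6g -Xmx6g
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-XX:NewRatio=3
-XX:MaxTenuringThreshold=8
-XX:+ParallelRefProcEnabled
```

The intent of flags like these is to trade a little throughput for shorter, more predictable pauses, which is usually the right trade for a query-serving Solr node.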
Re: solr multicore vs sharding vs 1 big collection
For the last few days I have been trying to correlate the timeouts with GC. I noticed in the GC logs that full GC takes a long time once in a while. Does this mean that the jvm memory is too high or is it set too low?

[GC 4730643K->3552794K(4890112K), 0.0433146 secs]
1973853.751: [Full GC 3552794K->2926402K(4635136K), 0.3123954 secs]
1973864.170: [GC 4127554K->2972129K(4644864K), 0.0418248 secs]
1973873.341: [GC 4185569K->2990123K(4640256K), 0.0451723 secs]
1973882.452: [GC 4201770K->2999178K(4645888K), 0.0611839 secs]
1973890.684: [GC 4220298K->3010751K(4646400K), 0.0302890 secs]
1973900.539: [GC 4229514K->3015049K(4646912K), 0.0470857 secs]
1973911.179: [GC 4237193K->3040837K(4646912K), 0.0373900 secs]
1973920.822: [GC 4262981K->3072045K(4655104K), 0.0450480 secs]
1973927.136: [GC 4307501K->3129835K(4635648K), 0.0392559 secs]
1973933.057: [GC 4363058K->3178923K(4647936K), 0.0426612 secs]
1973940.981: [GC 4405163K->3210677K(4648960K), 0.0557622 secs]
1973946.680: [GC 4436917K->3239408K(4656128K), 0.0430889 secs]
1973953.560: [GC 4474277K->3300411K(4641280K), 0.0423129 secs]
1973960.674: [GC 4536894K->3371225K(4630016K), 0.0560341 secs]
1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]
1973990.516: [GC 4548268K->3405111K(5096448K), 0.0657788 secs]
1973998.191: [GC 4613934K->3527257K(5086208K), 0.1304232 secs]
1974006.505: [GC 4723801K->3597899K(5132800K), 0.0899599 secs]
1974014.748: [GC 4793955K->3654280K(5163008K), 0.0989430 secs]
1974025.349: [GC 4880823K->3672457K(5182464K), 0.0683296 secs]
1974037.517: [GC 4899721K->3681560K(5234688K), 0.1028356 secs]
1974050.066: [GC 4938520K->3718901K(5256192K), 0.0796073 secs]
1974061.466: [GC 4974356K->3726357K(5308928K), 0.1324846 secs]
1974071.726: [GC 5003687K->3757516K(5336064K), 0.0734227 secs]
1974081.917: [GC 5036492K->3777662K(5387264K), 0.1475958 secs]
1974091.853: [GC 5074558K->3800799K(5421056K), 0.0799311 secs]
1974101.882: [GC 5097363K->3846378K(5434880K), 0.3011178 secs]
1974109.234: [GC 5121936K->3930457K(5478912K), 0.0956342 secs]
1974116.082: [GC 5206361K->3974011K(5215744K), 0.1967284 secs]

Thanks
Jay

On Mon, Aug 3, 2015 at 1:53 PM, Bill Bell wrote:
> [...]
Re: solr multicore vs sharding vs 1 big collection
Yeah, a separate collection by month or year is good and can really help in this case.

Bill Bell
Sent from mobile

> On Aug 2, 2015, at 5:29 PM, Jay Potharaju wrote:
> [...]
Re: solr multicore vs sharding vs 1 big collection
There are two things that are likely to cause the timeouts you are seeing, I'd say.

Firstly, your server is overloaded - that can be handled by adding additional replicas. However, it doesn't seem like this is the case, because the second query works fine.

Secondly, you are hitting garbage collection issues. This seems more likely to me. You have 40m documents inside a 6GB heap. That seems relatively tight to me. What that means is that Java may well not have enough space to create all the objects it needs inside a single commit cycle, forcing a garbage collection which can cause application pauses, which would fit with what you are seeing.

I'd suggest using the jstat -gcutil command (I think I have that right) to watch the number of garbage collections taking place. You will quickly see from that if garbage collection is your issue. The simplistic remedy would be to allow your JVM a bit more memory.

The other concern I have is that Solr (and Lucene) is intended for high read/low write scenarios. Its index structure is highly tuned for this scenario. If you are doing a lot of writes, then you will be creating a lot of index churn, which will require more frequent merges, consuming both CPU and memory in the process.

It may be worth looking at *how* you use Solr, and seeing whether, for example, you can separate your documents into slow-moving and fast-moving parts, to better suit the Lucene index structures. Or consider whether a Lucene-based system is best for what you are attempting to achieve.

For garbage collection, see here for a good Solr-related write-up:

http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

Upayavira

On Mon, Aug 3, 2015, at 12:29 AM, Jay Potharaju wrote:
> [...]
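Besides watching collections live with jstat, the pauses can be quantified straight from the verbose GC log posted earlier in the thread. A small sketch, assuming the log format shown there; `summarize_gc` is a hypothetical helper, not part of Solr or the JDK:

```python
import re

# Matches lines like "1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]"
GC_LINE = re.compile(
    r"\[(Full GC|GC)\s+(\d+)K->(\d+)K\((\d+)K\),\s+([\d.]+)\s+secs\]"
)

def summarize_gc(log_text):
    """Return (max_pause_secs, total_pause_secs, full_gc_count) for a GC log."""
    pauses, full_count = [], 0
    for kind, _before, _after, _capacity, secs in GC_LINE.findall(log_text):
        pauses.append(float(secs))
        if kind == "Full GC":
            full_count += 1
    return max(pauses), sum(pauses), full_count

sample = """\
1973953.560: [GC 4474277K->3300411K(4641280K), 0.0423129 secs]
1973960.674: [GC 4536894K->3371225K(4630016K), 0.0560341 secs]
1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]
"""

max_pause, total, fulls = summarize_gc(sample)
print(f"max pause {max_pause:.1f}s across {fulls} full GC(s)")
```

Run over the full log, this makes the problem obvious: the ordinary collections are all well under a second, so the single 15.5-second full GC is what lines up with the 30-second client timeouts.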
Re: solr multicore vs sharding vs 1 big collection
Shawn,
Thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the timeout is the detrimental effect it will have on the user experience, therefore I can't increase it.

I have looked at the queries that threw errors; the next time I try them, everything seems to work fine. Not sure how to reproduce the error.

My concern with increasing the memory to 32GB is what happens when the index size grows over the next few months.

One of the other solutions I have been thinking about is to rebuild the index (weekly), create a new collection, and use it. Are there any good references for doing that?

Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey wrote:
> [...]

--
Thanks
Jay Potharaju
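On the rebuild-into-a-new-collection question: the usual SolrCloud pattern is to index into a fresh collection each cycle and then atomically repoint a collection alias at it with the Collections API, so queries never see the swap. A rough sketch of the calls involved; all collection and alias names here are made up for illustration:

```
# 1. Create next week's collection
http://server:8983/solr/admin/collections?action=CREATE&name=products_wk32&numShards=1&replicationFactor=4

# 2. Reindex into products_wk32, then repoint the alias that clients query
http://server:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_wk32

# 3. Once the new collection is verified, drop the old one
http://server:8983/solr/admin/collections?action=DELETE&name=products_wk31
```

Because a fresh build contains no deleted documents, this also sidesteps the growing deleted-document problem without running optimize on a live index.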
Re: solr multicore vs sharding vs 1 big collection
On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> The document contains around 30 fields, and about 15 of them have stored set to true. And these stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that has stayed around that percentage and has not come down.
> I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if that helps with deleted documents or not. It is currently set to 5.
>
> The server has 24 GB of memory, out of which memory consumption is around 23 GB normally, and the jvm is set to 6 GB. And I have noticed that the available memory on the server goes to 100 MB at times during a day.
> All the updates are run through DIH.

Using all available memory is completely normal operation for ANY operating system. If you hold up Windows as an example of one that doesn't ... it lies to you about "available" memory. All modern operating systems will utilize memory that is not explicitly allocated for the OS disk cache.

The disk cache will instantly give up any of the memory it is using for programs that request it. Linux doesn't try to hide the disk cache from you, but older versions of Windows do. In the newer versions of Windows that have the Resource Monitor, you can go there to see the actual memory usage, including the cache.

> Every day at least once I see the following error, which results in search errors on the front end of the site.
>
> ERROR org.apache.solr.servlet.SolrDispatchFilter -
> null:org.eclipse.jetty.io.EofException
>
> From what I have read these are mainly due to timeouts, and my timeout is set to 30 seconds and can't be set to a higher number. I was thinking that maybe, due to high memory usage, it sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific meaning. It means that the client disconnected before Solr responded to the request, so when Solr tried to respond (through Jetty), it found a closed TCP connection.

Client timeouts need to either be completely removed or set to a value much longer than any request will take. Five minutes is a good starting value.

If all your client timeouts are set to 30 seconds and you are seeing EofExceptions, that means that your requests are taking longer than 30 seconds, and you likely have some performance issues. It's also possible that some of your client timeouts are set a lot shorter than 30 seconds.

> My objective is to stop the errors; adding more memory to the server is not a good scaling strategy. That is why I was thinking maybe there is an issue with the way things are set up that needs to be revisited.

You're right that adding more memory to the servers is not a good scaling strategy for the general case ... but in this situation, I think it might be prudent. For your index and heap sizes, I would want the company to pay for at least 32GB of RAM.

Having said that ... I've seen Solr installs work well with a LOT less memory than the ideal. I don't know that adding more memory is necessary, unless your system (CPU, storage, and memory speeds) is particularly slow. Based on your document count and index size, your documents are quite small, so I think your memory size is probably good -- if the CPU, memory bus, and storage are very fast. If one or more of those subsystems aren't fast, then make up the difference with lots of memory.

Some light reading, where you will learn why I think 32GB is an ideal memory size for your system:

https://wiki.apache.org/solr/SolrPerformanceProblems

It is possible that your 6GB heap is not quite big enough for good performance, or that your GC is not well-tuned. These topics are also discussed on that wiki page. If you increase your heap size, then the likelihood of needing more memory in the system becomes greater, because there will be less memory available for the disk cache.

Thanks,
Shawn
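The 32GB figure follows from simple arithmetic: ideally the OS disk cache can hold the entire on-disk index alongside the JVM heap. A back-of-the-envelope check using the numbers from this thread; the 1GB allowance for the OS and other processes is my assumption:

```python
heap_gb = 6          # JVM heap reported in the thread
index_gb = 25        # on-disk index size reported in the thread
os_overhead_gb = 1   # rough allowance for the OS itself (an assumption)

ideal_ram_gb = heap_gb + index_gb + os_overhead_gb
current_ram_gb = 24
cache_available_gb = current_ram_gb - heap_gb - os_overhead_gb

print(f"ideal ~{ideal_ram_gb} GB RAM; current {current_ram_gb} GB leaves "
      f"~{cache_available_gb} GB of disk cache for a {index_gb} GB index")
```

With 24GB installed, only about 17GB is left to cache a 25GB index, which is why the box sits at 23GB used and why more RAM (or a faster disk subsystem) closes the gap.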
Re: solr multicore vs sharding vs 1 big collection
The document contains around 30 fields, and about 15 of them have stored set to true. And these stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that has stayed around that percentage and has not come down.

I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if that helps with deleted documents or not. It is currently set to 5.

The server has 24 GB of memory, out of which memory consumption is around 23 GB normally, and the jvm is set to 6 GB. And I have noticed that the available memory on the server goes to 100 MB at times during a day.

All the updates are run through DIH.

Every day at least once I see the following error, which results in search errors on the front end of the site.

ERROR org.apache.solr.servlet.SolrDispatchFilter -
null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts, and my timeout is set to 30 seconds and can't be set to a higher number. I was thinking that maybe, due to high memory usage, it sometimes leads to bad performance/errors.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy. That is why I was thinking maybe there is an issue with the way things are set up that needs to be revisited.

Thanks

On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey wrote:
> [...]

--
Thanks
Jay Potharaju
Re: solr multicore vs sharding vs 1 big collection
On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing.
> [...]

Creating 1000+ collections in SolrCloud is definitely problematic. If you need to choose between a lot of shards and a lot of collections, I would definitely go with a lot of shards. I would also want a lot of servers for an index with that many pieces.

https://issues.apache.org/jira/browse/SOLR-7191

I don't think it would matter how many collections or shards you have when it comes to how many deleted documents are in your index. If you want to clean up a large number of deletes in an index, the best option is an optimize. An optimize requires a large amount of disk I/O, so it can be extremely disruptive if the query volume is high. It should be done when the query volume is at its lowest. For the index you describe, a nightly or weekly optimize seems like a good option.

Aside from having a lot of deleted documents in your index, what kind of problems are you trying to solve?

Thanks,
Shawn
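The "almost 30%" deleted figure can be read straight off the maxDoc and doc-count stats given in the original question:

```python
max_doc = 40_000_000   # maxDoc: live + deleted-but-not-yet-merged-away documents
num_docs = 29_000_000  # doc count: live documents only

deleted = max_doc - num_docs
ratio = deleted / max_doc
print(f"{deleted:,} deleted docs = {ratio:.1%} of the index")
```

That works out to 11 million deleted documents, 27.5% of the index, which also suggests roughly a quarter of the 25GB on disk (and of the disk cache it occupies) is dead weight that an optimize or a rebuild would reclaim.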
Re: solr multicore vs sharding vs 1 big collection
40 million docs isn't really very many by modern standards, although if they're huge documents then that might be an issue.

So is this a single shard or multiple shards? If you're really facing performance issues, simply making a new collection with more than one shard (independent of how many replicas each has) is probably simplest.

The number of deleted documents really shouldn't be a problem. Typically the deleted documents are purged during the segment merging that happens automatically as you add documents. I often see 10-15% of the corpus consist of deleted documents. You can force these out by doing a force merge (aka optimization), but that is usually not recommended unless you have a strange situation where lots and lots of docs have been deleted, as measured by the "deleted docs" entry relative to the maxDoc number on the Admin UI page.

So show us what you're seeing that's concerning. Typically, especially on an index that's continually getting updates, it's adequate to just let the background segment merging take care of things.

Best,
Erick

On Sat, Aug 1, 2015 at 8:49 PM, Jay Potharaju wrote:
> [...]
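For anyone weighing the multi-shard route: in SolrCloud the shard count is fixed when the collection is created, via the Collections API. A rough sketch of the call; the collection name and counts below are illustrative guesses sized for the four servers mentioned in this thread, not a recommendation:

```
# Hypothetical: spread the index across the 4 existing servers, one shard each
http://server:8983/solr/admin/collections?action=CREATE&name=mycollection_sharded&numShards=4&replicationFactor=1
```

Documents would then be reindexed into the new collection, after which each server holds roughly a quarter of the 40 million documents.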
solr multicore vs sharding vs 1 big collection
Hi

I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing. The data in the collection is an amalgamation of more than 1,000 customers' records. The number of documents per customer is around 100,000 records on average.

Now that being said, I'm trying to get a handle on the growing deleted-document size. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size.

I have been thinking of splitting the data into multiple cores, one for each customer. This would allow me to manage the smaller collections easily and to create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem? What are my other alternatives to moving to multicore collections?

Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million
Replication: 4
4 servers in solrcloud.

Thanks
Jay