RE: huge shards (300GB each) and load balancing
Hi Dmitry,

>> The parameters you have mentioned -- termInfosIndexDivisor and
>> termIndexInterval -- are not found in the solr 1.4.1 config|schema.
>> Are you using SOLR 3.1?

I'm pretty sure that termIndexInterval (the ratio of the tii file to the tis file) is in the 1.4.1 example solrconfig.xml file, although I don't have a copy to check at the moment. We are using a 3.1 dev version. As for termInfosIndexDivisor, I'm also pretty sure it works with 1.4.1, but you might have to ask the list to be sure. As you can see from the blog posts, those settings really reduced our memory requirements. We haven't been doing faceting, so we expect memory use to go up again once we add faceting, but at least we are starting at a 4GB baseline instead of a 20-32GB baseline.

>> Did you do logical sharding or document-hash-based?

On the indexing side we just assign documents to a particular shard on a round-robin basis and use a database to keep track of which document is in which shard, so if we need to update a document we update the right shard. (See the "Forty days" article on the blog for a more detailed description and some diagrams.) We hope that this distributes the documents evenly enough to avoid problems with Solr's lack of global idf.

>> Do you have a load balancer between the front SOLR (or front entity) and shards?

As far as load balancing which shard is the head/front shard: again, our app layer just randomly picks one of the shards to be the head shard. We originally were going to run tests to determine whether it was better to have one dedicated machine configured as the head shard, but never got around to that. We have a very low query request rate, so we haven't had to look seriously at load balancing.

>> Do you do merging?

I'm not sure what you mean by "do you do merging". We are just using the default Solr distributed search. In theory our documents should be randomly distributed among the shards, so the lack of global idf should not hurt the merging process.
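The round-robin assignment described above can be sketched in a few lines. This is a hypothetical illustration, not HathiTrust's actual code: their real system records the document-to-shard mapping in a database, for which a plain dict stands in here.

```python
from itertools import cycle

# Shard names are invented for illustration.
SHARDS = ["shard1", "shard2", "shard3"]

class ShardRouter:
    def __init__(self, shards):
        self._next_shard = cycle(shards)  # round-robin iterator over shards
        self.doc_to_shard = {}            # stand-in for the tracking database

    def assign(self, doc_id):
        """A new document goes to the next shard in round-robin order."""
        shard = next(self._next_shard)
        self.doc_to_shard[doc_id] = shard
        return shard

    def shard_for_update(self, doc_id):
        """An update is routed to whichever shard already holds the doc."""
        return self.doc_to_shard[doc_id]

router = ShardRouter(SHARDS)
assignments = [router.assign(f"doc{i}") for i in range(6)]
```

With an even document stream, this keeps shard sizes close to equal, which is what makes the lack of global idf tolerable in practice.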
Andrzej Bialecki gave a recent presentation on Solr distributed search that talks about less-than-optimal results merging and some ideas for dealing with it:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf

>> Each shard currently is allocated max 12GB memory.

I'm curious about how much memory you leave to the OS for disk caching. Can you give any details about the number of shards per machine and the total memory on each machine?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

From: Dmitry Kan [dmitry....@gmail.com]
Sent: Tuesday, June 14, 2011 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: huge shards (300GB each) and load balancing

Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index size, and we have split our index over 10 shards (horizontal scaling, no replication). Each shard is currently allocated a max of 12GB memory. We use facet search a lot, and non-facet search with parameter values generated by facet search (hence a more focused search that hits a small portion of Solr documents).

The parameters you have mentioned -- termInfosIndexDivisor and termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you using SOLR 3.1? Did you do logical sharding or document-hash-based? Do you have a load balancer between the front SOLR (or front entity) and shards? Do you do merging?
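For reference, the "randomly pick a head shard" approach plus default Solr distributed search boils down to building a request like the following. A minimal sketch, with host names and the `ocr` field invented for illustration:

```python
import random
from urllib.parse import urlencode

# Hypothetical shard hosts; in the setup described above, any of
# them can serve as the head shard for a given query.
SHARD_HOSTS = ["solr1:8983", "solr2:8983", "solr3:8983"]

def distributed_query_url(q, shard_hosts):
    head = random.choice(shard_hosts)  # app layer picks the head at random
    params = urlencode({
        "q": q,
        # the head fans the query out to every shard listed here
        # and merges the per-shard results
        "shards": ",".join(f"{h}/solr" for h in shard_hosts),
    })
    return f"http://{head}/solr/select?{params}"

url = distributed_query_url("ocr:hello", SHARD_HOSTS)
```

Since the head does the extra merge work, dedicating one machine to that role (as Tom mentions considering) mainly matters at higher query rates.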
Re: huge shards (300GB each) and load balancing
Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index size, and we have split our index over 10 shards (horizontal scaling, no replication). Each shard is currently allocated a max of 12GB memory. We use facet search a lot, and non-facet search with parameter values generated by facet search (hence a more focused search that hits a small portion of Solr documents).

The parameters you have mentioned -- termInfosIndexDivisor and termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you using SOLR 3.1? Did you do logical sharding or document-hash-based? Do you have a load balancer between the front SOLR (or front entity) and shards? Do you do merging?

On Wed, Jun 8, 2011 at 10:23 PM, Burton-West, Tom wrote:
> Hi Dmitry,
>
> I am assuming you are splitting one very large index over multiple shards
> rather than replicating an index multiple times.
>
> Just for a point of comparison, I thought I would describe our experience
> with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.
> This is split over 4 machines with 3 shards per machine, and our shards are
> about 400-500GB. We get average response times of around 200 ms, with the
> 99th-percentile queries up around 1-2 seconds. We have a very low qps rate,
> i.e. less than 1 qps. We also index offline on a separate machine and
> update the indexes nightly.
>
> Some of the issues we have found with very large shards are:
> 1) Because of the very large shard size, I/O tends to be the bottleneck,
> with phrase queries containing common words being the slowest.
> 2) Because of the I/O issues, running cache-warming queries to get postings
> into the OS disk cache is important, as is leaving significant free memory
> for the OS to use for disk caching.
> 3) Because of the I/O issues, using stop words or CommonGrams produces a
> significant performance increase.
> 4) We have a huge number of unique terms in our indexes. In order to
> reduce the amount of memory needed by the in-memory terms index, we set the
> termInfosIndexDivisor to 8, which causes Solr to only load every 8th term
> from the tii file into memory. This reduced memory use from over 18GB to
> below 3GB and got rid of 30-second stop-the-world Java garbage collections.
> (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
> for details.) We later ran into memory problems when indexing, so instead
> changed the index-time parameter termIndexInterval from 128 to 1024.
>
> (More details here: http://www.hathitrust.org/blogs/large-scale-search)
>
> Tom Burton-West

--
Regards,
Dmitry Kan
RE: huge shards (300GB each) and load balancing
Hi Dmitry,

I am assuming you are splitting one very large index over multiple shards rather than replicating an index multiple times.

Just for a point of comparison, I thought I would describe our experience with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards. This is split over 4 machines with 3 shards per machine, and our shards are about 400-500GB. We get average response times of around 200 ms, with the 99th-percentile queries up around 1-2 seconds. We have a very low qps rate, i.e. less than 1 qps. We also index offline on a separate machine and update the indexes nightly.

Some of the issues we have found with very large shards are:
1) Because of the very large shard size, I/O tends to be the bottleneck, with phrase queries containing common words being the slowest.
2) Because of the I/O issues, running cache-warming queries to get postings into the OS disk cache is important, as is leaving significant free memory for the OS to use for disk caching.
3) Because of the I/O issues, using stop words or CommonGrams produces a significant performance increase.
4) We have a huge number of unique terms in our indexes. In order to reduce the amount of memory needed by the in-memory terms index, we set the termInfosIndexDivisor to 8, which causes Solr to only load every 8th term from the tii file into memory. This reduced memory use from over 18GB to below 3GB and got rid of 30-second stop-the-world Java garbage collections. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for details.) We later ran into memory problems when indexing, so instead changed the index-time parameter termIndexInterval from 128 to 1024.

(More details here: http://www.hathitrust.org/blogs/large-scale-search)

Tom Burton-West
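For anyone wanting to try these two settings, the following is a hedged sketch of what the solrconfig.xml entries looked like in the 1.4/3.x-era example configs (values taken from the message above; check your own version's example solrconfig.xml for the exact placement):

```xml
<!-- Index-time setting: write a tii entry for every 1024th term
     instead of the default 128. Takes effect on newly written
     segments, so a reindex/optimize is needed to apply it fully. -->
<indexDefaults>
  <termIndexInterval>1024</termIndexInterval>
</indexDefaults>

<!-- Search-time setting: load only every 8th entry of the tii file
     into the in-memory terms index, trading some lookup speed for
     a much smaller heap. -->
<indexReaderFactory name="IndexReaderFactory"
                    class="solr.StandardIndexReaderFactory">
  <int name="setTermIndexDivisor">8</int>
</indexReaderFactory>
```

The two are complementary: the interval shrinks the tii file on disk, while the divisor shrinks how much of it is held in memory.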
Re: huge shards (300GB each) and load balancing
Hi, Bill.

Thanks, always nice to have options!

Dmitry

On Wed, Jun 8, 2011 at 4:47 PM, Bill Bell wrote:
> Re Amazon ELB.
>
> This is not exactly true. The ELB does load-balance internal IPs, but the
> ELB IP address must be external. Still a major issue unless you use
> authentication. Nginx and others can also do load balancing.
>
> Bill Bell
> Sent from mobile
>
> On Jun 8, 2011, at 3:32 AM, "Upayavira" wrote:
>
> > On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" wrote:
> >> Hello list,
> >>
> >> Thanks for attending to my previous questions so far, have learnt a lot.
> >> Here is another one, I hope it will be interesting to answer.
> >>
> >> We run our SOLR shards and front-end SOLR on the Amazon high-end machines.
> >> Currently we have 6 shards with around 200GB in each. Currently we have only
> >> one front-end SOLR which, given a client query, redirects it to all the
> >> shards. Our shards are constantly growing, data is at times reindexed (in
> >> batches, which is done by removing a decent chunk before replacing it with
> >> updated data), and a constant stream of new data is coming every hour (it
> >> usually hits the latest shard in time, but can also hit other shards, which
> >> have older data). Since the front-end SOLR has started to be a SPOF, we are
> >> thinking about setting up some sort of load balancer.
> >>
> >> 1) Do you think ELB from Amazon is a good solution for starters? We don't
> >> need to maintain sessions between SOLR and client.
> >> 2) What other load balancers have been used specifically with SOLR?
> >>
> >> Overall: does SOLR scale to such size (200GB in an index), and what can be
> >> recommended as the next step -- resharding (cutting existing shards into
> >> smaller chunks), or replication?
> >
> > Really, it is going to be up to you to work out what works in your
> > situation. You may be reaching the limit of what a Lucene index can
> > handle, I don't know. If your query traffic is low, you might find that
> > two 100GB cores in a single instance perform better. But then, maybe
> > not! Or two 100GB shards on smaller Amazon hosts. But then, maybe not!
> > :-)
> >
> > The principal issue with Amazon's load balancers (at least when I was
> > using them last year) is that the ports that they balance need to be
> > public. You can't use an Amazon load balancer as an internal service
> > within a security group. For a service such as Solr, that can be a bit
> > of a killer.
> >
> > If they've fixed that issue, then they'd work fine (I used them quite
> > happily in another scenario).
> >
> > When looking at resolving single points of failure, handling search is
> > pretty easy (as you say, a stateless load balancer). You will need to
> > give more attention, though, to how you handle indexing.
> >
> > Hope that helps a bit!
> >
> > Upayavira
> >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source

--
Regards,
Dmitry Kan
Re: huge shards (300GB each) and load balancing
Re Amazon ELB.

This is not exactly true. The ELB does load-balance internal IPs, but the ELB IP address must be external. Still a major issue unless you use authentication. Nginx and others can also do load balancing.

Bill Bell
Sent from mobile

On Jun 8, 2011, at 3:32 AM, "Upayavira" wrote:

> On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" wrote:
>> Hello list,
>>
>> Thanks for attending to my previous questions so far, have learnt a lot.
>> Here is another one, I hope it will be interesting to answer.
>>
>> We run our SOLR shards and front-end SOLR on the Amazon high-end machines.
>> Currently we have 6 shards with around 200GB in each. Currently we have only
>> one front-end SOLR which, given a client query, redirects it to all the
>> shards. Our shards are constantly growing, data is at times reindexed (in
>> batches, which is done by removing a decent chunk before replacing it with
>> updated data), and a constant stream of new data is coming every hour (it
>> usually hits the latest shard in time, but can also hit other shards, which
>> have older data). Since the front-end SOLR has started to be a SPOF, we are
>> thinking about setting up some sort of load balancer.
>>
>> 1) Do you think ELB from Amazon is a good solution for starters? We don't
>> need to maintain sessions between SOLR and client.
>> 2) What other load balancers have been used specifically with SOLR?
>>
>> Overall: does SOLR scale to such size (200GB in an index), and what can be
>> recommended as the next step -- resharding (cutting existing shards into
>> smaller chunks), or replication?
>
> Really, it is going to be up to you to work out what works in your
> situation. You may be reaching the limit of what a Lucene index can
> handle, I don't know. If your query traffic is low, you might find that
> two 100GB cores in a single instance perform better. But then, maybe
> not! Or two 100GB shards on smaller Amazon hosts. But then, maybe not!
> :-)
>
> The principal issue with Amazon's load balancers (at least when I was
> using them last year) is that the ports that they balance need to be
> public. You can't use an Amazon load balancer as an internal service
> within a security group. For a service such as Solr, that can be a bit
> of a killer.
>
> If they've fixed that issue, then they'd work fine (I used them quite
> happily in another scenario).
>
> When looking at resolving single points of failure, handling search is
> pretty easy (as you say, a stateless load balancer). You will need to
> give more attention, though, to how you handle indexing.
>
> Hope that helps a bit!
>
> Upayavira
>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
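Since Nginx comes up in the thread as an ELB alternative, here is a minimal, hypothetical sketch of the kind of setup that avoids the public-port issue: a reverse proxy inside the security group, balancing across interchangeable Solr front ends. Host addresses and ports are invented for illustration.

```nginx
# Internal Solr front ends; nginx distributes requests
# round-robin across them by default.
upstream solr_frontends {
    server 10.0.0.11:8983;
    server 10.0.0.12:8983;
}

server {
    # Only this internal port needs to be reachable by clients;
    # the Solr instances themselves stay private.
    listen 80;

    location /solr/ {
        proxy_pass http://solr_frontends;
    }
}
```

Because Solr search requests are stateless, no session affinity is needed, which keeps the configuration this simple.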
Re: huge shards (300GB each) and load balancing
Hi Upayavira,

Thanks for sharing insights and experience on this. As we have 6 shards at the moment, it is pretty hard (= almost impossible) to keep them on a single box, so that's why we decided to shard. On the other hand, we have never tried a multicore architecture, so that's a good point, thanks.

On the indexing side, we do it rather straightforwardly, that is, by updating the online shards. This should hopefully be improved with an [offline update / http swap] system, as already now updating online 200GB shards at times produces OOM, freezing, and other issues.

Does someone have other experience / pointers to load-balancer software that has been tried with SOLR?

Dmitry

On Wed, Jun 8, 2011 at 12:32 PM, Upayavira wrote:
> On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" wrote:
> > Hello list,
> >
> > Thanks for attending to my previous questions so far, have learnt a lot.
> > Here is another one, I hope it will be interesting to answer.
> >
> > We run our SOLR shards and front-end SOLR on the Amazon high-end machines.
> > Currently we have 6 shards with around 200GB in each. Currently we have only
> > one front-end SOLR which, given a client query, redirects it to all the
> > shards. Our shards are constantly growing, data is at times reindexed (in
> > batches, which is done by removing a decent chunk before replacing it with
> > updated data), and a constant stream of new data is coming every hour (it
> > usually hits the latest shard in time, but can also hit other shards, which
> > have older data). Since the front-end SOLR has started to be a SPOF, we are
> > thinking about setting up some sort of load balancer.
> >
> > 1) Do you think ELB from Amazon is a good solution for starters? We don't
> > need to maintain sessions between SOLR and client.
> > 2) What other load balancers have been used specifically with SOLR?
> >
> > Overall: does SOLR scale to such size (200GB in an index), and what can be
> > recommended as the next step -- resharding (cutting existing shards into
> > smaller chunks), or replication?
>
> Really, it is going to be up to you to work out what works in your
> situation. You may be reaching the limit of what a Lucene index can
> handle, I don't know. If your query traffic is low, you might find that
> two 100GB cores in a single instance perform better. But then, maybe
> not! Or two 100GB shards on smaller Amazon hosts. But then, maybe not!
> :-)
>
> The principal issue with Amazon's load balancers (at least when I was
> using them last year) is that the ports that they balance need to be
> public. You can't use an Amazon load balancer as an internal service
> within a security group. For a service such as Solr, that can be a bit
> of a killer.
>
> If they've fixed that issue, then they'd work fine (I used them quite
> happily in another scenario).
>
> When looking at resolving single points of failure, handling search is
> pretty easy (as you say, a stateless load balancer). You will need to
> give more attention, though, to how you handle indexing.
>
> Hope that helps a bit!
>
> Upayavira
>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
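One way to implement the [offline update / http swap] idea mentioned above is Solr's CoreAdmin SWAP action: rebuild the index in a staging core offline, then atomically swap it with the live core so queries never hit a half-updated index. A small sketch that builds the admin request URL; the host and core names are hypothetical:

```python
from urllib.parse import urlencode

def swap_url(solr_host, live_core, staging_core):
    """Build the CoreAdmin URL that swaps a rebuilt staging core
    into the place of the core currently serving queries."""
    params = urlencode({
        "action": "SWAP",
        "core": live_core,      # core currently serving traffic
        "other": staging_core,  # freshly rebuilt offline core
    })
    return f"http://{solr_host}/solr/admin/cores?{params}"

url = swap_url("localhost:8983", "shard1-live", "shard1-staging")
```

After the swap, the old live core holds the stale index and can be reused as the staging core for the next rebuild, which also keeps OOM-prone indexing work off the serving core.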
Re: huge shards (300GB each) and load balancing
On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" wrote: > Hello list, > > Thanks for attending to my previous questions so far, have learnt a lot. > Here is another one, I hope it will be interesting to answer. > > > > We run our SOLR shards and front end SOLR on the Amazon high-end > machines. > Currently we have 6 shards with around 200GB in each. Currently we have > only > one front end SOLR which, given a client query, redirects it to all the > shards. Our shards are constantly growing, data is at times reindexed (in > batches, which is done by removing a decent chunk before replacing it > with > updated data), constant stream of new data is coming every hour (usually > hits the latest shard in time, but can also hit other shards, which have > older data). Since the front end SOLR has started to be a SPOF, we are > thinking about setting up some sort of load balancer. > > 1) do you think ELB from Amazon is a good solution for starters? We don't > need to maintain sessions between SOLR and client. > 2) What other load balancers have been used specifically with SOLR? > > > Overall: does SOLR scale to such size (200GB in an index) and what can be > recommended as next step -- resharding (cutting existing shards to > smaller > chunks), replication? Really, it is going to be up to you to work out what works in your situation. You may be reaching the limit of what a Lucene index can handle, don't know. If your query traffic is low, you might find that two 100Gb cores in a single instance performs better. But then, maybe not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not! :-) The principal issue with Amazon's load balancers (at least when I was using them last year) is that the ports that they balance need to be public. You can't use an Amazon load balancer as an internal service within a security group. For a service such as Solr, that can be a bit of a killer. 
If they've fixed that issue, then they'd work fine (I used them quite happily in another scenario). When looking at resolving single points of failure, handling search is pretty easy (as you say, stateless load balancer). You will need to give more attention though to how you handle it regarding indexing. Hope that helps a bit! Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source