RE: Solr maximum Optimal Index Size per Shard
Toke Eskildsen [t...@statsbiblioteket.dk] wrote: [Toke: SSDs with 2.7TB of index on a 256GB machine]

tl;dr: for small result sets (< 1M hits) on unwarmed searches with simple queries, response time is below 100ms. If we enable faceting with plain Solr, this jumps to about 1 second. I did a top on the machine and it says that 50GB is currently used for caching, so an 80GB (and probably less) machine would work fine for our 2.7TB index.

So we actually tried this: 3.6TB (4 shards) and 80GB of RAM, leaving a little less than 40GB for caching: 40GB / 3,600GB ~= 1% of the index size. This performed quite well, with faceting times still around 1 second and non-faceted search a lot lower. There's a writeup at http://sbdevel.wordpress.com/2014/06/17/terabyte-index-search-and-faceting-with-solr/

- Toke Eskildsen, State and University Library, Denmark
Re: Solr maximum Optimal Index Size per Shard
Hi Shawn,

Thanks for your response, I wanted to clarify a few things.

*Does that mean that for querying smoothly we need to have memory at least equal to or greater than the size of the index? In my case the index size will be very heavy (~2TB) and, practically speaking, that amount of memory is not possible. Even if it goes to multiple shards, say around 10 shards, then 200GB of RAM will still not be a feasible option.

*With CloudSolrServer, can we specify which shard a particular index should go to and reside on? I can do this with EmbeddedSolrServer by indexing into different directories and moving them to the appropriate shard directories.

Thanks!

On Wed, Jun 4, 2014 at 12:43 PM, Shawn Heisey s...@elyograg.org wrote:

On 6/4/2014 12:45 AM, Vineet Mishra wrote: Thanks all for your response. I presume this conversation concludes that indexing around 1 billion documents per shard won't be a problem. As I have 10 billion docs to index, approx 10 shards with 1 billion each should be fine. And how about memory, what size of RAM should be fine for this amount of data?

Figure out the heap requirements of the operating system and every program on the machine (Solr especially). Then add that number to the total size of the index data on the machine. That is the ideal minimum RAM. http://wiki.apache.org/solr/SolrPerformanceProblems

Unfortunately, if you are dealing with a huge index with billions of documents, it is likely to be prohibitively expensive to buy that much RAM. If you are running Solr on Amazon's cloud, the cost for that much RAM would be astronomical.

Exactly how much RAM would actually be required is very difficult to predict. If you had only 25% of the ideal, your index might have perfectly acceptable performance, or it might not. It might do fine under a light query load, but if you increase to 50 queries per second, performance may drop significantly ... or it might be good.
It's generally not possible to know how your hardware will perform until you actually build and use your index. http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

A general rule of thumb for RAM that I have found to be useful is that if you've got less than half of the ideal memory size, you might have performance problems.

Moreover, what should be the indexing technique for this huge data set? Currently I am indexing with EmbeddedSolrServer, but it goes pathetically slow after some 20GB of indexing. Comparatively, SolrHttpPost was slow due to network delays and response, but after this long run indexing with EmbeddedSolrServer I am getting a different notion. Any good indexing technique for this huge dataset would be highly appreciated.

EmbeddedSolrServer is not recommended. Run Solr in the traditional way with HTTP connectivity. HTTP overhead on a LAN is usually quite small. Solr is fully thread-safe, so you can have several indexing threads all going at the same time.

Indexes at this scale should normally be built with SolrCloud, with enough servers so that each machine is only handling one shard replica. The ideal indexing program would be written in Java, using CloudSolrServer.

Thanks, Shawn
Re: Solr maximum Optimal Index Size per Shard
Hey Jack,

Well, I have indexed around 10 million documents consuming 20GB of index size. Each document consists of nearly 100 string fields with data up to 10 characters per field. In my case the number of fields per document can grow considerably (from the current 100 to 500 or even more). As for the typical exceptional case, I was more interested in a way to evenly maintain the right ratio of index vs. shard.

Thanks!

On Wed, Jun 4, 2014 at 7:47 PM, Jack Krupansky j...@basetechnology.com wrote:

How many documents were in that 20GB index? I'm skeptical that a 1 billion document shard won't be a problem. I mean, technically it is possible, but as you are already experiencing, it may take a long time and a very powerful machine to do so. 100 million (or 250 million max) would be a more realistic goal. Even then, it depends on your doc size and machine size.

The main point from the previous discussion is that although the technical hard limit for a Solr shard is 2G docs, from a practical perspective it is very difficult to get to that limit, not that indexing 1 billion docs on a single shard is just fine!

As a general rule, if you want fast queries at high volume, strive to ensure that your per-shard index fits entirely into the system memory available for OS caching of file system pages. In any case, a proof-of-concept implementation will tell you everything you need to know.

-- Jack Krupansky

-Original Message- From: Vineet Mishra Sent: Wednesday, June 4, 2014 2:45 AM To: solr-user@lucene.apache.org Subject: Re: Solr maximum Optimal Index Size per Shard

Thanks all for your response. I presume this conversation concludes that indexing around 1 billion documents per shard won't be a problem. As I have 10 billion docs to index, approx 10 shards with 1 billion each should be fine. And how about memory, what size of RAM should be fine for this amount of data?
Moreover, what should be the indexing technique for this huge data set? Currently I am indexing with EmbeddedSolrServer, but it goes pathetically slow after some 20GB of indexing. Comparatively, SolrHttpPost was slow due to network delays and response, but after this long run indexing with EmbeddedSolrServer I am getting a different notion. Any good indexing technique for this huge dataset would be highly appreciated.

Thanks again!

On Wed, Jun 4, 2014 at 6:40 AM, rulinma ruli...@gmail.com wrote: mark.

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-maximum-Optimal-Index-Size-per-Shard-tp4139565p4139698.html Sent from the Solr - User mailing list archive at Nabble.com.
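[Editor's note: a back-of-the-envelope check of the numbers in this exchange. 10 million documents in a 20GB index works out to roughly 2KB of index per document, so the 1-billion-doc shard discussed in the thread would land near 2TB. A sketch of that arithmetic; the figures are the thread's, the class itself is hypothetical:]

```java
public class IndexSizing {
    // Rough bytes of index per document, derived from observed totals
    static long bytesPerDoc(long indexBytes, long docCount) {
        return indexBytes / docCount;
    }

    public static void main(String[] args) {
        long indexBytes = 20L << 30;   // 20GB index observed in the thread
        long docs = 10_000_000L;       // holding 10 million documents
        long perDoc = bytesPerDoc(indexBytes, docs); // ~2KB per document

        // Extrapolated size of the proposed 1-billion-doc shard: close to 2TB
        long shardGB = perDoc * 1_000_000_000L >> 30;
        System.out.println(perDoc + " bytes/doc, ~" + shardGB + "GB per 1B-doc shard");
    }
}
```

Under these assumptions a 10-shard layout for 10 billion such docs would mean about 2TB of index per machine, which is what drives the RAM discussion below.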
Re: Solr maximum Optimal Index Size per Shard
On Fri, 2014-06-06 at 12:32 +0200, Vineet Mishra wrote: *Does that mean that for querying smoothly we need to have memory at least equal to or greater than the size of the index?

If you absolutely, positively have to reduce latency as much as possible, then yes. With an estimated index size of 2TB, I would guess that 10-20 machines with powerful CPUs (1 per shard per expected concurrent request) would also be advisable. While you're at it, do make sure that you're using high-speed memory.

That was not a serious suggestion, should you be in doubt. Very few people need the best latency possible. Most just need the individual searches to be fast enough and want to scale throughput instead.

As in my case the index size will be very heavy (~2TB) and, practically speaking, that amount of memory is not possible. Even if it goes to multiple shards, say around 10 shards, then 200GB of RAM will still not be a feasible option.

We're building a projected 24TB index collection and are currently at 2.7TB+, growing by about 1TB per 10 days. Our current plan is to use a single machine with 256GB of RAM, but we will of course adjust along the way if it proves to be too small. Requirements differ with the corpus and the needs, but for us, SSDs as storage seem to provide quite enough of a punch. I did a little testing yesterday: https://plus.google.com/u/0/+TokeEskildsen/posts/4yPvzrQo8A7

tl;dr: for small result sets (< 1M hits) on unwarmed searches with simple queries, response time is below 100ms. If we enable faceting with plain Solr, this jumps to about 1 second. I did a top on the machine and it says that 50GB is currently used for caching, so an 80GB (and probably less) machine would work fine for our 2.7TB index.

- Toke Eskildsen, State and University Library, Denmark
Re: Solr maximum Optimal Index Size per Shard
Hi Toke,

That was spectacular, really great to hear that you have already indexed 2.7TB+ of data on your server and the query response time is still under 100ms, or a few seconds with faceting, for such a huge dataset.

Could you state what indexing mechanism you are using? I started with EmbeddedSolrServer but it was pretty slow after a few GB (~30+) of indexing. I started indexing 1 week back and it's still at 37GB, although I assume the HttpPost mechanism will be lethargically slow due to network latency and waiting for responses.

Furthermore, I started with CloudSolrServer but am facing a weird exception, a ClassCastException ("Cannot cast to Exception"), while adding the SolrInputDocument to the server:

    CloudSolrServer server1 = new CloudSolrServer("zkHost1:port,zkHost2:port,zkHost3:port", false);
    server1.setDefaultCollection("mycollection");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("ID", 123);
    doc.addField("A0_s", 282628854);
    server1.add(doc); // Error at this line
    server1.commit();

Thanks again, Toke, for sharing those stats.

On Fri, Jun 6, 2014 at 5:04 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

On Fri, 2014-06-06 at 12:32 +0200, Vineet Mishra wrote: *Does that mean that for querying smoothly we need to have memory at least equal to or greater than the size of the index?

If you absolutely, positively have to reduce latency as much as possible, then yes. With an estimated index size of 2TB, I would guess that 10-20 machines with powerful CPUs (1 per shard per expected concurrent request) would also be advisable. While you're at it, do make sure that you're using high-speed memory.

That was not a serious suggestion, should you be in doubt. Very few people need the best latency possible. Most just need the individual searches to be fast enough and want to scale throughput instead.

As in my case the index size will be very heavy (~2TB) and, practically speaking, that amount of memory is not possible.
Even if it goes to multiple shards, say around 10 shards, then 200GB of RAM will still not be a feasible option.

We're building a projected 24TB index collection and are currently at 2.7TB+, growing by about 1TB per 10 days. Our current plan is to use a single machine with 256GB of RAM, but we will of course adjust along the way if it proves to be too small. Requirements differ with the corpus and the needs, but for us, SSDs as storage seem to provide quite enough of a punch. I did a little testing yesterday: https://plus.google.com/u/0/+TokeEskildsen/posts/4yPvzrQo8A7

tl;dr: for small result sets (< 1M hits) on unwarmed searches with simple queries, response time is below 100ms. If we enable faceting with plain Solr, this jumps to about 1 second. I did a top on the machine and it says that 50GB is currently used for caching, so an 80GB (and probably less) machine would work fine for our 2.7TB index.

- Toke Eskildsen, State and University Library, Denmark
Re: Solr maximum Optimal Index Size per Shard
On Fri, 2014-06-06 at 14:05 +0200, Vineet Mishra wrote: Could you state what indexing mechanism you are using? I started with EmbeddedSolrServer but it was pretty slow after a few GB (~30+) of indexing.

I suspect that is due to too-frequent commits, too small a heap, or some third factor unrelated to EmbeddedSolrServer itself. Underneath the surface it is just the same as a standalone Solr.

We're building our ~1TB indexes individually, using standalone workers for the heavy part of the analysis (Tika). The delivery from the workers to the Solr server is over the network, using the Solr binary protocol. My colleague Thomas Egense just created a small write-up at https://github.com/netarchivesuite/netsearch

I started indexing 1 week back and it's still at 37GB, although I assume the HttpPost mechanism will be lethargically slow due to network latency and waiting for responses.

Maybe if you send the documents one at a time, but if you bundle them in larger updates, the post method should be fine.

- Toke Eskildsen, State and University Library, Denmark
Re: Solr maximum Optimal Index Size per Shard
Earlier I used to index with the HttpPost mechanism only, making each post 2MB to 20MB in size, and that was going fine. But we suspected that instead of indexing through a network call (which of course incurs latency due to network delays and the HTTP protocol), writing the index offline and just dumping it onto the shards would be much better.

Although I am currently committing in batches of 25K docs, I will try replacing that with commitWithin (it seems to work faster), or probably have a look at this binary protocol.

Thanks!

On Fri, Jun 6, 2014 at 5:55 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

On Fri, 2014-06-06 at 14:05 +0200, Vineet Mishra wrote: Could you state what indexing mechanism you are using? I started with EmbeddedSolrServer but it was pretty slow after a few GB (~30+) of indexing.

I suspect that is due to too-frequent commits, too small a heap, or some third factor unrelated to EmbeddedSolrServer itself. Underneath the surface it is just the same as a standalone Solr.

We're building our ~1TB indexes individually, using standalone workers for the heavy part of the analysis (Tika). The delivery from the workers to the Solr server is over the network, using the Solr binary protocol. My colleague Thomas Egense just created a small write-up at https://github.com/netarchivesuite/netsearch

I started indexing 1 week back and it's still at 37GB, although I assume the HttpPost mechanism will be lethargically slow due to network latency and waiting for responses.

Maybe if you send the documents one at a time, but if you bundle them in larger updates, the post method should be fine.

- Toke Eskildsen, State and University Library, Denmark
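[Editor's note: the bundling advice can be sketched independently of SolrJ: buffer documents client-side and hand each full batch to a single sender call (with SolrJ that sender would typically be an add of the whole batch using a commitWithin deadline, rather than an explicit commit every 25K docs). This is a hypothetical illustration of the pattern, not code from the thread:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingIndexer<D> {
    private final List<D> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<D>> sender; // e.g. wraps solr.add(batch) + commitWithin
    int flushes = 0;                        // how many batches have been sent

    BatchingIndexer(int batchSize, Consumer<List<D>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    void add(D doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        sender.accept(new ArrayList<>(buffer)); // one network round trip per batch
        buffer.clear();
        flushes++;
    }

    public static void main(String[] args) {
        BatchingIndexer<String> indexer = new BatchingIndexer<>(1000,
                batch -> System.out.println("sending " + batch.size() + " docs"));
        for (int i = 0; i < 2500; i++) indexer.add("doc-" + i);
        indexer.flush(); // send the final partial batch
    }
}
```

With a batch size of 1000, the 2500 docs above go over the wire in three calls instead of 2500, which is the point of Toke's suggestion.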
Re: Solr maximum Optimal Index Size per Shard
Thanks all for your response. I presume this conversation concludes that indexing around 1 billion documents per shard won't be a problem. As I have 10 billion docs to index, approx 10 shards with 1 billion each should be fine. And how about memory, what size of RAM should be fine for this amount of data?

Moreover, what should be the indexing technique for this huge data set? Currently I am indexing with EmbeddedSolrServer, but it goes pathetically slow after some 20GB of indexing. Comparatively, SolrHttpPost was slow due to network delays and response, but after this long run indexing with EmbeddedSolrServer I am getting a different notion. Any good indexing technique for this huge dataset would be highly appreciated.

Thanks again!

On Wed, Jun 4, 2014 at 6:40 AM, rulinma ruli...@gmail.com wrote: mark.
Re: Solr maximum Optimal Index Size per Shard
On 6/4/2014 12:45 AM, Vineet Mishra wrote: Thanks all for your response. I presume this conversation concludes that indexing around 1 billion documents per shard won't be a problem. As I have 10 billion docs to index, approx 10 shards with 1 billion each should be fine. And how about memory, what size of RAM should be fine for this amount of data?

Figure out the heap requirements of the operating system and every program on the machine (Solr especially). Then add that number to the total size of the index data on the machine. That is the ideal minimum RAM. http://wiki.apache.org/solr/SolrPerformanceProblems

Unfortunately, if you are dealing with a huge index with billions of documents, it is likely to be prohibitively expensive to buy that much RAM. If you are running Solr on Amazon's cloud, the cost for that much RAM would be astronomical.

Exactly how much RAM would actually be required is very difficult to predict. If you had only 25% of the ideal, your index might have perfectly acceptable performance, or it might not. It might do fine under a light query load, but if you increase to 50 queries per second, performance may drop significantly ... or it might be good. It's generally not possible to know how your hardware will perform until you actually build and use your index. http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

A general rule of thumb for RAM that I have found to be useful is that if you've got less than half of the ideal memory size, you might have performance problems.

Moreover, what should be the indexing technique for this huge data set? Currently I am indexing with EmbeddedSolrServer, but it goes pathetically slow after some 20GB of indexing. Comparatively, SolrHttpPost was slow due to network delays and response, but after this long run indexing with EmbeddedSolrServer I am getting a different notion. Any good indexing technique for this huge dataset would be highly appreciated.

EmbeddedSolrServer is not recommended. Run Solr in the traditional way with HTTP connectivity. HTTP overhead on a LAN is usually quite small. Solr is fully thread-safe, so you can have several indexing threads all going at the same time.

Indexes at this scale should normally be built with SolrCloud, with enough servers so that each machine is only handling one shard replica. The ideal indexing program would be written in Java, using CloudSolrServer.

Thanks, Shawn
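[Editor's note: Shawn's sizing guideline (ideal RAM = all heaps + full index size) and his half-of-ideal rule of thumb reduce to simple arithmetic. A hypothetical sketch; the 8GB-of-heaps and 200GB-index inputs are made-up examples, not figures from the thread:]

```java
public class RamRuleOfThumb {
    // Shawn's guideline: heap needs of OS + all programs, plus the full index size
    static long idealRamGB(long totalHeapsGB, long indexGB) {
        return totalHeapsGB + indexGB;
    }

    // His rule of thumb: below half the ideal, performance problems become likely
    static boolean likelyUnderprovisioned(long installedGB, long idealGB) {
        return installedGB < idealGB / 2;
    }

    public static void main(String[] args) {
        long ideal = idealRamGB(8, 200); // 8GB of heaps + 200GB index = 208GB ideal
        System.out.println("ideal RAM: " + ideal + "GB");
        System.out.println("64GB machine risky? " + likelyUnderprovisioned(64, ideal));
    }
}
```

As the thread stresses, this is only a heuristic: a machine at 25% of the ideal might perform acceptably or might not, depending on query load.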
Re: Solr maximum Optimal Index Size per Shard
How many documents were in that 20GB index? I'm skeptical that a 1 billion document shard won't be a problem. I mean, technically it is possible, but as you are already experiencing, it may take a long time and a very powerful machine to do so. 100 million (or 250 million max) would be a more realistic goal. Even then, it depends on your doc size and machine size.

The main point from the previous discussion is that although the technical hard limit for a Solr shard is 2G docs, from a practical perspective it is very difficult to get to that limit, not that indexing 1 billion docs on a single shard is just fine!

As a general rule, if you want fast queries at high volume, strive to ensure that your per-shard index fits entirely into the system memory available for OS caching of file system pages. In any case, a proof-of-concept implementation will tell you everything you need to know.

-- Jack Krupansky

-Original Message- From: Vineet Mishra Sent: Wednesday, June 4, 2014 2:45 AM To: solr-user@lucene.apache.org Subject: Re: Solr maximum Optimal Index Size per Shard

Thanks all for your response. I presume this conversation concludes that indexing around 1 billion documents per shard won't be a problem. As I have 10 billion docs to index, approx 10 shards with 1 billion each should be fine. And how about memory, what size of RAM should be fine for this amount of data?

Moreover, what should be the indexing technique for this huge data set? Currently I am indexing with EmbeddedSolrServer, but it goes pathetically slow after some 20GB of indexing. Comparatively, SolrHttpPost was slow due to network delays and response, but after this long run indexing with EmbeddedSolrServer I am getting a different notion. Any good indexing technique for this huge dataset would be highly appreciated.

Thanks again!

On Wed, Jun 4, 2014 at 6:40 AM, rulinma ruli...@gmail.com wrote: mark.
Solr maximum Optimal Index Size per Shard
Hi All,

Has anyone come across a maximum threshold, document-wise or size-wise, for each Solr core to hold? I have indexed some 10 million documents (18GB), and when I index another 5 million documents (9GB) on top of those indexes, it responds a little slowly to stats queries.

Considering I have around 2TB of data to index, what would be an appropriate balanced proportion of data vs. number of shards? It's more a case of indexing big data for NRT (near-real-time) search.

Looking forward to your response. Urgent!

Thanks!
Re: Solr maximum Optimal Index Size per Shard
How much free system memory do you have for the OS to cache file system data? If your entire index fits in system memory, operations will be fast, but as your index grows beyond the space the OS can use to cache the data, performance will decline. But there's no hard limit in Solr per se.

-- Jack Krupansky

-Original Message- From: Vineet Mishra Sent: Tuesday, June 3, 2014 8:43 AM To: solr-user@lucene.apache.org Subject: Solr maximum Optimal Index Size per Shard

Hi All,

Has anyone come across a maximum threshold, document-wise or size-wise, for each Solr core to hold? I have indexed some 10 million documents (18GB), and when I index another 5 million documents (9GB) on top of those indexes, it responds a little slowly to stats queries.

Considering I have around 2TB of data to index, what would be an appropriate balanced proportion of data vs. number of shards? It's more a case of indexing big data for NRT (near-real-time) search.

Looking forward to your response. Urgent!

Thanks!
Re: Solr maximum Optimal Index Size per Shard
On 6/3/2014 12:54 PM, Jack Krupansky wrote: How much free system memory do you have for the OS to cache file system data? If your entire index fits in system memory operations will be fast, but as your index grows beyond the space the OS can use to cache the data, performance will decline. But there's no hard limit in Solr per se.

Vineet,

There is only one hard limit in Solr: You can't put more than about 2 billion documents in one shard. The exact number is 2147483647 -- the largest number that you can store in a Java integer. Because this number also includes deleted documents, to be absolutely sure that nothing will have a problem, it would be advisable to stay below 1 billion documents per shard.

Because of Solr's reliance on RAM for the OS disk cache, which Jack has already mentioned, chances are very good that your shards will have performance problems long before you reach a billion documents.

Thanks, Shawn
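[Editor's note: the hard limit Shawn quotes is exactly Integer.MAX_VALUE, and deleted-but-not-yet-merged documents count against it. A small illustration; the helper and its 1-billion threshold merely encode Shawn's advice and are not Solr APIs:]

```java
public class ShardDocLimit {
    // Lucene addresses documents within a shard with a Java int,
    // so a single shard tops out at 2,147,483,647 documents.
    static final long HARD_LIMIT = Integer.MAX_VALUE;

    // Deleted documents count against the limit until merges reclaim them,
    // hence the advice to stay below 1 billion total docs per shard.
    static boolean safelyBelowLimit(long liveDocs, long deletedDocs) {
        return liveDocs + deletedDocs < 1_000_000_000L;
    }

    public static void main(String[] args) {
        System.out.println("hard limit per shard: " + HARD_LIMIT);
        // 900M live + 200M deleted = 1.1B total, over the advised margin
        System.out.println("900M live + 200M deleted safe? "
                + safelyBelowLimit(900_000_000L, 200_000_000L));
    }
}
```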
Re: Solr maximum Optimal Index Size per Shard
Anybody care to forecast when hardware will catch up with Solr and we can routinely look forward to newbies complaining that they indexed some data and after only 10 minutes they hit this weird 2G document count limit?

-- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Tuesday, June 3, 2014 3:34 PM To: solr-user@lucene.apache.org Subject: Re: Solr maximum Optimal Index Size per Shard

On 6/3/2014 12:54 PM, Jack Krupansky wrote: How much free system memory do you have for the OS to cache file system data? If your entire index fits in system memory operations will be fast, but as your index grows beyond the space the OS can use to cache the data, performance will decline. But there's no hard limit in Solr per se.

Vineet,

There is only one hard limit in Solr: You can't put more than about 2 billion documents in one shard. The exact number is 2147483647 -- the largest number that you can store in a Java integer. Because this number also includes deleted documents, to be absolutely sure that nothing will have a problem, it would be advisable to stay below 1 billion documents per shard.

Because of Solr's reliance on RAM for the OS disk cache, which Jack has already mentioned, chances are very good that your shards will have performance problems long before you reach a billion documents.

Thanks, Shawn
Re: Solr maximum Optimal Index Size per Shard
On 6/3/2014 1:47 PM, Jack Krupansky wrote: Anybody care to forecast when hardware will catch up with Solr and we can routinely look forward to newbies complaining that they indexed some data and after only 10 minutes they hit this weird 2G document count limit?

I would speculate that Lucene will update its index format to use 64-bit (or maybe VInt) document identifiers before a typical (or test) data source would present a problem like this.

I also seriously doubt that it would be a complete newbie; it is more likely that it would be an experienced admin or developer who would instantly know why it happened as soon as they saw how many documents were successfully indexed. They might need someone to point out SolrCloud shards to them as the solution. I've already heard of a production SolrCloud install with five billion documents, so even now it's not a theoretical problem.

Thanks, Shawn
Re: Solr maximum Optimal Index Size per Shard
mark.