Thanks for this - I haven't any previous experience with utilising SSDs in the way you suggest, so I guess I need to start learning! And thanks for the Danish-webscale URL, looks like very informative reading. (Yes, I think we're working in similar industries with similar constraints and expectations.)
Compiling my answers into one email:

"Curious how many documents per shard you were planning? The number of documents per shard and field type will drive the amount of RAM needed to sort and facet."

Number of documents per shard: I think about 200 million. That's a rough estimate based on other Solrs we run, though. Which I think means we hold a lot of data for each document, though I keep arguing to keep this to the truly required minimum. We also have many facets, some of which are pretty large. (I'm stretching my understanding here, but I think most documents have many 'entries' in many facets, so these really hit us performance-wise.)

I try to keep a 1-to-1 ratio of Solr nodes to CPUs, with a few spare for the operating system, and I utilise MMapDirectory to manage memory via the OS. So at this moment I'm guessing that we'll have 56 dedicated Solr CPUs across 2 physical 32-CPU servers and _hopefully_ 256GB RAM on each. This would give 28 shards, each with 5GB of Java memory (in Tomcat), leaving 116GB on each server for the OS and MMap. (I believe the Solr theory for this doesn't accurately work out, but we can accept the edge cases where this will fail.)

I can also see that our hardware requirements will depend on usage as well as the volume of data, and I've been pondering how best we can structure our index/es to facilitate a long-term service (which means that, given it's a lot of data, I need to structure the data so that new usage doesn't require re-indexing.) But at this early stage, as people say, we need to prototype, test, profile etc., and to do that I need the hardware to run the trials (policy dictates that I buy the production hardware now, before profiling - I get to control much of the design and construction, so I don't argue with this!)
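The per-server budget above can be sketched as a small calculation. A minimal sketch, assuming 28 Solr processes per 256GB server with 5GB of Java heap each (the estimates from this thread, not measured values):

```python
# Hypothetical sizing sketch for the plan described above. All inputs
# (shard count, heap size, total RAM) are this thread's estimates.

def page_cache_headroom_gb(total_ram_gb, shard_count, heap_per_shard_gb):
    """RAM left for the OS and the MMapDirectory page cache
    after the Java heaps are allocated."""
    return total_ram_gb - shard_count * heap_per_shard_gb

# 28 shards at 5 GB heap each on a 256 GB server
print(page_cache_headroom_gb(256, 28, 5))  # -> 116 GB for OS + mmap
```

The more of that headroom the OS can spend caching index files, the fewer disk reads MMapDirectory has to make, which is why the heap is kept deliberately small.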
Thanks for all the comments everyone, all very much appreciated :)

Gil

-----Original Message-----
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: 11 December 2013 12:02
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceed to also allow for the operating system etc.), how have
> people handled this situation?

By acknowledging that it is cheaper to buy SSDs instead of trying to compensate for slow spinning drives with excessive amounts of RAM. Our plan for an estimated 20TB of indexes out of 372TB of raw web data is to use SSDs controlled by a single machine with 512GB of RAM (or was it 256GB? I'll have to ask the hardware guys):

https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

As always YMMV, and the numbers you quote elsewhere indicate that your queries are quite complex. You might want to do a bit of profiling to see if they are heavy enough to make the CPU the bottleneck.

Regards,
Toke Eskildsen, State and University Library, Denmark
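The raw-data-to-index ratio quoted in the thread can be sketched the same way. A minimal sketch, assuming the ~10:1 compression reported for this data and schema; the ratio is workload-specific, not a general rule:

```python
# Rough capacity sketch from the figures quoted in this thread.
# The 10:1 raw-data-to-index ratio is an observation for this
# particular data and schema, not a general Solr rule.

def estimated_index_tb(raw_tb, compression_ratio=10):
    """Estimate on-disk Solr index size from raw data volume."""
    return raw_tb / compression_ratio

print(estimated_index_tb(60))  # -> 6.0 TB index from 60 TB of raw data
```

At that scale the index no longer fits in RAM on any affordable box, which is the point of Toke's suggestion: let SSDs absorb the cache misses instead of buying RAM to hold the whole index.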