What I can say is:  
  

  * SSD storage (crucial for performance when the index doesn't fit in memory, and at this scale it won't).
  * Divide and conquer: for that volume of documents you will need more than 6 nodes.
  * DocValues, to avoid stressing the Java heap.
  * Will you aggregate data? If yes, what is your max cardinality? This question is the most important one for sizing memory correctly (see the sketch after this list).
  * Latency matters too: what threshold is acceptable before you consider a query slow?
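
One way to answer the cardinality question is to ask Solr itself. Below is a minimal SolrJ sketch (the ZooKeeper address, collection, and field names are hypothetical, and the builder style assumes SolrJ 6.x or later) that uses the JSON Facet API's `unique()` function to approximate the number of distinct values in a field:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CardinalityProbe {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr") // hypothetical ensemble
                .build();
        client.setDefaultCollection("docs"); // hypothetical collection

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0); // only the facet result is wanted, not documents
        // unique() approximates the distinct-value count; it reads docValues,
        // so it avoids building field caches on the Java heap.
        q.setParam("json.facet", "{uniqueKeys: \"unique(userId)\"}"); // userId is hypothetical
        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getResponse().get("facets"));
        client.close();
    }
}
```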
At my company we are running a 12-terabyte (2 replicas) Solr cluster with 8
billion documents spread over 500 collections. For this we have about 12
machines with SSDs and 32G of RAM each (~24G for the heap).  
  
We don't have a strict need for speed: a 30-second query aggregating 100
million documents with 1M unique keys is fast enough for us. Aggregation
performance normally decreases as the number of unique keys increases; with a
low unique-key count, queries take less than 2 seconds if the data is in the
OS cache.  
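
(For scale: assuming those 12 TB are spread evenly, that is on the order of a terabyte of index per node against only ~8 GB of RAM left for the OS cache after the 24G heap, which is why cache residency decides whether a query takes 2 seconds or 30.)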
  
Personal recommendations:

  * Sharding is important and smart sharding is crucial: you don't want to run queries against data that isn't relevant (that slows queries down when the dataset is big). See the routing sketch after this list.
  * If you want to measure speed, do it with about 1 billion documents to simulate something realistic (realistic for a 10-billion-document world).
  * Index with re-indexing in mind: with 10 billion docs, re-indexing takes months. This matters especially if you rely on non-standard features of Solr. In my case I configured DocValues with the disk format (not a standard feature in 4.x), and at some point that format was deprecated. Upgrading Solr to 5.x was an epic 3-month battle to do without full downtime.
  * Solr is like your girlfriend: it will demand love, care, and plenty of space to fully recover replicas that at some point fall out of sync, which happens a lot when restarting nodes (and is annoying with 100G replicas). Don't underestimate this point; free space can save your life.  
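
To make the "smart sharding" point concrete, here is a hedged sketch of one mechanism Solr offers for it: the compositeId router. Prefixing the document id with a routing key keeps related documents on one shard, and the `_route_` parameter then lets a query touch only that shard instead of fanning out to all of them (the client setup, collection, and field names here are hypothetical):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class RoutedAccess {
    // Indexing: "tenant42!" is a compositeId routing prefix; Solr hashes the
    // prefix, so every document that shares it lands on the same shard.
    static void indexOne(SolrClient client) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "tenant42!doc-0001"); // routeKey!uniqueId
        doc.addField("body", "some content");
        client.add(doc);
        client.commit();
    }

    // Querying: _route_ restricts the request to the shard(s) that own the
    // routing key, so the other shards never see it.
    static QueryResponse queryOneTenant(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("body:foo");
        q.setParam("_route_", "tenant42!");
        return client.query(q);
    }
}
```

Queries that omit `_route_` still fan out to every shard, so the routing scheme only pays off if the key is known at query time.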

--

/Yago Riveiro

> On Jan 19 2016, at 11:26 pm, Shawn Heisey <apa...@elyograg.org> wrote:  

>

> On 1/19/2016 1:30 PM, Troy Edwards wrote:  
>> We are currently "beta testing" a SolrCloud with 2 nodes and 2 shards with  
>> 2 replicas each. The number of documents is about 125000.  
>>  
>> We now want to scale this to about 10 billion documents.  
>>  
>> What are the steps to prototyping, hardware estimation and stress testing?

>

> There is no general information available for sizing, because there are  
too many factors that will affect the answers. Some of the important  
information that you need will be impossible to predict until you  
actually build it and subject it to a real query load.

>

> https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

>

> With an index of 10 billion documents, you may not be able to precisely  
predict performance and hardware requirements from a small-scale  
prototype. You'll likely need to build a full-scale system on a small  
testbed, look for bottlenecks, ask for advice, and plan on a larger  
system for production.

>

> The hard limit for documents on a single shard is slightly less than  
Java's Integer.MAX_VALUE -- just over two billion. Because deleted  
documents count against this max, about one billion documents per shard  
is the absolute max that should be loaded in practice.

>

> BUT, if you actually try to put one billion documents in a single  
server, performance will likely be awful. A more reasonable limit per  
machine is 100 million ... but even this is quite large. You might need  
smaller shards, or you might be able to get good performance with larger  
shards. It all depends on things that you may not even know yet.
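
Running the arithmetic on those two numbers: 10 billion documents at the ~1-billion-per-shard practical ceiling means an absolute minimum of 10 shards, while the more comfortable 100 million per shard puts you at around 100 shards; where a given system lands between those bounds is exactly what the prototype has to discover.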

>

> Memory is always a strong driver for Solr performance, and I am speaking  
specifically of OS disk cache -- memory that has not been allocated by  
any program. With 10 billion documents, your total index size will  
likely be hundreds of gigabytes, and might even reach terabyte scale.  
Good performance with indexes this large will require a lot of total  
memory, which probably means that you will need a lot of servers and  
many shards. SSD storage is strongly recommended.
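
As a hedged illustration (the figures are assumptions, not measurements): if the total index comes in at 1 TB and you want most of it resident in the OS disk cache, the cluster needs on the order of 1 TB of unallocated RAM; with servers that have roughly 50 GB left over after the Java heap, that is about 20 machines for caching alone.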

>

> For extreme scaling on Solr, especially if the query rate will be high,  
it is recommended to only have one shard replica per server.

>

> I have just added an "extreme scaling" section to the following wiki  
page, but it's mostly a placeholder right now. I would like to have a  
discussion with people who operate very large indexes so I can put real  
usable information in this section. I'm on IRC quite frequently in the  
#solr channel.

>

> https://wiki.apache.org/solr/SolrPerformanceProblems

>

> Thanks,  
Shawn
