On 10/2/2018 9:33 AM, Rekha wrote:
Dear Solr Team,

I need following clarification from you, please check and give suggestion to me,

1. I want to store and search 200 Billions of documents(Each document contains 16 fields). For my case can I able to achieve by using Solr cloud?
2. For my case how many shard and nodes will be needed?
3. In future can I able to increase the nodes and shards?

Thanks,
Rekha Karthick

In a nutshell:  It's not possible to give generic advice. The contents of the fields will affect exactly what you need.  The nature of the queries that you send will affect exactly what you need.  The query rate will affect exactly what you need. The overall size of the index (disk space, as well as document count) will affect what you need.

In the "not very helpful, but I promise it's the absolute truth" department, there's this blog post:

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

To handle 200 billion documents *in a single collection*, you're probably going to want at least 200 shards, and there are good reasons to go with even more shards than that.  But you need to be warned that there can be serious scalability problems when SolrCloud must keep track of that many different indexes.  Here's an issue I filed for scalability problems with thousands of collections ... there can be similar problems with lots of shards as well.  This issue says it is fixed, but no code changes that I am aware of were ever made related to the issue, and as far as I can tell, it's still a problem even in the latest version:

https://issues.apache.org/jira/browse/SOLR-7191

That many shards and replicas in one collection is likely to require raising ZooKeeper's maximum znode size (jute.maxbuffer), because the JSON structure describing the collection will probably take more than one megabyte.
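
To put the 200-shard figure in perspective, here is a rough sketch of the arithmetic. It spreads the documents evenly across shards (an assumption; real distributions are never perfectly even) and compares the result with Lucene's hard per-index limit of about 2.147 billion documents, a limit that deleted documents also count against until they are merged away. The function names are made up for this illustration.

```python
# Lucene's hard cap on documents in a single index (IndexWriter.MAX_DOCS).
LUCENE_MAX_DOCS = 2**31 - 1 - 128


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding on large counts."""
    return -(-a // b)


def docs_per_shard(total_docs: int, num_shards: int) -> int:
    """Documents each shard holds, assuming a perfectly even distribution."""
    return ceil_div(total_docs, num_shards)


def min_shards(total_docs: int, max_docs_per_shard: int = LUCENE_MAX_DOCS) -> int:
    """Smallest shard count that keeps every shard at or under the given cap."""
    return ceil_div(total_docs, max_docs_per_shard)


TOTAL_DOCS = 200_000_000_000  # 200 billion, from the original question

print(docs_per_shard(TOTAL_DOCS, 200))  # 1000000000
print(min_shards(TOTAL_DOCS))           # 94
```

So 94 shards is the bare minimum just to stay under the Lucene limit, with no headroom at all. At 200 shards each index holds about a billion documents, a little under half the limit, which leaves room for deleted documents and growth.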

As for how many machines you'll need ... absolutely no idea.  If the query rate will be insanely high, you'll want a dedicated machine for each shard replica, and you may need many replicas, which is going to mean hundreds, possibly thousands, of servers.  If the query rate is really low and/or each document is very small, you might be able to house more than one shard per server.  But you should know that handling 200 billion documents is going to require a lot of hardware even if it turns out that you're not going to be handling tons of data (per document) or queries.
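
The way the server count scales can be sketched with simple arithmetic. The replica counts and the number of shard replicas packed onto each server below are hypothetical inputs chosen only to show the shape of the calculation, not recommendations; the real numbers can only come from testing with your own data and queries.

```python
def servers_needed(num_shards: int, replicas_per_shard: int,
                   replicas_per_server: int = 1) -> int:
    """Servers required to host every shard replica.

    replicas_per_server=1 models a dedicated machine per shard replica.
    """
    total_replicas = num_shards * replicas_per_shard
    # Ceiling division: a partly filled server is still a whole server.
    return -(-total_replicas // replicas_per_server)


# Dedicated machine per replica, no redundancy.
print(servers_needed(200, 1))       # 200
# Dedicated machine per replica, 10 replicas of each shard.
print(servers_needed(200, 10))      # 2000
# Low query rate: 3 replicas of each shard, 4 shard replicas per server.
print(servers_needed(200, 3, 4))    # 150
```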

Thanks,
Shawn
