On 2/20/2014 9:24 PM, search engn dev wrote:
> Data size is 250 GB of small records; each record is around 0.3 KB,
> for a total of about 1 billion records. My index has 20 different
> fields. Most queries will be very simple or spatial queries, mainly
> on 2-3 fields. All 20 fields will be stored. Any suggestions on how
> many shards I will need to search this data?
Your question is impossible to answer precisely. I will tell you that this is a very big index, and it's going to take a lot of hardware. It's not the biggest I've heard of, but it is quite large. Any situation that would cause a performance issue on a small index is going to be far worse on a large one.

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Two machines with 32GB of RAM each are not going to be anywhere near enough. If you can't get more than 32GB of RAM in each server, you're probably going to need a lot of them.

Since all your fields will be stored, the *minimum* size of your index will be approximately equal to the original data size after compression -- assuming you're using 4.1.0 or later, where stored-field compression was introduced. That will not be the end of it, though, because it doesn't take into account the size of the *indexed* data. Although it is theoretically possible to look at a schema and the original data and calculate the size of the indexed data, in reality the only way to be SURE is to actually index a significant percentage of your real data with the same schema you would use in production.

Once you know how big your index is actually going to be, you can begin to figure out how much total RAM you'll need across all the servers for a single copy of the index (no redundancy). If you want redundancy, the requirements will be at least twice what you calculate.

http://wiki.apache.org/solr/SolrPerformanceProblems

The number of shards and replicas that you're going to need will depend on the query volume, the nature of the queries, and the nature of the data. Just as with index size, the only way to know is to try it with all your real data. If your query volume is large, you'll need multiple copies of the complete index, which means more servers. If you don't care how long each query takes and your query volume will be low, then your server requirements will be a LOT smaller.

Thanks,
Shawn
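For anyone who wants to play with the arithmetic above, here is a back-of-the-envelope sketch. Every multiplier in it (the 1.5x index-to-data ratio, the 8GB heap) is a made-up assumption for illustration only -- as the email says, the only real answer comes from indexing a large slice of your actual data.

```python
# Hypothetical sizing sketch: how many servers does it take to keep
# `redundancy` full copies of the index in the OS disk cache?
# All numbers below are illustrative assumptions, not measurements.

def servers_needed(index_size_gb, ram_per_server_gb, heap_gb=8,
                   redundancy=2):
    """Estimate servers needed so the whole index fits in page cache."""
    # RAM left over for the OS disk cache after the Solr/JVM heap.
    cache_per_server = ram_per_server_gb - heap_gb
    if cache_per_server <= 0:
        raise ValueError("not enough RAM for both heap and disk cache")
    total_to_cache = index_size_gb * redundancy
    # Ceiling division: partial servers don't exist.
    return -(-total_to_cache // cache_per_server)

# Hypothetical: 250 GB of source data, guessing the full index lands
# around 1.5x that once the indexed terms are added to stored fields.
index_gb = 250 * 1.5                  # 375 GB -- pure guesswork
print(servers_needed(index_gb, 32))   # two copies, 32 GB/server
```

The point of the exercise is not the specific output but how fast the server count grows when each box has only 32GB of RAM.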