On 11/30/2010 2:27 PM, Cinquini, Luca (3880) wrote:
Hi,
        I'd like to know if anybody has suggestions/opinions on what is 
currently the best architecture for a distributed search system using Solr. The 
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique 
content.

Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: 
if any of the nodes goes down, the search won't -at this time- return partial 
results, but instead report an exception.
Option C) would provide fault tolerance, at least for any search initiated at a 
node that is available, but would incur into a large replication overhead.

Exactly what will work best for you is highly dependent on your specific requirements. Answers to the following questions will influence the choice. If you choose distributed, they will also affect how you distribute the data among the different machines:

* How many documents?
* How much disk space will the index take up?
* Do you need uniform IDF across the entire corpus?
* How often do you need to insert new content?
* How often do you need to delete old content?
* What query volume do you have to support?

For my index, I use distributed+replicated, for redundancy.  Statistics:

* Low query volume.
* 49 million documents
* Total index size: 87GB (using 1024, not 1000)
* Adding about 40-50000 new documents every day
* Inserts done every two minutes.
* Deletes done every ten minutes.
* Six shards with static content. Each one is 14GB and 8+ million documents. * One shard with content less than a week old. It is usually less than 1GB in size.

The entire system consists of 18 virtual machines on four physical hosts. There is a master VMs and a slave VM for each shard. The static shard VMs have 9GB of RAM, the recent content shard has 3GB of RAM. Two VMs with 1.5GB of RAM are Solr instances that do not have indexes, they are used as search brokers. Two VMs are load balancers, running heartbeat and HAProxy. The load balancers present the two search brokers as a single IP address. Each of the VMs in a pair is running on a different physical host.

By low query volume, I mean that the average queries per second is less than 1, but this is over a long time period, so it includes nights and weekends when there is very little traffic. Even during heavy times, I would estimate that the queries per second is still single digit. Over the last 11 days and 21 hours, the website made 320,000 queries. In the same time period, there were 340,000 load balancer health-check queries, at a rate of two every five seconds.

Shawn

Reply via email to