Re: distributed architecture

Shawn Heisey Tue, 30 Nov 2010 16:42:42 -0800

On 11/30/2010 2:27 PM, Cinquini, Luca (3880) wrote:

Hi,
        I'd like to know if anybody has suggestions/opinions on what is 
currently the best architecture for a distributed search system using Solr. The 
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique 
content.


Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: 
if any of the nodes goes down, the search won't -at this time- return partial 
results, but instead report an exception.
Option C) would provide fault tolerance, at least for any search initiated at a 
node that is available, but would incur a large replication overhead.

Exactly what will work best for you is highly dependent on your specific requirements. Answers to the following questions will influence the choice. If you choose distributed, they will also affect how you distribute the data among the different machines:


* How many documents?
* How much disk space will the index take up?
* Do you need uniform IDF across the entire corpus?
* How often do you need to insert new content?
* How often do you need to delete old content?
* What query volume do you have to support?

For my index, I use distributed+replicated, for redundancy.  Statistics:

* Low query volume.
* 49 million documents
* Total index size: 87GB (using 1024, not 1000)
* Adding about 40,000-50,000 new documents every day
* Inserts done every two minutes.
* Deletes done every ten minutes.

* Six shards with static content. Each one is 14GB and 8+ million documents.
* One shard with content less than a week old. It is usually less than 1GB in size.
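To illustrate the split between one recent shard and six static shards, here is a minimal Python sketch of how a document might be assigned to a shard. The seven-day cutoff comes from the description above. Everything else is an assumption made for illustration: the function name `pick_shard`, the shard labels, and in particular the CRC32 hash used to spread older documents across the static shards (the post does not say how static content is divided).

```python
import zlib
from datetime import datetime, timedelta

STATIC_SHARD_COUNT = 6                 # six static shards, per the list above
RECENT_WINDOW = timedelta(days=7)      # "content less than a week old"


def pick_shard(doc_id, created, now):
    """Return a label for the shard a document would live in.

    Hypothetical helper: shard labels and hash placement are assumptions.
    """
    if now - created < RECENT_WINDOW:
        return "recent"
    # Assumption: older documents are spread by a stable hash of their ID.
    bucket = zlib.crc32(doc_id.encode("utf-8")) % STATIC_SHARD_COUNT
    return f"static{bucket}"


if __name__ == "__main__":
    now = datetime(2010, 11, 30)
    print(pick_shard("doc-1", now - timedelta(days=2), now))
    print(pick_shard("doc-1", now - timedelta(days=30), now))
```

A stable hash keeps a given document on the same static shard every time, which matters for deletes; any other deterministic rule (date ranges, ID ranges) would serve the same purpose.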

The entire system consists of 18 virtual machines on four physical hosts. There is a master VM and a slave VM for each shard. The static shard VMs have 9GB of RAM; the recent content shard has 3GB of RAM. Two VMs with 1.5GB of RAM are Solr instances that do not have indexes; they are used as search brokers. Two VMs are load balancers, running heartbeat and HAProxy. The load balancers present the two search brokers as a single IP address. Each of the VMs in a pair is running on a different physical host.
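As a rough illustration of what a client request through the brokers looks like, the Python sketch below builds a distributed search URL using Solr's `shards` request parameter, which takes a comma-separated list of `host:port/path` entries. The host names, port, and the helper name `distributed_query_url` are placeholders invented for this example, not the real machine names. In practice the `shards` list could also be set as a default in the broker's request handler configuration rather than sent by every client.

```python
from urllib.parse import urlencode

# Hypothetical host names; the actual machines are not named in this post.
STATIC_SHARDS = [f"static{i}.example.com:8983/solr" for i in range(1, 7)]
RECENT_SHARD = "recent.example.com:8983/solr"
# Placeholder for the single address the load balancers present.
BROKER_VIP = "http://search.example.com:8983/solr"


def distributed_query_url(query, rows=10):
    """Build a select URL asking a broker to fan out to all seven shards."""
    params = {
        "q": query,
        "rows": rows,
        # Solr's `shards` parameter: comma-separated entries, no URL scheme.
        "shards": ",".join(STATIC_SHARDS + [RECENT_SHARD]),
    }
    return f"{BROKER_VIP}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(distributed_query_url("title:solr"))
```

Because the brokers hold no index of their own, they only merge results from the shards listed in the request, which is why they can get by with far less RAM than the shard VMs.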

By low query volume, I mean that the average queries per second is less than 1, but this is over a long time period, so it includes nights and weekends when there is very little traffic. Even during heavy times, I would estimate that the queries per second is still in the single digits. Over the last 11 days and 21 hours, the website made 320,000 queries. In the same time period, there were 340,000 load balancer health-check queries, at a rate of two every five seconds.
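The averages can be worked out from the figures above with a few lines of Python. This is only a back-of-the-envelope check; the counts are the rounded totals quoted in the paragraph, and the variable names are made up for the example.

```python
# Figures quoted above (rounded totals)
WINDOW_SECONDS = (11 * 24 + 21) * 3600   # 11 days 21 hours = 285 hours
SITE_QUERIES = 320_000
HEALTH_CHECKS = 340_000


def per_second(count, seconds=WINDOW_SECONDS):
    """Average rate over the whole window, nights and weekends included."""
    return count / seconds


site_qps = per_second(SITE_QUERIES)
health_qps = per_second(HEALTH_CHECKS)

print(f"window: {WINDOW_SECONDS} s")
print(f"website queries: {site_qps:.2f}/s")
print(f"health checks:   {health_qps:.2f}/s (nominal rate {2 / 5:.2f}/s)")
```

The website average comes out at roughly 0.31 queries per second, which is where "less than 1" comes from; the health checks alone generate about as much traffic as real users.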


Shawn
