Hi,

Our system [1] consists of over 220 million semi-structured web documents (RDF, Microformats, etc.), ranging from fairly small documents (a few KB) to large ones (a few MB). On top of that, each document has about a dozen additional fields for indexing and storing metadata about the document.
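To give a rough picture of what one of these documents looks like on the Solr side, it is basically the raw payload plus its metadata fields. A minimal SolrJ sketch (the field names are invented for the example, not our actual schema):

import org.apache.solr.common.SolrInputDocument;

public class ExampleDocument {
  // Builds one Solr document: the payload itself plus a handful of metadata fields.
  // Field names are hypothetical, for illustration only.
  public static SolrInputDocument build() {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.org/some-page");
    doc.addField("content", "... raw RDF / Microformats payload, a few KB to a few MB ...");
    doc.addField("domain", "example.org");
    doc.addField("format", "RDF");
    doc.addField("fetched_at", "2011-05-12T00:00:00Z");
    doc.addField("triple_count", 42);
    // ... and so on, up to roughly a dozen metadata fields per document.
    return doc;
  }
}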

It runs on top of Solr 3.1 with the following configuration:
- 2 master indexes
- 2 slave indexes
Each server is a quad-core with 32 GB of RAM and 4 SATA drives in RAID 10.
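The split is the usual one: all updates go to the masters, all user queries go to the slaves. A minimal SolrJ sketch of that split (hostnames and field names are hypothetical, not our actual setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class MasterSlaveClients {
  public static void main(String[] args) throws Exception {
    // Hypothetical hostnames: writes go to a master, reads go to a slave.
    SolrServer master = new CommonsHttpSolrServer("http://master1:8983/solr");
    SolrServer slave  = new CommonsHttpSolrServer("http://slave1:8983/solr");

    // Updates are only ever sent to the masters...
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.org/another-page");
    master.add(doc);

    // ...while user queries only hit the slaves, which serve the last replicated snapshot.
    QueryResponse rsp = slave.query(new SolrQuery("semantic web"));
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}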

Indexing performance is quite good: we can reindex our full data collection in less than a day (using only the two master indexes). Live updates (a few million documents per day) are processed continuously by our masters, and we replicate the changes to the slave indexes every hour. Query performance is also fine (you can try it for yourself at [1]).
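The replication itself is standard Solr 3.x master/slave replication. As a rough sketch of one way to drive an hourly pull, assuming the standard /replication handler is enabled on the slave and automatic polling is disabled so the pull is triggered externally (e.g. from a cron job):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TriggerReplication {
  public static void main(String[] args) throws Exception {
    // Hypothetical slave URL; asks the slave to pull the latest index from its master.
    URL url = new URL("http://slave1:8983/solr/replication?command=fetchindex");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      if (conn.getResponseCode() != 200) {
        throw new RuntimeException("fetchindex failed: HTTP " + conn.getResponseCode());
      }
      InputStream body = conn.getInputStream();
      body.close(); // the response only acknowledges that the fetch was started
    } finally {
      conn.disconnect();
    }
  }
}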

As a side note, we are using Solr 3.1 plus a plugin we have developed for indexing semi-structured data. This plugin adds much more data to the index than plain Solr does, so with plain Solr you can expect even better indexing performance.

[1] http://sindice.com
--
Renaud Delbru

On 12/05/11 17:59, atreyu wrote:
Hi,

I have about 300 million docs (or 10TB data) which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are b/t two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.


