Our system, which I am not at liberty to disclose, consists of 55 million documents, mostly photos and text, though video is starting to become prominent. The entire archive is about 80 terabytes, but we only index a subset of the metadata, which is stored in a MySQL database of roughly 100 GB.

The Solr index (version 1.4.1) consists of six large shards, each about 16 GB, plus a seventh shard containing the most recent seven days of documents, which is usually under 1 GB. The entire system is replicated to slave servers. Each virtual machine that houses a large shard has 9 GB of RAM, and there are three large shards on each of the four physical hosts. Each physical host is a dual quad-core with 32 GB of RAM and a six-drive SATA RAID10. We went with virtualization (Xen) for cost reasons.
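
For anyone curious about the mechanics, the query side is just Solr's standard
distributed search: whichever core receives the request lists the other shards
in the shards parameter and merges their results. A minimal sketch, with
placeholder host names rather than our actual topology:

    # fan a query out across the six large shards plus the "recent" shard
    curl http://frontend:8983/solr/select -d q=test -d rows=10 \
         -d shards=idx1:8983/solr,idx2:8983/solr,idx3:8983/solr,idx4:8983/solr,idx5:8983/solr,idx6:8983/solr,recent:8983/solr

Pointing the shards parameter at the slave cores is the usual way to keep query
load off the masters while they are indexing.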

Performance is good. If we could move to physical machines instead of virtualization, that would be optimal, but I think I'll have to settle for a RAM upgrade instead.

The main reason I stuck with distributed search is index rebuild time: I can currently rebuild the entire index in 3-4 hours, but it would take 5-6 times that long with a single large index.
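
The rebuild win is pure parallelism: six shards index at the same time instead
of one big index building serially. The exact mechanics will differ, but as a
sketch (assuming a DataImportHandler registered at /dataimport on each shard
master, and placeholder host names), a full rebuild boils down to firing the
same command at every shard at once:

    # kick off a full re-import on all six large shards in parallel
    for h in idx1 idx2 idx3 idx4 idx5 idx6; do
        curl "http://$h:8983/solr/dataimport?command=full-import" &
    done
    wait   # block until every shard finishes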


On 5/13/2011 12:37 AM, Otis Gospodnetic wrote:
With that many documents, I think GSA cost might be in millions of USD.  Don't
go there.

300M docs might be called medium these days.  Of course, if those documents
themselves are huge, then it's more resource intensive.  10 TB sounds like a lot
when it comes to search, but it's hard to tell what that represents (e.g. are
those docs with lots of photos in them?  Presentations very light on text?
Plain text documents with 300 words per page? etc.)

Anyhow, yes, Solr is a fine choice for this.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
From: atreyu <wjhendrick...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Thu, May 12, 2011 12:59:28 PM
Subject: Support for huge data set?

Hi,

I have about 300 million docs (or 10 TB of data), which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are between two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.



