Our system, which I am not at liberty to disclose, consists of 55 million documents, mostly photos and text, though video is starting to become prominent. The entire archive is about 80 terabytes, but we only index a subset of the metadata, which is stored in a MySQL database of roughly 100 GB.

The Solr index (version 1.4.1) consists of six large shards, each about 16 GB, plus a seventh shard containing the most recent seven days of documents, which is usually under 1 GB. The entire system is replicated to slave servers. Each virtual machine that houses a large shard has 9 GB of RAM, and there are three large shards on each of the four physical hosts. Each physical host is a dual quad-core with 32 GB of RAM and a six-drive SATA RAID10. We went with virtualization (Xen) for cost reasons.
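
For anyone curious about the mechanics, the query side is just Solr's standard
distributed search: whichever core receives the request lists the other shards
in the shards parameter and merges their results. A minimal sketch, with
placeholder host names rather than our actual topology:

    # fan a query out across the six large shards plus the "recent" shard
    curl http://frontend:8983/solr/select -d q=test -d rows=10 \
         -d shards=idx1:8983/solr,idx2:8983/solr,idx3:8983/solr,idx4:8983/solr,idx5:8983/solr,idx6:8983/solr,recent:8983/solr

Pointing the shards parameter at the slave cores is the usual way to keep query
load off the masters while they are indexing.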

Performance is good. If we could move to physical machines instead of virtualization, that would be optimal, but I think I'll have to settle for a RAM upgrade instead.

The main reason I stuck with distributed search is index rebuild time: I can currently rebuild the entire index in 3-4 hours, but it would take 5-6 times that long with a single large index.
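
The rebuild win is pure parallelism: six shards index at the same time instead
of one big index building serially. The exact mechanics will differ, but as a
sketch (assuming a DataImportHandler registered at /dataimport on each shard
master, and placeholder host names), a full rebuild boils down to firing the
same command at every shard at once:

    # kick off a full re-import on all six large shards in parallel
    for h in idx1 idx2 idx3 idx4 idx5 idx6; do
        curl "http://$h:8983/solr/dataimport?command=full-import" &
    done
    wait   # block until every shard finishes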


On 5/13/2011 12:37 AM, Otis Gospodnetic wrote:
With that many documents, I think GSA cost might be in millions of USD.  Don't
go there.

300M docs might be called medium these days.  Of course, if those documents
themselves are huge, then it's more resource intensive.  10 TB sounds like a lot
when it comes to search, but it's hard to tell what that represents (e.g. are
those docs with lots of photos in them?  Presentations very light on text?
Plain text documents with 300 words per page? etc.)

Anyhow, yes, Solr is a fine choice for this.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
From: atreyu <wjhendrick...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Thu, May 12, 2011 12:59:28 PM
Subject: Support for huge data set?

Hi,

I have about 300 million docs (or 10 TB of data), which is doubling every 3
years, give or take.  The data mostly consists of Oracle records, webpage
files (HTML/XML, etc.) and office doc files.  There are between two and four
dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
but it still gets extremely taxed, and this will only get worse.

Would Solr be able to efficiently deal with a load of this size?  I am
trying to avoid the heavy cost of GSA, etc...

Thanks.



