Re: Solr feasibility with terabyte-scale data

Srikant Jakilinki Fri, 18 Jan 2008 20:15:21 -0800

Nice description of a use-case. My 2 pennies embedded...


Phillip Farber wrote:

Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed, and a large text field, indexed not stored, containing the OCR, typically ~1.4Mb. Some limited faceting or additional metadata fields may be added later.

It has been my experience that large fields are a bit of a problem. If possible, try to segment them. I am suggesting this only because indexing seems to be the bottleneck, not the queries.
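
To make the segmenting suggestion concrete, here is a minimal sketch in Python of splitting one large OCR text into several smaller documents before they are sent for indexing. Everything named here is an assumption for illustration: the function segment_text, the field names id, parent_id and ocr, and the 100,000-character limit do not come from Phil's schema. Note also that the segment ids are strings, whereas the original schema uses a numeric id.

```python
def segment_text(doc_id, text, max_chars=100_000):
    """Yield small documents, each holding at most max_chars of OCR text.

    Field names ("id", "parent_id", "ocr") are hypothetical; adapt them
    to the real schema.
    """
    seq = 0
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so that no word is cut in half.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            yield {"id": f"{doc_id}_{seq}", "parent_id": doc_id, "ocr": chunk}
            seq += 1
        start = end


if __name__ == "__main__":
    for seg in segment_text(42, "aaa bbb ccc ddd", max_chars=8):
        print(seg)
```

The trade-off is that hits then come back per segment, so the application would have to map them back to the original document, for example through the parent_id field assumed above.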

The data in question currently amounts to about 1.1Tb of OCR (about 1M docs) which we expect to increase to 10Tb over time. Pilot tests on the desktop w/ 2.6 GHz P4 with 2.5 Gb memory, java 1Gb heap on ~180 Mb of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 Tb hump). We envision nightly commits/optimizes.
We expect to have a low QPS (<10) rate and probably will not need millisecond query response.
Our environment makes available Apache on blade servers (Dell 1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
with 2 blades and will add as demands require.
While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?

So, let me get this straight. You want to put the index in 'true' shared storage: just one copy of it on NAS, with Solr used in a diskless fashion? If this is the case, I do not see why you cannot do what you want to do. Just do not have a master/slave configuration at all, since that configuration keeps all Solr boxes as similar and up-to-date as possible by creating multiple copies of the index. Have one or more Solr boxes build the index. Have one or more Solr boxes search the index. Load balance the search boxes yourself with a simple round robin.
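
As a rough illustration of the "simple round robin" idea, the Python sketch below hands out search hosts in turn and builds a query URL against whichever host comes next. The host names blade1 and blade2, the port 8983, and the helper names RoundRobin and select_url are assumptions made up for this example; only the /select?q=... request form is standard Solr.

```python
import itertools
from urllib.parse import urlencode


class RoundRobin:
    """Hand out hosts one after another, wrapping around at the end."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("need at least one search host")
        self._cycle = itertools.cycle(list(hosts))

    def next_host(self):
        return next(self._cycle)


# Hypothetical search-only Solr instances reading the shared index on NAS.
SEARCHERS = ["http://blade1:8983/solr", "http://blade2:8983/solr"]
balancer = RoundRobin(SEARCHERS)


def select_url(query):
    """Build a Solr select URL on the next search host in rotation."""
    return f"{balancer.next_host()}/select?{urlencode({'q': query})}"


if __name__ == "__main__":
    for _ in range(3):
        print(select_url("ocr:whale"))
```

This does no health checking; a box that is down would still receive its share of queries, so a real setup would want to skip hosts that fail.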

Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?

Solr can be used and proved at this kind of scale. But if it is simple index/search you are after, you might also consider writing the index-search-update programs yourself.

Given these parameters is it realistic to think that Solr could handle the task?

I am sure it would. Please keep us posted on your approach and on what worked for you, as yours is a very generic problem and should be documented from the use-case (your mail) to design (this thread) to implementation (your decision) and performance (your benchmarks).


Any advice/wisdom greatly appreciated,

Phil


