Nice description of a use-case. My 2 pennies embedded...
Phillip Farber wrote:
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting. The schema is simple
too. A document consists of a numeric id, stored and indexed and a
large text field, indexed not stored, containing the OCR typically
~1.4Mb. Some limited faceting or additional metadata fields may be
added later.
It has been my experience that large fields are a bit of a problem. If
possible, try to segment them. I am only suggesting this because indexing,
not querying, seems to be the bottleneck.
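As a rough illustration of what segmenting could look like before the text is
posted to Solr, here is a minimal Python sketch. The chunk size, the function
names and the "<id>-<n>" segment id scheme are all assumptions made for the
example; nothing here comes from Solr itself.

```python
def segment_text(text, max_chars=100_000):
    """Split text into chunks of at most max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole, so a chunk can
    exceed the limit in that one case.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        # One extra character for the joining space, except on the first word.
        extra = len(word) + (1 if current else 0)
        if current and size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [word], len(word)
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def segment_document(doc_id, text, max_chars=100_000):
    """Yield (segment_id, parent_id, chunk) tuples.

    The segment id scheme is hypothetical; the parent id is kept so that
    hits on segments can be mapped back to the original document.
    """
    for n, chunk in enumerate(segment_text(text, max_chars)):
        yield (f"{doc_id}-{n}", doc_id, chunk)
```

Note that if each segment is indexed as its own document, hits and scores come
back per segment, and a phrase that straddles a segment boundary will not match.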
The data in question currently amounts to about 1.1Tb of OCR (about 1M
docs) which we expect to increase to 10Tb over time. Pilot tests on
the desktop w/ 2.6 GHz P4 with 2.5 Gb memory, java 1Gb heap on ~180 Mb
of data via HTTP suggest we can index at a rate sufficient to keep up
with the inputs (after getting over the 1.1 Tb hump). We envision
nightly commits/optimizes.
We expect to have low QPS (<10) rate and probably will not need
millisecond query response.
Our environment makes available Apache on blade servers (Dell 1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
with 2 blades and will add as demands require.
While we have a lot of storage, the idea of master/slave Solr
Collection Distribution to add more Solr instances clearly means
duplicating an immense index. Is it possible to use one instance to
update the index on NAS while other instances only read the index and
commit to keep their caches warm instead?
So, let me get this straight. You want to put the index in 'true' shared
storage: just one copy of it on NAS, with Solr used in a diskless
fashion? If this is the case, I do not see why you cannot do what you
want to do. Just do not use a master/slave configuration at all, since
that configuration keeps every Solr box as similar and up to date as
possible by creating multiple copies of the index. Have one Solr box
write to the index (Lucene allows only one writer on an index at a
time). Have one or more Solr boxes search the index. Load balance the
search boxes yourself with a simple round robin.
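A simple round robin over the search boxes could be sketched like this in
Python. The host names and port are made up for the example, and the class and
function names are hypothetical; the only Solr-specific part is the
/solr/select request path.

```python
import itertools
from urllib.parse import urlencode


class RoundRobin:
    """Hand out hosts from a fixed list in rotating order."""

    def __init__(self, hosts):
        hosts = list(hosts)
        if not hosts:
            raise ValueError("need at least one host")
        self._cycle = itertools.cycle(hosts)

    def next_host(self):
        return next(self._cycle)


def select_url(balancer, query):
    """Build a search URL against the next search box in the rotation."""
    return f"http://{balancer.next_host()}/solr/select?{urlencode({'q': query})}"


# Hypothetical search boxes, all reading the same index on the NAS.
searchers = RoundRobin(["search1:8983", "search2:8983"])
```

Each call to select_url(searchers, "some query") then targets the next box in
turn. This sketch does no health checking, so a box that is down still gets its
share of requests.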
Should we expect Solr indexing time to slow significantly as we scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?
Solr can be used, and has been proven, at this kind of scale. But if
simple index/search is all you are after, you might also consider writing
the index, search and update programs yourself.
Given these parameters is it realistic to think that Solr could handle
the task?
I am sure it could. Please keep us posted on your approach and on what
worked for you, as yours is a very generic problem and should be
documented from the use-case (your mail) to design (this thread) to
implementation (your decision) and performance (your benchmarks).
Any advice/wisdom greatly appreciated,
Phil