Ryan McKinley wrote:

We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR text. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field (indexed, not stored) containing the OCR, typically ~1.4MB. Some limited faceting or additional metadata fields may be added later.
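
For concreteness, a minimal schema.xml along these lines might look roughly like the sketch below (field and type names are illustrative, not our actual schema):

    <schema name="ocr" version="1.1">
      <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
        <fieldType name="text" class="solr.TextField">
          <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
      </types>
      <fields>
        <!-- numeric id: stored and indexed (a string key is fine for numeric ids) -->
        <field name="id" type="string" indexed="true" stored="true"/>
        <!-- OCR text: indexed, not stored, ~1.4MB per document -->
        <field name="ocr" type="text" indexed="true" stored="false"/>
      </fields>
      <uniqueKey>id</uniqueKey>
      <defaultSearchField>ocr</defaultSearchField>
    </schema>

One thing we will have to watch with fields this large: if I understand it correctly, maxFieldLength in solrconfig.xml defaults to 10000 tokens, so we would need to raise it or most of each OCR field would be silently dropped at index time.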

I have not done anything on this scale... but with https://issues.apache.org/jira/browse/SOLR-303 it will be possible to split a large index into many smaller indices and return the union of all results. This may or may not be necessary depending on what the data actually looks like (if your text only uses 100 words, your index may not be that big).
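
If it does turn out to be necessary, a distributed request with that patch would look something like this (the "shards" parameter comes from the SOLR-303 patch and could still change before it is committed):

    http://localhost:8983/solr/select?q=text:foo&fl=id,score&shards=host1:8983/solr,host2:8983/solr

Each shard is an ordinary Solr index; the node that receives the request queries them all and merges the ranked results.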

How many documents are you talking about?


Currently 1M docs @ ~1.4MB/doc, scaling to 7M docs. This is OCR, so we are talking perhaps 50K distinct words total to index, so as you point out the index might not be too big. It's the *data* that is big, not the *index*, right? So I don't think SOLR-303 (distributed search) is required here.

Obviously as the number of documents increases the index size must increase to some degree -- linearly, I think? But what index size should we expect for 7M documents over a 50K-word vocabulary, with just 2 fields per doc: one id field and one OCR field of ~1.4MB? Ballpark?
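
For what it's worth, here is my own back-of-envelope; the bytes-per-token and bytes-per-position figures are pure guesses on my part, so please correct them:

    // Rough index-size estimate -- every constant here is an assumption.
    public class IndexSizeBallpark {
        public static void main(String[] args) {
            long docs = 7000000L;           // target number of documents
            double bytesPerDoc = 1.4e6;     // ~1.4MB of OCR text per document
            double bytesPerToken = 6.0;     // guessed avg term length + whitespace
            double bytesPerPosition = 2.0;  // guessed cost per term position after
                                            // Lucene's delta/vInt compression

            double rawTextTB = docs * bytesPerDoc / 1e12;               // ~9.8 TB
            double totalTokens = docs * (bytesPerDoc / bytesPerToken);  // ~1.6e12
            double positionsTB = totalTokens * bytesPerPosition / 1e12; // ~3.3 TB

            System.out.printf("raw text  ~%.1f TB%n", rawTextTB);
            System.out.printf("tokens    ~%.2e%n", totalTokens);
            System.out.printf("positions ~%.1f TB%n", positionsTB);
        }
    }

If that is even roughly right, the postings (frequencies and positions) scale with the total token count rather than with the 50K distinct terms, so the index could be a sizable fraction of the raw text even with a tiny vocabulary -- hence the request for a sanity check.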

Regarding single-word queries, do you think, say, 0.5 sec/query to return 7M score-ranked IDs is possible/reasonable in this scenario?
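
Concretely, I'm picturing a request along these lines, with the standard fl and rows parameters and rows set to the whole corpus (field name illustrative):

    http://localhost:8983/solr/select?q=ocr:foo&fl=id,score&rows=7000000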



Should we expect Solr indexing to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?


You may want to check out the Lucene benchmark stuff:
http://lucene.apache.org/java/docs/benchmarks.html

http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
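
A minimal byTask algorithm file looks roughly like this (adapted from the package docs -- exact property names vary a bit by version, and you would point doc.maker at your own DocMaker that feeds it the OCR files instead of SimpleDocMaker):

    analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
    directory=FSDirectory
    doc.stored=false
    doc.tokenized=true
    doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
    query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

    ResetSystemErase
    CreateIndex
    { AddDoc } : 10000
    Optimize
    CloseIndex
    OpenReader
    { Search } : 100
    CloseReader
    RepSumByName

That would give you rough indexing and search timings with your own analyzer and documents before committing to anything at this scale.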


ryan

