We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too: a document
consists of a numeric id (stored and indexed) and a large text field
(indexed but not stored) containing the OCR text, typically ~1.4 MB.
Some limited faceting or additional metadata fields may be added later.
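Roughly, the schema.xml would look something like the sketch below (the
field names and the stock "string"/"text" field types are just our
provisional guesses):

  <fields>
    <!-- numeric id: indexed for lookup, stored so it can be returned -->
    <field name="id" type="string" indexed="true" stored="true"/>
    <!-- OCR text: indexed for search but not stored, to keep index size down -->
    <field name="ocr_text" type="text" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>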
I have not done anything on this scale, but with
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
split a large index into many smaller indices and return the union of
all results. This may or may not be necessary depending on what the
data actually looks like (if your text just uses ~100 distinct words,
your index may not be that big).
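To sketch the idea (the patch is still in progress, so the parameter
name and syntax here are just illustrative), one node would take a
normal query plus a list of the sub-index hosts and merge their
results, something like:

  http://host1:8983/solr/select?q=ocr_text:whaling
      &shards=host1:8983/solr,host2:8983/solr,host3:8983/solr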
How many documents are you talking about?
Should we expect Solr indexing time to slow significantly as we scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?
You may want to check out the Lucene benchmark stuff:
http://lucene.apache.org/java/docs/benchmarks.html
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
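The byTask framework is driven by small "algorithm" files; an untested
sketch along these lines (the property values and the Reuters doc maker
are placeholders; you would point it at your own OCR docs) builds an
index of 10,000 documents and reports per-task timings:

  # index-time properties (illustrative values)
  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  directory=FSDirectory
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
  docs.dir=reuters-out

  # build an index of 10,000 docs, then report timings by task name
  ResetSystemErase
  { "BuildIndex"
      CreateIndex
      { AddDoc } : 10000
      CloseIndex
  }
  RepSumByName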
ryan