Seek your wisdom for implementing 12 million docs..

Ikhsvaku S Sun, 25 Sep 2011 13:00:37 -0700

Hi List,

We are pretty new to Solr & Lucene and have just started indexing a few 10K
documents using Solr. Before we attempt anything bigger, we want to see what
the best approach would be.


Documents: We have close to 12 million XML docs of varying sizes, averaging
about 20 KB each. These documents have 150 fields, all of which should be
indexed & searchable. Over 80% of them are fixed-length string fields, a few
of which are multivalued (e.g. title, headline, id, submitter, reviewers,
suggested-titles etc.); another 15% are date fields (added-on, reviewed-on
etc.). The rest are multivalued text fields (e.g. description, summary,
comments, notes etc.). Some of the documents have a large number of these
text fields (so we are leaning against storing these in the index).
Approximately 6000 documents are updated & 400-800 new ones are added each
day.
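
To put rough numbers on the above, here is a back-of-the-envelope sketch in
Python. It assumes the 20 KB average holds across the corpus and uses decimal
units; the function and variable names are ours, purely for illustration, and
the result is raw XML volume, not index size.

```python
# Figures taken from the description above.
DOCS = 12_000_000
AVG_DOC_KB = 20
UPDATED_PER_DAY = 6_000
NEW_PER_DAY_RANGE = (400, 800)


def corpus_size_gb(docs, avg_kb):
    """Raw corpus size in GB (decimal: 1 GB = 1,000,000 KB)."""
    return docs * avg_kb / 1_000_000


def daily_churn_mb(updated, new, avg_kb):
    """Raw size in MB of the documents touched per day (updates plus adds)."""
    return (updated + new) * avg_kb / 1_000


raw_gb = corpus_size_gb(DOCS, AVG_DOC_KB)
churn_low = daily_churn_mb(UPDATED_PER_DAY, NEW_PER_DAY_RANGE[0], AVG_DOC_KB)
churn_high = daily_churn_mb(UPDATED_PER_DAY, NEW_PER_DAY_RANGE[1], AVG_DOC_KB)

print(f"raw corpus: {raw_gb:.0f} GB")
print(f"daily churn: {churn_low:.0f}-{churn_high:.0f} MB")
```

That works out to roughly 240 GB of raw XML, with about 128-136 MB of new or
changed documents per day.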

Queries: A typical query would mainly be on string fields (~60% of queries).
A simple one would be: find the ids of documents whose author is XYZ,
submitted between [X-Z], whose status is reviewed or pending review, and
whose title contains a given string. The results of these are exact in
nature (e.g. found 300 docs). The rest of the searches would include the text
fields, where users search for quoted snippets or phrases. Almost all queries
have multiple operators. Each query would also want to grab as many result
rows as possible (we are limiting this to 2000). The output will contain only
1-5 fields. (No highlighting etc. needed.)
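
As an illustration of the simple query described above, here is a minimal
sketch in Python that only builds the request URL (it does not contact a
server). The field names (author, submitted_on, status, title, id), the base
URL and the helper name are assumptions made for the example, not our actual
schema. It puts the exact-match restrictions into Solr's fq (filter query)
parameters and the title phrase into q, and uses fl and rows to limit the
output.

```python
from urllib.parse import urlencode


def build_select_url(base_url, author, start, end, statuses, title_phrase,
                     rows=2000, fields=("id",)):
    """Build a Solr select URL for the 'simple' query described above.

    All field names are hypothetical. Inputs are assumed not to contain
    double quotes or other characters that need query-syntax escaping.
    """
    status_clause = " OR ".join(f'"{s}"' for s in statuses)
    params = [
        ("q", f'title:"{title_phrase}"'),            # phrase match on title
        ("fq", f'author:"{author}"'),                # exact author
        ("fq", f"submitted_on:[{start} TO {end}]"),  # inclusive date range
        ("fq", f"status:({status_clause})"),         # any of the statuses
        ("fl", ",".join(fields)),                    # return only these fields
        ("rows", str(rows)),                         # cap on result rows
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/select",  # assumed local test instance
    author="XYZ",
    start="2011-01-01T00:00:00Z",
    end="2011-06-30T23:59:59Z",
    statuses=["reviewed", "pending review"],
    title_phrase="some string",
)
print(url)
```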

Available hardware:
The existing hardware we could find consists of 4 boxes with ~96 GB each,
each with ~300 GB of SAN storage. We also have a couple of older HP DL380s
(which we mainly want to use for offline indexing). All of this is on 10G
Ethernet.

Questions:
Our priority is to provide results fast, and new or updated documents
should be indexed within 2 hours. Users are also known to use complex queries
for data mining. Given all this, are there any recommendations for indexing
the data and fields?
How do we scale, and what architecture should we follow here? Master/slave
servers? Any possible issues we may hit?

Thanks
