Hi List,

We are pretty new to Solr & Lucene and have just started indexing a few 10K documents using Solr. Before we attempt anything bigger, we want to find out what the best approach would be.
Documents: We have close to 12 million XML docs of varying sizes, with an average size of 20 KB. These documents have 150 fields, all of which should be indexed & searchable. Over 80% of these are fixed-length string fields, a few of which are multivalued (e.g. title, headline, id, submitter, reviewers, suggested-titles etc.). Another 15% are date fields (added-on, reviewed-on etc.). The rest are multivalued text fields (e.g. description, summary, comments, notes etc.). Some of the documents have a large number of these text fields, so we are leaning against storing them in the index. Approximately 6000 such documents are updated & 400-800 new ones are added each day.

Queries: A typical query would mainly be on string fields (~60% of queries). A simple one would be: find the ids of documents whose author is XYZ & which were submitted between [X-Z] & whose status is reviewed or pending review & whose title contains a given string, etc. The results of these are exact in nature (e.g. found 300 docs). The rest of the searches would include the text fields, where users search for quoted snippets or phrases. Almost all queries have multiple operators. Also, each user would want to grab as many result rows as possible (we are limiting this to 2000). The output will contain only 1-5 fields. (No highlighting etc. needed.)

Available hardware: The existing hardware we could find consists of 4 boxes with ~96 GB each, each with ~300 GB of SAN storage. We also have a couple of older HP DL380s (which we mainly want to use for offline indexing). All of this is on 10G Ethernet.

Questions: Our priority is to provide results fast, and new or updated documents should be indexed within 2 hours. Users are also known to use complex queries for data mining. Given all this, are there any recommendations for indexing the data and setting up the fields? How do we scale, and what architecture should we follow here? Master/slave servers? Any possible issues we may hit?

Thanks