Hello,

Searching in real time sounds difficult with that amount of data. With large documents, 3 million of them, and 5TB of data, the index will be very large. With an index that large, your performance will probably be I/O bound.
Do you plan on allowing phrase or proximity searches? If so, your performance will be even more I/O bound, because documents that large have huge position indexes that must be read into memory to process phrase queries. To reduce I/O you need as much of the index as possible in memory (the Lucene/Solr caches and the operating system disk cache). Every commit invalidates the Solr/Lucene caches (unless the newer NRT code has solved this for Solr), so it is worth configuring cache autowarming; see the solrconfig.xml sketch in the P.S. below. If you index and serve on the same server, you are also going to get terrible response times whenever your commits trigger a large merge.

If you need to serve 10-100 queries per second or more, you may need to look at putting your index on SSDs or spreading it over enough machines that it can stay in memory. What kind of response times are you looking for, and at what query rate?

We have somewhat smaller documents: 10 million documents and about 6-8TB of data in HathiTrust, spread over 12 shards on 4 machines (i.e. 3 shards per machine); a sketch of how the shards are addressed at query time is also below. We get an average response time of around 200-300ms, but our 95th-percentile times are about 800ms and our 99th-percentile times are around 2 seconds. This is with an average load of less than 1 query/second.

As Otis suggested, you may want to implement a strategy that lets users search within the large documents by breaking them up into smaller units. What we do is keep two Solr indexes. The first indexes complete documents. When the user clicks on a result, we index that entire document at the page level, on the fly, into a small second Solr index (see the last sketch below). That way they can search within the document and get page-level results.

More details about our setup: http://www.hathitrust.org/blogs/large-scale-search

Tom Burton-West
University of Michigan Library
www.hathitrust.org
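
P.S. A few sketches, in case they help. To take some of the sting out of cache invalidation on commit, you can autowarm the caches and warm each new searcher in solrconfig.xml. This is only a minimal sketch; the cache sizes, autowarm counts, and warming query are made-up placeholders you would tune for your own index:

  <query>
    <!-- refill part of each cache from the old searcher after a commit -->
    <filterCache      class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
    <queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
    <!-- documentCache cannot usefully be autowarmed (internal docids change between searchers) -->
    <documentCache    class="solr.LRUCache" size="16384" initialSize="4096"/>

    <!-- run a representative query against the new searcher before it serves traffic -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">"a common phrase query"</str>
          <str name="rows">10</str>
        </lst>
      </arr>
    </listener>
  </query>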
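
Distributed search over the shards is just a query-time parameter in Solr; the node that receives the query fans it out and merges the results. The hostnames here are hypothetical, but this is roughly what our layout looks like to the application (the real list has 12 shard entries, 3 Solr instances per machine):

  http://search-front-end:8983/solr/select?q=dog+AND+cat
      &shards=solr-host1:8983/solr,solr-host1:8984/solr,solr-host1:8985/solr,solr-host2:8983/solr,...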
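
The on-the-fly page-level indexing amounts to posting one small document per page to the second Solr index when the user clicks into a result, followed by a commit so the pages are searchable immediately. A minimal sketch of the update XML, with made-up IDs and field names:

  <add>
    <doc>
      <field name="id">vol123-page-0001</field>
      <field name="volume_id">vol123</field>
      <field name="page_number">1</field>
      <field name="ocr_text">...OCR text of page 1...</field>
    </doc>
    <!-- one <doc> per page of the clicked-on volume -->
  </add>
  <commit/>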