Hello,

Real-time search sounds difficult with that amount of data. With 3 million 
large documents and 5TB of data, the index will be very large. 
With an index that large, your performance will probably be I/O bound.  

Do you plan on allowing phrase or proximity searches? If so, your performance 
will be even more I/O bound as documents that large will have huge positions 
indexes that will need to be read into memory for processing phrase queries. To 
reduce I/O you need as much of the index as possible in memory (Lucene/Solr 
caches and the operating system disk cache).  Every commit invalidates the 
Solr/Lucene caches (unless the newer NRT code has solved this for Solr).  

If you index and serve on the same server, you are also going to get terrible 
response time whenever your commits trigger a large merge.

If you need to service 10-100 qps or more, you may need to look at putting your 
index on SSDs or spreading it over enough machines so it can stay in memory.
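
As a rough illustration of the "enough machines so it can stay in memory" 
arithmetic, here is a back-of-the-envelope sketch. The index size, RAM per 
machine, and the fraction of RAM left over for the OS disk cache are all 
made-up numbers for illustration, not measurements from any real system:

```python
import math

def machines_needed(index_gb, ram_per_machine_gb, usable_fraction):
    """Estimate how many machines are needed for the whole index to fit
    in the OS disk cache, assuming it is spread evenly across machines.

    usable_fraction is the share of RAM assumed to be left for the disk
    cache after the JVM heap and everything else (an assumption).
    """
    usable_gb = ram_per_machine_gb * usable_fraction
    return math.ceil(index_gb / usable_gb)

# Hypothetical example: a 1000 GB index, 64 GB machines, 75% usable as cache.
print(machines_needed(1000, 64, 0.75))  # -> 21
```

Your real index size depends heavily on whether you store positions, so 
measure it on a sample before trusting any estimate like this.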

What kind of response times are you looking for and what query rate?

We have somewhat smaller documents. We have 10 million documents and about 
6-8TB of data in HathiTrust and have spread the index over 12 shards on 4 
machines (i.e. 3 shards per machine).  We get an average response time of 
around 200-300ms, but our 95th percentile times are about 800ms and our 99th 
percentile times are around 2 seconds.  This is with an average load of less 
than 1 query/second.
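
If you want to look at your own numbers the same way (average versus 95th and 
99th percentile), a minimal sketch is below. It assumes you have already pulled 
the per-query response times in milliseconds out of your logs into a list. It 
uses the simple nearest-rank definition of a percentile, which is one common 
convention and not necessarily the one we use for our own reports:

```python
def percentile(times_ms, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it.
    p is an integer between 1 and 100."""
    ordered = sorted(times_ms)
    # Integer ceiling of p * n / 100, to avoid floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def summarize(times_ms):
    """Return the average plus the 95th and 99th percentile times."""
    return {
        "avg": sum(times_ms) / len(times_ms),
        "p95": percentile(times_ms, 95),
        "p99": percentile(times_ms, 99),
    }
```

The point of looking at the tail is that a few slow queries (for example, 
phrase queries with very common words) can be much worse than the average 
suggests.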

As Otis suggested, you may want to implement a strategy that allows users to 
search within the large documents by breaking the documents up into smaller 
units. What we do is have two Solr indexes.  The first indexes complete 
documents.  When the user clicks on a result, we index the entire document on a 
page level in a small Solr index on-the-fly.  That way they can search within 
the document and get page level results.
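
To make the page-level idea concrete, here is a minimal sketch of turning one 
large document into one small Solr document per page. The field names (id, 
doc_id, page_seq, page_text) and the id scheme are hypothetical and would have 
to match whatever schema you define; this is not our production code:

```python
import json

def page_docs(doc_id, pages):
    """Build one Solr document per page of a single large document.

    pages is a list of page texts in reading order. All field names
    here are illustrative assumptions, not a real schema.
    """
    docs = []
    for seq, text in enumerate(pages, start=1):
        docs.append({
            "id": f"{doc_id}_{seq}",  # hypothetical unique key per page
            "doc_id": doc_id,         # lets you filter to one document
            "page_seq": seq,          # page number, for result display
            "page_text": text,
        })
    return docs

def update_payload(docs):
    """Serialize the documents as a JSON array for a Solr update request."""
    return json.dumps(docs)
```

The resulting JSON array could then be posted to the update handler of the 
small page-level index and committed before the user's search within the 
document is run.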
 
More details about our setup: http://www.hathitrust.org/blogs/large-scale-search

Tom Burton-West
University of Michigan Library
www.hathitrust.org