I've been toying with the idea of setting up an experiment to index a large
document set (1+ TB). Any thoughts on an open data set that one could use
for this purpose?

Thanks.

On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <tburt...@umich.edu> wrote:

> Hello,
>
> Real-time search sounds difficult with that amount of data. With large
> documents (3 million documents and 5TB of data), the index will be very
> large. With indexes that large, your performance will probably be I/O bound.
>
> Do you plan on allowing phrase or proximity searches? If so, your
> performance will be even more I/O bound, as documents that large will have
> huge positions indexes that need to be read into memory to process
> phrase queries. To reduce I/O, you need as much of the index in memory as
> possible (Lucene/Solr caches and the operating system disk cache). Every
> commit invalidates the Solr/Lucene caches (unless the newer NRT code has
> solved this for Solr).
>
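
To make the point about positions concrete, here is a toy positional index in
Python. It is only a sketch of the general technique, not how Lucene stores
or reads positions; the function names and sample documents are made up for
illustration. It shows why a phrase query has to read the position lists of
every term for each candidate document, which is the data that gets huge
when documents are large.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return the ids of documents containing the terms as an exact phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()

    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])

    hits = set()
    for doc_id in candidates:
        # The position lists of all terms are read here to check adjacency.
        following = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            if all(start + i + 1 in positions
                   for i, positions in enumerate(following)):
                hits.add(doc_id)
                break
    return hits


# Made-up sample documents, for illustration only.
docs = {
    1: "the quick brown fox",
    2: "brown quick the fox",
    3: "a quick brown dog",
}
index = build_positional_index(docs)
result = phrase_search(index, "quick brown")
```

Document 2 contains both words but not next to each other in that order, so
only the position check rules it out. A plain term query would never need to
touch the positions at all.
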
> If you index and serve on the same server, you are also going to get
> terrible response time whenever your commits trigger a large merge.
>
> If you need to service 10-100 qps or more, you may need to look at putting
> your index on SSDs or spreading it over enough machines so it can stay in
> memory.
>
> What kind of response times are you looking for and what query rate?
>
> We have somewhat smaller documents. We have 10 million documents and about
> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4
> machines (i.e. 3 shards per machine).   We get an average of around
> 200-300ms response time but our 95th percentile times are about 800ms and
> 99th percentile are around 2 seconds.  This is with an average load of less
> than 1 query/second.
>
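
For anyone measuring their own setup, a minimal sketch of how average and
percentile response times can be computed from a list of timings. The numbers
below are synthetic and are not HathiTrust measurements; the nearest-rank
percentile method is just one common choice. The point is that the average
can look fine while the tail is much slower.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic response times in milliseconds (illustrative only).
timings_ms = [100] * 90 + [500] * 5 + [900] * 4 + [2000]

mean_ms = sum(timings_ms) / len(timings_ms)
p95_ms = percentile(timings_ms, 95)
p99_ms = percentile(timings_ms, 99)
```
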
> As Otis suggested, you may want to implement a strategy that allows users
> to search within the large documents by breaking the documents up into
> smaller units. What we do is have two Solr indexes.  The first indexes
> complete documents.  When the user clicks on a result, we index the entire
> document on a page level in a small Solr index on-the-fly.  That way they
> can search within the document and get page level results.
>
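
A rough sketch of the page-level step described above, in Python. The details
are assumptions: it presumes pages are separated by form-feed characters, and
the field names (id, doc_id, page, text) are hypothetical, not the actual
HathiTrust schema. It only builds the per-page records; sending them to the
small Solr index is left out.

```python
def split_into_page_docs(doc_id, full_text, page_separator="\f"):
    """Turn one large document into one small record per non-empty page."""
    records = []
    pages = full_text.split(page_separator)
    for page_number, page_text in enumerate(pages, start=1):
        text = page_text.strip()
        if not text:
            # Skip blank pages but keep the original page numbering.
            continue
        records.append({
            "id": f"{doc_id}_p{page_number}",  # hypothetical unique key
            "doc_id": doc_id,                  # links the page to its document
            "page": page_number,
            "text": text,
        })
    return records


# Made-up sample: four pages, the third one blank.
records = split_into_page_docs("vol1", "first page\fsecond page\f\fFourth")
```

Each record is small, so a query against this second index can return page
numbers instead of a single hit for the whole document.
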
> More details about our setup:
> http://www.hathitrust.org/blogs/large-scale-search
>
> Tom Burton-West
> University of Michigan Library
> www.hathitrust.org
