I've been toying with the idea of setting up an experiment to index a large document set (1+ TB) -- any thoughts on an open data set one could use for this purpose?
Thanks.

On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <tburt...@umich.edu> wrote:

> Hello,
>
> Searching in real time sounds difficult with that amount of data. With
> large documents, 3 million documents, and 5 TB of data, the index will be
> very large. With indexes that large, your performance will probably be
> I/O bound.
>
> Do you plan on allowing phrase or proximity searches? If so, your
> performance will be even more I/O bound, as documents that large will have
> huge positions indexes that will need to be read into memory for
> processing phrase queries. To reduce I/O you need to keep as much of the
> index as possible in memory (Lucene/Solr caches and the operating system
> disk cache). Every commit invalidates the Solr/Lucene caches (unless the
> newer NRT code has solved this for Solr).
>
> If you index and serve on the same server, you are also going to get
> terrible response times whenever your commits trigger a large merge.
>
> If you need to service 10-100 qps or more, you may need to look at putting
> your index on SSDs or spreading it over enough machines so it can stay in
> memory.
>
> What kind of response times are you looking for, and what query rate?
>
> We have somewhat smaller documents. We have 10 million documents and about
> 6-8 TB of data in HathiTrust and have spread the index over 12 shards on 4
> machines (i.e. 3 shards per machine). We get an average of around
> 200-300 ms response time, but our 95th percentile times are about 800 ms
> and 99th percentile times are around 2 seconds. This is with an average
> load of less than 1 query/second.
>
> As Otis suggested, you may want to implement a strategy that allows users
> to search within the large documents by breaking the documents up into
> smaller units. What we do is have two Solr indexes. The first indexes
> complete documents. When the user clicks on a result, we index the entire
> document at the page level in a small Solr index on the fly. That way they
> can search within the document and get page-level results.
>
> More details about our setup:
> http://www.hathitrust.org/blogs/large-scale-search
>
> Tom Burton-West
> University of Michigan Library
> www.hathitrust.org
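To make the sharded setup Tom describes a bit more concrete, here is a rough sketch of a distributed query using Solr's standard "shards" request parameter. The host names, core name, and field layout below are placeholders I made up for illustration, not the HathiTrust configuration.

# Rough sketch: one query fanned out over several Solr shards.
# Host names and the core name "docs" are placeholders.
import requests

# One core per shard; any of them can act as the aggregating node.
SHARDS = [
    "shard1.example.com:8983/solr/docs",
    "shard2.example.com:8983/solr/docs",
    "shard3.example.com:8983/solr/docs",
]

def distributed_search(query, rows=10):
    """Send one query; the receiving shard fans it out and merges results."""
    params = {
        "q": query,
        "rows": rows,
        "shards": ",".join(SHARDS),  # which shards to aggregate over
        "wt": "json",
    }
    resp = requests.get(f"http://{SHARDS[0]}/select", params=params)
    resp.raise_for_status()
    return resp.json()["response"]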
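And here is a minimal sketch of the "search within a document" idea: when a user clicks a result, index that one document page by page into a small secondary Solr core, then query it with a filter on the document id. This assumes a recent Solr accepting JSON updates; the core name "pages" and the fields doc_id, page_no, and page_text are hypothetical.

# Rough sketch of on-the-fly page-level indexing in a small secondary core.
# Core URL and field names are placeholders, not the HathiTrust schema.
import requests

SOLR_PAGES = "http://localhost:8983/solr/pages"  # assumed page-level core

def index_document_pages(doc_id, pages):
    """Index each page of one document as its own Solr document."""
    docs = [
        {"id": f"{doc_id}-{n}", "doc_id": doc_id, "page_no": n, "page_text": text}
        for n, text in enumerate(pages, start=1)
    ]
    # commit=true so the pages are searchable right after the user clicks
    resp = requests.post(
        f"{SOLR_PAGES}/update?commit=true",
        json=docs,
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

def search_within_document(doc_id, query):
    """Run the user's query against only this document's pages."""
    params = {
        "q": f"page_text:({query})",
        "fq": f"doc_id:{doc_id}",  # restrict to the clicked document
        "fl": "page_no,score",
        "rows": 20,
        "wt": "json",
    }
    resp = requests.get(f"{SOLR_PAGES}/select", params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

Keeping the page-level core small and per-document keeps those queries fast even when the main whole-document index is huge.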