mike anderson [saidthero...@gmail.com] wrote:
> That's a great point. If SSDs are sufficient, then what does the "Index size
> vs Response time" curve look like? Since that would dictate the number
> of machines needed. I took a look at
> http://wiki.apache.org/solr/SolrPerformanceData but only one use case
> seemed comparable.
I generally find it very hard to compare across setups. Looking at SolrPerformanceData, for example, we see that CNET Shopper has a very poor response-time/size ratio, while HathiTrust's is a lot better. This is not too surprising, as CNET seems to use quite advanced searching where HathiTrust's is simpler, but it does illustrate that comparisons are not easy. However, as long as I/O has been identified as the main bottleneck for a given setup, relative gains from different storage back ends should be fairly comparable across setups.

We did some work on storage testing with Lucene two years ago (see the I-wish-I-had-the-time-to-update-this page at http://wiki.statsbiblioteket.dk/summa/Hardware), but unfortunately we did very little testing on scaling over index size.

... I just dug out some old measurements that say a little bit: We tried changing the size of our index (by deleting every X'th document and optimizing) and performing 350K queries, extracting 2 or 3 fairly small fields from the first 20 hits of each. The machine was capped at 4GB of RAM. I am fairly certain the searcher was single-threaded and that no web services were involved, so this is very raw Lucene speed:

 4GB index: 626 queries/second
 9GB index: 405 queries/second
17GB index: 205 queries/second
26GB index: 188 queries/second

Not a lot of measurement points, and I wish I had data for larger index sizes, as it seems the curve flattens quite drastically at the end. Graph at http://www.mathcracker.com/scatterplotimage.php?datax=4,9,17,26&datay=626,405,205,188&namex=Index%20size%20in%20GB&namey=queries/second&titl=SSD%20scaling%20performance%20with%20Lucene

> We currently have about 25M docs, split into 18 shards, with a
> total index size of about 120GB. If index size has truly little
> impact on performance then perhaps tagging articles with user
> IDs is a better way to approach my use case.
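For what it's worth, the flattening in those measurements is easier to see as ratios than as raw numbers. A throwaway Python sketch over the four data points quoted above (nothing here beyond the numbers from this mail, just arithmetic):

```python
# The four (index size, throughput) measurements quoted above.
sizes_gb = [4, 9, 17, 26]
qps = [626, 405, 205, 188]

# For each growth step: how much the index grew vs. how much throughput dropped.
for (s1, q1), (s2, q2) in zip(zip(sizes_gb, qps), zip(sizes_gb[1:], qps[1:])):
    print(f"{s1}GB -> {s2}GB: index grew x{s2 / s1:.2f}, throughput dropped x{q1 / q2:.2f}")
```

The last step is where the curve flattens: the index grows by roughly half, but throughput only drops by about 9%, whereas the earlier steps dropped throughput nearly in proportion to (or faster than) the size increase.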
I don't know your budget, but do consider buying a single 160GB Intel X25-M or one of the new 256GB SandForce-based SSDs for testing. If it does not deliver what you hoped for, you'll be happy to put it in your workstation.

It would be nice if there were some sort of corpus generator that produced Zipfian-distributed data and sample queries, so that we could do large-scale testing on different hardware without having to share sample data.

Regards,
Toke Eskildsen
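PS: such a generator could start out very small. A sketch of the idea, stdlib only — the vocabulary size, document length, and seed are all made up for illustration. The rank-k term gets sampling weight 1/k, which is the classic Zipfian distribution, and a fixed seed means everyone regenerates the same corpus:

```python
import random

def zipf_weights(vocab_size, exponent=1.0):
    # The rank-k term gets weight 1/k^exponent (classic Zipf).
    return [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]

def make_document(vocab, weights, length, rng):
    # Sample `length` terms with Zipfian probabilities and join them.
    return " ".join(rng.choices(vocab, weights=weights, k=length))

rng = random.Random(42)                      # fixed seed: reproducible everywhere
vocab = [f"term{i}" for i in range(10_000)]  # hypothetical synthetic vocabulary
weights = zipf_weights(len(vocab))

docs = [make_document(vocab, weights, length=100, rng=rng) for _ in range(1000)]

# Sample queries could be drawn from the same distribution.
queries = [" ".join(rng.choices(vocab, weights=weights, k=2)) for _ in range(10)]
```

Everyone runs the same script with the same seed and gets identical data, so hardware comparisons would need no shared sample data at all.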