mike anderson [saidthero...@gmail.com] wrote:
> That's a great point. If SSDs are sufficient, then what does the "Index size
> vs Response time" curve look like? Since that would dictate the number
> of machines needed. I took a look at 
> http://wiki.apache.org/solr/SolrPerformanceData but only one use case
> seemed comparable.

I generally find it very hard to compare across setups. Looking at 
SolrPerformanceData for example, we see that CNET Shopper has a very poor 
response time/size ratio, while HathiTrust is a lot better. This is not too 
surprising as CNET seems to use quite advanced searching where HathiTrust's is 
simpler, but it does illustrate that comparisons are not easy.

However, as long as I/O has been identified as the main bottleneck for a given 
setup, relative gains from different storage back ends should be fairly 
comparable across setups. We did some work on storage testing with Lucene two 
years ago (see the I-wish-I-had-the-time-to-update-this page at 
http://wiki.statsbiblioteket.dk/summa/Hardware), but unfortunately we did very 
little testing on scaling over index size.

...

I just dug out some old measurements that say a little bit: We tried 
changing the size of our index (by deleting every Xth document and optimizing) 
and performing 350K queries with extraction of 2 or 3 fairly small fields for 
the first 20 hits from each. The machine was capped at 4GB of RAM. I am fairly 
certain the searcher was single threaded and there were no web services 
involved, so this is very raw Lucene speed:
4GB index: 626 queries/second
9GB index: 405 queries/second
17GB index: 205 queries/second
26GB index: 188 queries/second
Those are not a lot of measurement points, and I wish I had data for larger 
index sizes, as the curve seems to flatten quite drastically at the end. Graph at
http://www.mathcracker.com/scatterplotimage.php?datax=4,9,17,26&datay=626,405,205,188&namex=Index%20size%20in%20GB&namey=queries/second&titl=SSD%20scaling%20performance%20with%20Lucene
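To make the flattening concrete, here is a small sketch (Python, not part of the original test setup) that takes the four data points above and computes the relative throughput lost per extra GB between consecutive measurements:

```python
# Measurements from above: index size in GB -> queries/second.
data = {4: 626, 9: 405, 17: 205, 26: 188}

sizes = sorted(data)
for a, b in zip(sizes, sizes[1:]):
    # Fraction of throughput lost, normalized per GB of index growth.
    drop_per_gb = (data[a] - data[b]) / data[a] / (b - a)
    print(f"{a}GB -> {b}GB: {drop_per_gb:.1%} throughput lost per extra GB")
```

The per-GB penalty falls from about 7% between 4GB and 9GB to under 1% between 17GB and 26GB, which is what "flattening" means here: each additional GB of index hurts less and less.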

> We currently have about 25M docs, split into 18 shards, with a
> total index size of about 120GB. If index size has truly little
> impact on performance then perhaps tagging articles with user
> IDs is a better way to approach my use case.

I don't know your budget, but do consider buying a single 160GB Intel X25-M or 
one of the new 256GB SandForce-based SSDs for testing. If it does not deliver 
what you hoped for, you'll be happy to put it in your workstation.

It would be nice if there were some sort of corpus generator that generated 
Zipfian-distributed data and sample queries so that we could do large scale 
testing on different hardware without having to share sample data.

Regards,
Toke Eskildsen
