> http://www.xfeedme.com/nucular/gut.py/go?FREETEXT=w
> (w for "web") we get 6294 entries which takes about 500ms on
> a cold index and about 150ms on a warm index.  This is on a very
> active shared hosting machine.

That's reasonable speed, but is that just to do the set intersections
and return the size of the result set, or does it retrieve the actual
result set?  It only showed 20 results on a page.  I notice that each
book in the result list has an ID number.  Say those are stored fields
in Nucular: how long does it take to add up all the ID numbers for the
results of that query?  I.e. the requirement is to actually access
every single record in order to compute the sum.  This is similar to
what happens with faceting.
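
To make the distinction concrete, here is a toy sketch in plain Python.
It does not use Nucular's API; the postings, records and function names
are all made up for illustration.  Counting only needs the posting
sets, while the sum has to open every matching record:

```python
# Toy model only: plain dicts and sets, NOT Nucular's actual API.
# postings maps a term to the set of matching record numbers;
# records maps a record number to its stored fields.
postings = {
    "web": {1, 2, 3, 5},
    "python": {2, 3, 4},
}
records = {
    1: {"id": 101},
    2: {"id": 102},
    3: {"id": 103},
    4: {"id": 104},
    5: {"id": 105},
}


def matching(terms):
    """Intersect the posting sets for all the query terms."""
    return set.intersection(*(postings[t] for t in terms))


def count_hits(terms):
    """Cheap case: only the postings are touched."""
    return len(matching(terms))


def sum_ids(terms):
    """Expensive case: every matching record is read for its stored field."""
    return sum(records[doc]["id"] for doc in matching(terms))
```

With an on-disk index the second function is the one that turns into a
seek per result, which is the cost I am asking about.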

> You are right that you might want to
> use more in-process memory for a really smart, multi-faceted relevance
> ordering or whatever, but you have to be willing to pay for it
> in terms of system resources, config/development time, etcetera.
> If you want cheap and easy, nucular might be good enough, afaik.

I used a cave-man approach with solr: an external process keeps the
indexes warm by simply reading something from each page a few times an
hour.  That is enough to pull 10k or so results a second from a query.
Without the warming, getting that many results takes over a minute.
I do think much better approaches are possible
and solr/lucene is by no means the be-all and end-all.  I don't know
if solr is using mmap or actual seek system calls underneath.
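
A minimal sketch of that kind of warmer, assuming the index is ordinary
files on disk and a 4096-byte page size (both are assumptions, and
warm_file is a made-up name, not anything from solr):

```python
import os

PAGE_SIZE = 4096  # assumed; the real page size depends on the system


def warm_file(path, page_size=PAGE_SIZE):
    """Read one byte from each page of the file so the OS caches it.

    Returns the number of pages touched.
    """
    touched = 0
    size = os.path.getsize(path)
    # buffering=0 so Python's own buffer does not hide the per-page reads
    with open(path, "rb", buffering=0) as f:
        for offset in range(0, size, page_size):
            f.seek(offset)
            f.read(1)
            touched += 1
    return touched
```

Run over every index file from cron a few times an hour, this should
keep the pages resident as long as the machine has memory to spare.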

> Regarding the 30 million number -- I bet google does
> estimations and culling of some kind (not really looking at all 10M).


> I'm not interested in really addressing the "google" size of data set
> at the moment.

Right, me neither, but a few tens of GB of indexes is not all that
large these days.

> >  http://www.newegg.com/Product/Product.aspx?Item=N82E16820147021
> holy rusty metal batman! way-cool!

Heh, check out the benchmark graphs:


