I think it would be nice to have a few cluster strategies on the wiki. It seems there are at least three separate needs: CPU, storage and bandwidth, and I think the more those could be cleanly spread to different boxes, the better.
I guess I am imagining a breakdown that lists, by priority, how things should be split out. Someone could look at the list and say: ok, I have three good boxes, so the best box should do x, the second best should do y, etc. There could also be case studies showing how different folks did their own implementations and what their crawl/query times were like.

I have a small cluster (up to 15 boxes) and would like to start playing around to see how things go under different strategies. I also have about a million pages of local content, so I can hammer things pretty hard without even leaving my network. I know that may not match normal conditions, but it should remove a variable or two (network latency, slow sites) and keep things simple, at least to start. I think it is also a decent goal to be able to crawl/index my pages in a night (say eight hours), which would be around 35 pages/second (1,000,000 pages / 28,800 seconds is about 34.7). If that isn't a reasonable goal, I would like to hear why not.

For each strategy, we could have a set of confs describing how to set things up. I can picture a GUI that lists box roles (crawler, mapper, whatever) and the boxes available. Users could drag and drop their boxes onto roles, and the confs could then be generated. I think that would make it rather easy to design and implement clusters that could otherwise get rather complicated. I can do drag/drop and interpolate into templates in JavaScript, so I could envision a rather simple page.

Maybe we could even store the cluster setup in XML, and have a script that takes the XML and draws the cluster. Then when people report slowness or the like, they could also post their cluster setup.

I think when users come to Nutch, they come with a set of boxes. It would be nice for them to see what has worked for such a set of boxes in the past and to be able to easily implement such a strategy. Kind of a "one hour from download to spidering" vision.

Just a few thoughts.

Earl
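
P.S. A rough sketch of the XML-to-confs idea, in Python. The XML layout, the role names, the hostnames and the conf template are all made up for illustration; none of it is an existing Nutch format. It only shows the shape of the thing: read the box/role assignments, group boxes by role, and interpolate each group into a template.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from string import Template

# Hypothetical cluster description. The element and attribute names are
# invented for this sketch and are not a Nutch format.
CLUSTER_XML = """
<cluster name="lab">
  <box host="box01" role="crawler"/>
  <box host="box02" role="crawler"/>
  <box host="box03" role="indexer"/>
  <box host="box04" role="searcher"/>
</cluster>
"""

# Hypothetical conf template, rendered once per role.
CONF_TEMPLATE = Template(
    "# cluster: $cluster\n"
    "# role: $role\n"
    "hosts=$hosts\n"
)


def load_roles(xml_text):
    """Return (cluster_name, {role: [host, ...]}) from the XML description."""
    root = ET.fromstring(xml_text)
    roles = defaultdict(list)
    for box in root.findall("box"):
        roles[box.get("role")].append(box.get("host"))
    return root.get("name"), dict(roles)


def render_confs(xml_text):
    """Return {role: conf_text}, one generated conf per role."""
    cluster, roles = load_roles(xml_text)
    return {
        role: CONF_TEMPLATE.substitute(
            cluster=cluster, role=role, hosts=",".join(hosts)
        )
        for role, hosts in sorted(roles.items())
    }


if __name__ == "__main__":
    for role, conf in render_confs(CLUSTER_XML).items():
        print(conf)
```

The same XML could feed both the conf generator and a script that draws the cluster, so a posted setup would be enough to reproduce someone's layout.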