I think it would be nice to have a few cluster
strategies on the wiki.

It seems there are at least three separate needs: CPU,
storage and bandwidth, and I think the more cleanly those
could be spread across different boxes, the better.

I guess I am imagining a breakdown that lists, by
priority, how things should be broken out.  So someone
could look at the list and say, ok, I have three good
boxes, I should make the best box do x, the second
best do y, etc.  There could also be case studies for
how different folks did their own implementations and
what their crawl/query times were like.

I have a small cluster (up to 15 boxes) and would like
to start to play around and see how things go under
different strategies.  I also have about a million
pages of local content, so I can hammer things pretty
hard without even leaving my network.  I know that may
not match normal conditions, but it could hopefully
remove a variable or two (network latency, slow
sites), to keep things simple at least to start.

I think it is also a decent goal to be able to
crawl/index my pages in a night (say eight hours),
which would be around 35 pages/second.  If that isn't
a reasonable goal, I would like to hear why not.
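
As a quick check of that figure, here is a minimal sketch of the arithmetic. The function name is just for illustration, and the even split across 15 boxes is an assumption (in practice not every box would be crawling).

```python
def required_rate(pages, hours):
    """Pages per second needed to process `pages` within `hours`."""
    return pages / (hours * 3600)

# One million local pages in an eight-hour night.
rate = required_rate(1_000_000, 8)
print(f"cluster-wide: {rate:.1f} pages/second")

# Assumption: the load is spread evenly over all 15 boxes.
print(f"per box:      {rate / 15:.2f} pages/second")
```

So "around 35 pages/second" holds for the cluster as a whole, and the per-box rate is much lower if the work really can be spread out.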

For each strategy, we could have a set of confs
describing how to set things up.  I can picture a GUI
which could list box roles (crawler, mapper, whatever)
and boxes available.  The users could drag and drop
their boxes to roles, and confs could then be
generated.  I think that could make it fairly easy to
design and implement clusters that would otherwise get
rather complicated.  I can do drag/drop and
interpolate into templates in JavaScript, so I could
envision a rather simple page.
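
To make the "boxes to roles to confs" idea concrete, here is a rough sketch of the non-GUI part. Everything in it is hypothetical: the role names, the "score" used to rank boxes, and the template text are placeholders, not a real Nutch conf format.

```python
from string import Template

# Hypothetical template; NOT a real Nutch configuration format.
CONF_TEMPLATE = Template("# generated conf for $host\nrole = $role\n")

def assign_roles(boxes, roles_by_priority):
    """Give the highest-priority role to the best box, and so on down."""
    ranked = sorted(boxes, key=lambda b: b["score"], reverse=True)
    return {box["host"]: role for box, role in zip(ranked, roles_by_priority)}

def render_confs(assignment):
    """Interpolate each host/role pair into the conf template."""
    return {
        host: CONF_TEMPLATE.substitute(host=host, role=role)
        for host, role in assignment.items()
    }

# Made-up boxes; "score" stands in for however boxes get ranked.
boxes = [
    {"host": "box-a", "score": 2},
    {"host": "box-b", "score": 9},
    {"host": "box-c", "score": 5},
]
assignment = assign_roles(boxes, ["crawler", "mapper"])
for host, conf in render_confs(assignment).items():
    print(conf)
```

A drag-and-drop page would only replace the assign_roles step; the template interpolation would stay the same.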

Maybe we could even store the cluster setup in xml,
and have a script that takes the xml and draws the
cluster.  Then when people report slowness or the
like, they could also post their cluster setup.
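
A minimal sketch of what that could look like, assuming a made-up layout with one <box host="..." role="..."/> element per machine (the element and attribute names are my own, nothing Nutch defines). The "drawing" here is just a text listing by role.

```python
import xml.etree.ElementTree as ET

def cluster_to_xml(assignment):
    """Serialize a {host: role} mapping to a small XML document."""
    root = ET.Element("cluster")
    for host, role in sorted(assignment.items()):
        ET.SubElement(root, "box", host=host, role=role)
    return ET.tostring(root, encoding="unicode")

def draw_cluster(xml_text):
    """Return a plain-text picture of the cluster, one line per role."""
    root = ET.fromstring(xml_text)
    by_role = {}
    for box in root.findall("box"):
        by_role.setdefault(box.get("role"), []).append(box.get("host"))
    lines = []
    for role in sorted(by_role):
        lines.append(f"[{role}] " + ", ".join(sorted(by_role[role])))
    return "\n".join(lines)

# Hypothetical three-box setup.
xml_text = cluster_to_xml(
    {"box-a": "mapper", "box-b": "crawler", "box-c": "mapper"}
)
print(xml_text)
print(draw_cluster(xml_text))
```

Something that small would be easy to paste into a mail when reporting a problem.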

I think when users come to Nutch, they come with a set
of boxes.  I think it would be nice for them to see
what has worked for such a set of boxes in the past
and be able to easily implement such a strategy.  Kind
of a "one hour from download to spidering" vision.

Just a few thoughts.

