scott cotton wrote:
Some general thoughts:
I agree that a distributed fs w/ large files doesn't necessarily help scale
search.
One thing to think about would be using consistent hashing on document IDs
or domains. This would allow distributing the indexes in a disjoint manner,
possibly allow for replication of data if the underlying fs doesn't support
that, and would make it possible to add/remove index nodes with minimal
overhead. (Consistent hashing is just hashing in which the addition/removal
of hash buckets results only in copying of data between "adjacent buckets").
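As a rough illustration of the idea described above, here is a minimal sketch of a consistent-hash ring in Python. It is not code from Nutch or from any project mentioned in this thread. The names `HashRing`, `node_for` and `vnodes` are hypothetical, MD5 is an arbitrary choice of hash, and the "virtual nodes" trick (several ring points per node) is a common refinement that the description above does not mention.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping document IDs to index nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # ring points per node (assumed value, smooths load)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place the node's points on the ring; existing points are untouched."""
        for i in range(self.vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        """Drop the node's points; its keys fall through to the next point."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, doc_id: str) -> str:
        """Return the node owning the first ring point after the key's hash."""
        if not self._points:
            raise LookupError("ring has no nodes")
        i = bisect.bisect_right(self._points, ring_hash(doc_id))
        return self._owner[self._points[i % len(self._points)]]
```

With a ring like this, adding a fourth node to a three-node ring should only move documents onto the new node, and removing it again should send them back to where they were; the other assignments stay put. Hashing on a domain instead of a document ID is just a matter of what string is passed to `node_for`.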
Could you perhaps provide some pointers for more info on this (algorithm descriptions, reference implementations)?
Disjointness of distributed indices is not always good if it's based on a hash function across all documents: sometimes it might be useful to keep all the most volatile documents in one small index and the more static documents in a bigger index, for example. Still, consistent hashing seems like a good start for handling efficient distribution of indices.
Agreed - actually, in the installation I'm setting up we were planning to use a separate group of machines to handle rapidly changing indexes, but so far we're just using a bunch of scripts...
PS On this note, at MindOwl we're working on an open source generic distributed
storage mechanism *for search data* (indexes, linkdb, document db, etc). This
is not yet ready, but it's also not so far from a release. It's part of the project at mosl.mindowl.com, in case anyone is interested.
That would be interesting, if not for the license (GPL). I understand the motivation, but as such this project is not usable with Nutch...
--
Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
