On Thu, Oct 21, 2004 at 09:37:15PM +0200, Andrzej Bialecki wrote:
> scott cotton wrote:
>
> >Some general thoughts:
> >
> >I agree that a distributed fs w/ large files doesn't necessarily help scale
> >search.
> >
> >One thing to think about would be using consistent hashing on documentids
> >or domains.  This would allow distributing the indexes in a disjoint
> >manner, possibly allow for replication of data if the underlying fs
> >doesn't support that, and would allow for a means to add/remove index
> >nodes with minimal overhead.  (Consistent hashing is just hashing in which
> >the addition/removal of hash buckets results only in copying of data
> >between "adjacent buckets").
>
> Could you perhaps provide some pointers for more info on this?
> (algorithm descriptions, reference implementations)?

I think one of the original papers on consistent hashing is
http://citeseer.ist.psu.edu/karger97consistent.html (see section 4).
There's also the UBI web crawler, which uses consistent hashing and has
a GPL consistent hashing class available:
http://ubi.imc.pi.cnr.it/projects/ubicrawler/

Basically, if you take a hash value v and map it to the unit interval
(say with a function m), and pin your buckets to locations in the unit
interval, then you select a bucket for a hash value by finding the bucket
whose location is closest to m(v).  When you remove a bucket, its contents
are sent to the closest of the adjacent buckets, and when you add a bucket,
it receives some contents from the adjacent buckets in the same manner.

For inverted indices or webdb-like data, the buckets could be servers and
the hashed values could be document ids, url hash values, or some such.
Making the data redundant can be achieved by associating successor lists
with each bucket, but that involves more data exchange between buckets
each time a bucket is added or removed.  More details are available in the
pointers above and their references.

[...]

> >PS On this note, at MindOwl we're working on an open source generic
> >distributed storage mechanism *for search data* (indexes, linkdb,
> >document db, etc).  This is not yet ready but it's also not so far from
> >a release, it's part of the project at mosl.mindowl.com, in case anyone
> >is interested.
>
> That would be interesting, if not for the license (GPL). I understand
> the motivation, but as such this project is not usable with Nutch...

I would like to give it an Apache-style or recent BSD-style license, but
unfortunately can't for the foreseeable future.  In the meantime, maybe
Nutch can achieve a better distributed set-up by using consistent hashing
to distribute the webdb and/or indexes.  I'd be happy to help with this
some (coding, testing, ...), but I believe it is probably best considered
by a core developer.
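
To make the closest-bucket scheme described above a bit more concrete,
here is a minimal sketch in Python.  It is only an illustration, not the
UbiCrawler class and not the exact construction from the paper.  The
names (m, UnitIntervalBuckets, node-a, ...) are made up for the example,
SHA-1 is an arbitrary choice for the mapping m, and the interval is
treated as a plain line segment with no wrap-around.

```python
import bisect
import hashlib


def m(key: str) -> float:
    """Map a key to a point in the unit interval (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class UnitIntervalBuckets:
    """Hypothetical helper: buckets pinned to points in the unit interval."""

    def __init__(self, buckets=()):
        self._points = []  # sorted bucket locations
        self._names = {}   # location -> bucket name
        for bucket in buckets:
            self.add(bucket)

    def add(self, bucket: str) -> None:
        point = m(bucket)
        if point not in self._names:
            bisect.insort(self._points, point)
        self._names[point] = bucket

    def remove(self, bucket: str) -> None:
        point = m(bucket)
        if self._names.get(point) != bucket:
            raise KeyError(bucket)
        self._points.remove(point)
        del self._names[point]

    def lookup(self, key: str) -> str:
        """Return the bucket whose location is closest to m(key)."""
        if not self._points:
            raise LookupError("no buckets")
        x = m(key)
        i = bisect.bisect_left(self._points, x)
        # Only the neighbours on either side of x can be the closest.
        candidates = self._points[max(i - 1, 0):i + 1]
        return self._names[min(candidates, key=lambda p: abs(p - x))]


# Usage: removing a bucket only moves the keys that lived in that bucket.
buckets = UnitIntervalBuckets(["node-a", "node-b", "node-c", "node-d"])
urls = [f"http://example.com/doc/{i}" for i in range(1000)]
before = {u: buckets.lookup(u) for u in urls}
buckets.remove("node-b")
moved = [u for u in urls if buckets.lookup(u) != before[u]]
print(len(moved), "of", len(urls), "keys moved")
```

With servers as buckets and url hashes as keys, a lookup like this would
decide which server holds a given document.  Redundancy via successor
lists is left out of the sketch.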

Best,
scott

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
