Re: [Nutch-dev] NDFS, DistributedSearch - redundant deployment proposal

scott cotton Thu, 21 Oct 2004 14:08:12 -0700

On Thu, Oct 21, 2004 at 09:37:15PM +0200, Andrzej Bialecki wrote:
> scott cotton wrote:
> 
> >Some general thoughts:
> >
> >I agree that a distributed fs w/ large files doesn't necessarily help scale
> >search. 
> >
> >One thing to think about would be using consistent hashing on documentids
> >or domains.  This would allow distributing the indexes in a disjoint
> >manner, possibly allow for replication of data if the underlying fs
> >doesn't support that, and would allow for a means to add/remove index
> >nodes with minimal overhead.  (Consistent hashing is just hashing in which
> >the addition/removal of hash buckets results only in copying of data
> >between "adjacent buckets").
> >
> 
> Could you perhaps provide some pointers for more info on this? 
> (algorithm descriptions, reference implementations)?


I think the original paper (or one of them) on consistent hashing is
http://citeseer.ist.psu.edu/karger97consistent.html (section 4).

There's also UbiCrawler, a web crawler which uses consistent hashing and
has a GPL consistent hashing class available:
http://ubi.imc.pi.cnr.it/projects/ubicrawler/

Basically, if you take a hash value v and map it to the unit interval
(say with a function m), and pin your buckets to locations in the unit
interval, then you select a bucket for a hash value by finding the closest
bucket to m(v). When you remove a bucket, its contents are sent to the
closest of the adjacent buckets, and when you add a bucket, it receives some
contents from the adjacent buckets in the same manner.
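
To make that concrete, here is a rough Python sketch of the closest-bucket
scheme. It is only an illustration under my own assumptions: SHA-1 as the
hash behind m, bucket names hashed to get their pinned locations, and the
unit interval treated as a circle so that 0.99 and 0.01 count as close. The
names Ring, add, remove and lookup are made up for the example and are not
taken from the paper or from UbiCrawler.

```python
import hashlib


def m(value: str) -> float:
    """Map a string to the unit interval (SHA-1 is an assumed hash choice)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def circular_distance(a: float, b: float) -> float:
    """Distance on the unit interval with wrap-around (an assumption)."""
    d = abs(a - b)
    return min(d, 1.0 - d)


class Ring:
    """Hypothetical helper: buckets pinned to points of the unit interval."""

    def __init__(self, buckets=()):
        self.positions = {}
        for bucket in buckets:
            self.add(bucket)

    def add(self, bucket: str) -> None:
        # Pin the bucket at the location given by hashing its name.
        self.positions[bucket] = m(bucket)

    def remove(self, bucket: str) -> None:
        del self.positions[bucket]

    def lookup(self, key: str) -> str:
        # Select the bucket closest to m(key); ties are broken by name.
        point = m(key)
        return min(
            self.positions,
            key=lambda b: (circular_distance(self.positions[b], point), b),
        )


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    urls = ["http://example.com/page%d" % i for i in range(1000)]
    before = {u: ring.lookup(u) for u in urls}
    ring.remove("node-c")
    moved = [u for u in urls if ring.lookup(u) != before[u]]
    # Only keys that lived on the removed bucket should have moved.
    print(len(moved), all(before[u] == "node-c" for u in moved))
```

The point of the exercise is the last two lines: removing a bucket only
relocates the keys that bucket held, and everything else stays put.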

For inverted indices or webdb-like data, the buckets could be servers
and the hashed values could be document ids, url hash values, or some such.
Making the data redundant can be achieved by associating successor lists with
each bucket, but this involves more data exchange between buckets each time a
bucket is added or removed.  More details are available in the pointers above
and their references.
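
As a rough idea of what successor lists could look like, here is a small
Python sketch. It rests on my own assumptions rather than on anything in the
references: SHA-1 behind m, bucket names hashed to get their locations, the
unit interval treated as a circle, and a bucket's successor list taken to be
the next r buckets going around that circle. The function names are
hypothetical.

```python
import hashlib


def m(value: str) -> float:
    """Map a string to the unit interval (SHA-1 is an assumed hash choice)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def primary(buckets, key: str) -> str:
    """Bucket whose location is closest to m(key), with wrap-around."""
    point = m(key)

    def distance(bucket: str) -> float:
        d = abs(m(bucket) - point)
        return min(d, 1.0 - d)

    return min(buckets, key=lambda b: (distance(b), b))


def successor_list(buckets, bucket: str, r: int):
    """The (at most) r buckets that follow `bucket` around the circle."""
    ordered = sorted(buckets, key=lambda b: (m(b), b))
    i = ordered.index(bucket)
    count = min(r, len(ordered) - 1)  # never list the bucket itself
    return [ordered[(i + k) % len(ordered)] for k in range(1, count + 1)]


def replicas(buckets, key: str, r: int):
    """Primary bucket for the key followed by that bucket's successors."""
    p = primary(buckets, key)
    return [p] + successor_list(buckets, p, r)


if __name__ == "__main__":
    servers = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    print(replicas(servers, "http://example.com/", 2))
```

With r successors per bucket each key lives on r + 1 servers, which is where
the extra data exchange comes from: adding or removing a bucket changes the
successor lists of its neighbours as well as its own contents.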

[...]

> >PS On this note, at MindOwl we're working on an open source generic
> >distributed storage mechanism *for search data* (indexes, linkdb,
> >document db, etc). This is not yet ready but it's also not so far from a
> >release, it's part of the project at mosl.mindowl.com, in case anyone is
> >interested.
> 
> That would be interesting, if not for the license (GPL). I understand 
> the motivation, but as such this project is not usable with Nutch...

I would like to give it an Apache-style or recent BSD-style license
but unfortunately can't for the foreseeable future.  In the meantime, maybe
Nutch can achieve a better distributed set-up by using consistent hashing
to distribute the webdb and/or indexes.  I'd be happy to help with this some
(coding, testing, ...), but believe it is probably best considered by a core
developer.

Best,

scott



> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> -------------------------------------------------
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -------------------------------------------------
> FreeBSD developer (http://www.freebsd.org)
> 
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
> Use IT products in your business? Tell us what you think of them. Give us
> Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
> http://productguide.itmanagersjournal.com/guidepromo.tmpl
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers


