Some general thoughts: I agree that a distributed FS with large files doesn't necessarily help scale search.
One thing to think about would be using consistent hashing on document IDs or domains. This would distribute the indexes in a disjoint manner, possibly allow for replication of the data if the underlying FS doesn't support it, and provide a means to add/remove index nodes with minimal overhead. (Consistent hashing is just hashing in which the addition/removal of hash buckets results only in copying of data between "adjacent" buckets.)

Disjointness of distributed indices is not always good if it's based on a hash function across all documents: sometimes it might be useful to keep all the most volatile documents in one small index and the more static documents in a bigger one, for example. Still, consistent hashing seems like a good start for handling efficient distribution of indices - see the sketch below.
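To make the idea concrete, here is a toy sketch in Java. The class and every name in it are mine, purely for illustration - nothing here is Nutch code. Each node is hashed onto the ring at several "virtual" points, and a document ID belongs to the first node clockwise from its own hash, so adding or removing a node only moves keys to/from its ring neighbors:

  import java.security.MessageDigest;
  import java.util.SortedMap;
  import java.util.TreeMap;

  /** Toy consistent-hash ring mapping document IDs (or domains) to index nodes. */
  public class ConsistentHashRing {

    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();
    private final int points; // virtual points per node, to smooth the distribution

    public ConsistentHashRing(int points) {
      this.points = points;
    }

    public void addNode(String node) {
      for (int i = 0; i < points; i++) {
        ring.put(hash(node + "#" + i), node);
      }
    }

    public void removeNode(String node) {
      for (int i = 0; i < points; i++) {
        ring.remove(hash(node + "#" + i));
      }
    }

    /** A key belongs to the first node at or after its hash, wrapping around. */
    public String nodeFor(String docId) {
      if (ring.isEmpty()) throw new IllegalStateException("no index nodes");
      SortedMap<Long, String> tail = ring.tailMap(hash(docId));
      Long point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
      return ring.get(point);
    }

    /** Fold the first 8 bytes of an MD5 digest into a long. */
    private static long hash(String key) {
      try {
        byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        long h = 0L;
        for (int i = 0; i < 8; i++) {
          h = (h << 8) | (d[i] & 0xffL);
        }
        return h;
      } catch (Exception e) { // MD5 and UTF-8 are required on every JVM
        throw new RuntimeException(e);
      }
    }
  }

Picking the index node for a document is then just ring.nodeFor(docId), and rebalancing after addNode()/removeNode() touches only the keys of the neighboring nodes.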
Best Regards,
scott

PS On this note, at MindOwl we're working on an open-source generic distributed storage mechanism *for search data* (indexes, linkdb, document db, etc.). It's not yet ready, but it's also not far from a release; it's part of the project at mosl.mindowl.com, in case anyone is interested.

On Thu, Oct 21, 2004 at 05:30:43PM +0200, Andrzej Bialecki wrote:
> Hi folks,
>
> First of all, a small clarification about the purpose of NDFS. I had some
> off-the-list conversation with Michael C., after which I came to the
> following conclusions:
>
> * currently NDFS is just a simple version of a distributed filesystem (as
>   the javadoc comments say :)). As such, it offers no special support for
>   Nutch or any other search-specific task. It's just a cheap pure-Java
>   version of Coda or any other distributed FS.
>
> * the primary goal of NDFS is to provide an easy way for people to handle
>   big amounts of data in the WebDB and segments, but only when doing
>   operations like DB update, analysis, fetchlist generation, and fetching.
>
> * NDFS is _NOT_ suitable for distributing the search indexes, and running
>   search servers on indexes that are put on NDFS will kill the
>   performance. Searching requires fast local access to the index files.
>
> So, currently NDFS helps you only as distributed storage for segment
> data; it does not address the need for efficient and redundant deployment
> of search indexes over a group of search servers (each of them running
> DistributedSearch$Server, DS$Server for short).
>
> In other words, you may want to have two separate groups of boxes: one
> group working on NDFS for DB + fetching operations (storage boxes), and
> the other group running DS$Servers (search boxes).
>
> For high-performance operation one would always want a setup with
> multiple DS$Servers. Currently there is no straightforward way to ensure
> redundancy in a DS$Server group - if you lose one of the boxes, a part of
> your index goes offline until you bring in another box and put the
> missing segment + index on it. Also, it takes too much effort and manual
> labor to deploy the segments/indexes to the search nodes.
>
> I think we need an efficient and automatic way to do this, and also to
> ensure redundancy in a set of search boxes - but in such a way that the
> indexes are complete and usable as local FSes for the DS$Servers (as
> opposed to how NDFS currently works, since it operates on fixed-size
> blocks of bytes).
>
> My idea is to operate on units that make sense to a DS$Server - which
> means these units must be parts of segment data. Assuming we have the
> tools to cut & paste fragments of segment data as we wish (which we have,
> they just need to be wrapped in command-line tools), I propose the
> following scenario:
>
> 1. we handle the DB updates/fetching/parsing as we originally would,
>    perhaps using a block-based NDFS for storage, or a SAN, or some such.
>
> 2. I guess we should dedup the data before distributing it, otherwise it
>    would be more difficult - but it would be nice to be able to do it in
>    step 4...
>
> 3. then we deploy the segment data to search nodes in the following
>    steps:
>    - slice the segment data into units of fixed size (determined from
>      config, or obtained via IPC from individual search nodes). The
>      slicing could be done just linearly, or in a smarter way (e.g. by
>      creating slices sorted by MD5 hash)
>    - send each slice to 1 or more DS$Servers (applying some redundancy
>      algorithm)
>    - after the segment is deployed, send a command to a selected search
>      node to "mount" the slice
>    - make a note of the segment locations (in a similar fashion to the
>      NDFS$NameNode), and of which one is active at this moment
>
> 4. now I guess there is some work to do on the search server. The newly
>    received slice needs to be indexed and de-duplicated against the
>    already existing older slices on the server. It would be nice to have
>    some method to do this across the whole cluster of search servers
>    before the slices are sent out, but if not, the global de-duplication
>    must take place in step 2.
>
> 5. the selected search server "mounts" the newly received and indexed
>    slice and makes it available for searching. Optionally, the new slice
>    can be merged into a single segment with the other already existing
>    slices.
>
> Much of the logic from NDFS can be reused for selecting the "active"
> slice, checking the heartbeat, and so on.
>
> Now, if one of the search servers goes down (as detected by heartbeat
> messages), the "name node" sends messages to the other search nodes that
> hold replicas of the segments from the failed box. The missing segment is
> back online. The "name node" also notes that there are now too few
> replicas of this segment and initiates a transfer of the segment to one
> of the other search boxes (again, the same logic already exists in NDFS).
>
> Additional steps are also needed to "populate" newly added blank boxes
> (e.g. when you replace a failed box, or when you want to increase the
> total number of search nodes), and this logic is also already present in
> NDFS.
>
> Any comments or suggestions are highly appreciated...
>
> --
> Best regards,
> Andrzej Bialecki
>
> -------------------------------------------------
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -------------------------------------------------
> FreeBSD developer (http://www.freebsd.org)
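Re: the heartbeat / "name node" part of the proposal above - here is roughly how I picture the bookkeeping on the name-node side. Again a sketch under my own (hypothetical) naming, not NDFS or Nutch code: track which search servers hold which slice replicas, notice when a server's heartbeats stop, fail the affected slices over to a surviving replica, and schedule re-replication the way NDFS does for under-replicated blocks:

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  /** Sketch of a "name node" for index slices. All names are hypothetical. */
  public class SliceNameNode {

    private static final long HEARTBEAT_TIMEOUT_MS = 30000L;
    private static final int TARGET_REPLICAS = 2;

    // slice id -> search servers holding a replica of that slice
    private final Map<String, Set<String>> replicas = new HashMap<String, Set<String>>();
    // slice id -> the server currently "mounting" (serving) that slice
    private final Map<String, String> active = new HashMap<String, String>();
    // server -> timestamp of its last heartbeat
    private final Map<String, Long> lastHeartbeat = new HashMap<String, Long>();

    public synchronized void heartbeat(String server) {
      lastHeartbeat.put(server, System.currentTimeMillis());
    }

    public synchronized void addReplica(String slice, String server) {
      Set<String> holders = replicas.get(slice);
      if (holders == null) {
        holders = new HashSet<String>();
        replicas.put(slice, holders);
      }
      holders.add(server);
      if (!active.containsKey(slice)) {
        active.put(slice, server); // the first replica starts out as the active one
      }
    }

    /** Called periodically: fail slices over when their active server goes silent. */
    public synchronized void checkLiveness() {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, String> entry : active.entrySet()) {
        String slice = entry.getKey();
        String server = entry.getValue();
        Long seen = lastHeartbeat.get(server);
        if (seen != null && now - seen <= HEARTBEAT_TIMEOUT_MS) {
          continue; // active server is alive, nothing to do
        }
        Set<String> holders = replicas.get(slice);
        if (holders == null) continue;
        holders.remove(server); // that replica is gone for good
        for (String candidate : holders) {
          entry.setValue(candidate); // i.e. tell candidate to "mount" the slice
          break;
        }
        if (holders.size() < TARGET_REPLICAS) {
          // same move NDFS makes for under-replicated blocks
          scheduleReplication(slice, TARGET_REPLICAS - holders.size());
        }
      }
    }

    private void scheduleReplication(String slice, int count) {
      // left out: pick live, under-loaded servers and start copying the slice
    }
  }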
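And tying the two threads together: the "redundancy algo" in step 3 of Andrzej's scenario could simply pick, for each slice, the next k distinct nodes walking clockwise around the ring, so a failed node's slices shift only to its ring neighbors. A method like this would do it - again hypothetical, and meant to live inside the ConsistentHashRing sketch above, since it uses its ring and hash members (needs java.util imports for List, ArrayList, Set, LinkedHashSet):

  /** Walk clockwise from the slice's hash and collect the first k distinct
   *  nodes; those hold the slice's replicas. */
  public List<String> replicasFor(String sliceId, int k) {
    if (ring.isEmpty()) throw new IllegalStateException("no index nodes");
    Set<String> picked = new LinkedHashSet<String>();
    for (String node : ring.tailMap(hash(sliceId)).values()) {
      if (picked.size() == k) break;
      picked.add(node);
    }
    for (String node : ring.values()) { // wrap around past the end of the ring
      if (picked.size() == k) break;
      picked.add(node);
    }
    return new ArrayList<String>(picked);
  }

If k exceeds the number of live nodes you just get all of them, which seems like the sane fallback.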
