On 10/21/2004 11:30 AM, Andrzej Bialecki wrote:
Hi folks,

Hey,

Nice email.  Some good stuff to think about.


First of all, a small clarification about the purpose of NDFS. I had some off-the-list conversation with Michael C., after which I came to the following conclusions:


* currently NDFS is just a simple version of distributed filesystem (as
the javadoc comments say :)). As such, it offers no special support for
Nutch or any other search-specific task. It's just a cheap pure Java
version of Coda or any other distributed FS.

* the primary goal for NDFS is to provide an easy way for people to
handle big amounts of data in WebDB and segments, but only when doing
operations like DB update, analysis, fetchlist generation, and fetching.

I think you're right here for the most part. Wouldn't using NDFS for fetching be somewhat limiting in circumstances where transferring/writing the data saturates the network connection enough to slow down the fetching itself? I would think that this "limitation" is not really something to worry about now, beyond keeping it in the back of our minds as we plan how to distribute the search.


[snip]
I think we need an efficient and automatic way for this, and also to ensure redundancy in a set of search boxes - but in such a way that they are complete and usable as local FS-es for DS$Servers (as opposed to the way NDFS currently works, because it works on fixed-size blocks of bytes).

My idea is to operate on units that make sense to DS$Server - which means these units must be a part of segment data. Assuming we have the tools to cut&paste fragments of segment data as we wish (which we have, they just need to be wrapped in command-line tools), I propose the following scenario then:

Are you saying that we have tools to cut&paste fragments of indexed segments as we please? Meaning we can basically create index segments of any size we want and rearrange them as we want, within reason? Or are you talking about segments of fetched content (non-indexed)? Perhaps we need terms for discussing the various kinds of pieces to this whole puzzle :-) (Google calls the "slices" of its index that are spread across its search servers, multiple slices per server, "shards".)



1. we handle the DB updates/fetching/parsing as we originally would, perhaps using a block-based NDFS for storage, or a SAN, or somesuch.

2. I guess we should dedup the data before distributing it, otherwise
it would be more difficult - but it would be nice to be able to do it in
step 4...

If we dedup here, how can we make sure we also delete the duplicates shared between this "new" content and the data already being searched on the search servers?



3. then we deploy the segments data to search nodes in the following steps:

- slice the segments data into units of fixed size (determined from config, or obtained via IPC from individual search nodes). The slicing could be done just linearly, or in a smarter way (e.g. by creating slices sorted by md5 hash)
- send each slice to 1 or more DS$Servers (applying some redundancy algo.)
- after the segment is deployed, send a command to a selected search node to "mount" the slice.
- make a note of the segments' locations (in a similar fashion to the NDFS$NameNode), and of which one is active at the moment.
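
To check my understanding of the slicing step, here's a rough sketch of what md5-sorted slicing plus redundant placement might look like (all class and server names here are made up for illustration; none of this is existing Nutch code):

    import java.util.*;

    // Sketch only: slice segment entries by sorted md5 and assign each
    // slice to REPLICATION servers round-robin. Names are hypothetical.
    public class SliceSketch {
      static final int SLICE_SIZE = 100000; // entries per slice (from config)
      static final int REPLICATION = 2;     // copies kept of each slice

      public static void main(String[] args) {
        List<String> md5s = loadSegmentEntryMd5s();
        Collections.sort(md5s);             // the "smarter" md5-sorted slicing

        List<String> servers = Arrays.asList("node1", "node2", "node3");
        int slices = (md5s.size() + SLICE_SIZE - 1) / SLICE_SIZE;

        for (int i = 0; i < slices; i++) {
          int from = i * SLICE_SIZE;
          int to = Math.min(from + SLICE_SIZE, md5s.size());
          List<String> slice = md5s.subList(from, to);

          for (int r = 0; r < REPLICATION; r++) {   // redundancy algo.
            String target = servers.get((i + r) % servers.size());
            System.out.println("deploy slice " + i + " (" + slice.size()
                + " entries) -> " + target);
            // then: tell one replica to "mount" the slice, and record
            // slice -> servers in the name-node-like registry
          }
        }
      }

      static List<String> loadSegmentEntryMd5s() { // stand-in for a segment reader
        return new ArrayList<>(Arrays.asList("md5a", "md5b", "md5c"));
      }
    }

A nice property of sorting by md5 first is that each slice covers a contiguous md5 range, which I'd guess makes later lookups and dedup by content hash easier.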

4. now I guess there is some work to do on the search server. The newly
received slice needs to be indexed and de-duplicated with the already
existing older slices on the server. It would be nice to have some
method to do this across the whole cluster of search servers before the
slices are sent to search servers, but if not, the global de-duplication
must take place in step 2.

How would this work?
(1) Have all the search servers send the relevant info (URL, content md5, etc.) to a single location (NDFS?).
(2) Find duplicates and which to delete on one server/process.
(3) Then tell the search servers what to delete from their indexes.
I'm just throwing this out; I'm not sure about the performance/scalability of it, or even whether there's a worthwhile advantage to doing it this way. A rough sketch of what I mean follows.
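
Concretely, I'm imagining something like this (purely a hypothetical sketch, not existing code): each server reports (md5, url, server) records, one process keeps the first record per content md5, and everything else goes onto a per-server delete list:

    import java.util.*;

    // Sketch only: centralized dedup over records reported by the
    // search servers. Keep one doc per content md5, delete the rest.
    public class DedupSketch {
      static class Doc {
        String md5, url, server;
        Doc(String md5, String url, String server) {
          this.md5 = md5; this.url = url; this.server = server;
        }
      }

      public static void main(String[] args) {
        List<Doc> reported = Arrays.asList(      // step (1): collected reports
            new Doc("abc", "http://a/", "node1"),
            new Doc("abc", "http://b/", "node2"),   // same content, a dup
            new Doc("def", "http://c/", "node2"));

        Map<String, Doc> keep = new HashMap<>();             // md5 -> kept doc
        Map<String, List<String>> deletes = new HashMap<>(); // server -> urls

        for (Doc d : reported) {                 // step (2): find duplicates
          if (keep.putIfAbsent(d.md5, d) != null) {          // seen this content
            deletes.computeIfAbsent(d.server, k -> new ArrayList<>()).add(d.url);
          }
        }

        // step (3): each server gets its own delete list
        for (Map.Entry<String, List<String>> e : deletes.entrySet()) {
          System.out.println(e.getKey() + ": delete " + e.getValue());
        }
      }
    }

The scalability question is then just how big the (url, md5) table gets; even for a large index it should be tiny compared to the content itself, so I suspect a single pass on one box would be fine.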



5. selected search server "mounts" the newly received and indexed slice, and makes it available for searching. Optionally, the new slice can be merged into a single segment with other already existing slices.


Much of the logic from NDFS can be reused for selecting the "active" slice, checking the heartbeat and so on.

It sounds like you're implying that only one copy of a slice would be active at a given time. Is that correct? That makes sense when, say, we have 3 search servers and each can only reasonably handle 1/3 of the total indexed data.

However, to increase search throughput someone might want 9 servers that can each handle 1/3 of the total indexed data, in which case it would be good to have the search master load-balance the searches. Adding that ability means we can't just broadcast every search to every search server: we need to make sure we don't search multiple active copies of the same "shard" sitting on different search servers.

Perhaps the solution is to have the master keep track of every "shard", which servers it is on, and whether it is active on each. Then when searching, the master sends (or tells clients to send) the search to a subset of the search servers, and also tells each server which shard(s) it should return results for. Maybe this is already what you were planning; if so, ignore this. :-)
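
To make that concrete, here's a minimal sketch of the bookkeeping I have in mind (hypothetical, not existing Nutch code): the master keeps a shard -> replica-servers map and, for each search, picks one live replica per shard, so every shard is searched exactly once and load spreads across the replicas:

    import java.util.*;

    // Sketch only: per query, assign each shard to exactly one of its
    // replica servers, so no shard is searched twice and load is spread.
    public class ShardRouter {
      private final Map<String, List<String>> shardToServers = new HashMap<>();
      private final Random rnd = new Random();

      void register(String shard, String server) {
        shardToServers.computeIfAbsent(shard, s -> new ArrayList<>()).add(server);
      }

      // server -> shards it should answer for this particular query
      Map<String, List<String>> plan() {
        Map<String, List<String>> plan = new HashMap<>();
        for (Map.Entry<String, List<String>> e : shardToServers.entrySet()) {
          List<String> replicas = e.getValue();
          String chosen = replicas.get(rnd.nextInt(replicas.size())); // balance
          plan.computeIfAbsent(chosen, s -> new ArrayList<>()).add(e.getKey());
        }
        return plan;
      }

      public static void main(String[] args) {
        ShardRouter r = new ShardRouter();
        r.register("shard0", "node1"); r.register("shard0", "node4");
        r.register("shard1", "node2"); r.register("shard1", "node5");
        r.register("shard2", "node3"); r.register("shard2", "node6");
        System.out.println(r.plan()); // e.g. {node4=[shard0], node2=[shard1], ...}
      }
    }

The master (or the client, if told) would then send the query to exactly the servers in the plan, passing along the shard list each one should answer for.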


Luke Baker


Now, if one of the search servers goes down (as detected by heartbeat messages), the "name node" sends messages to other search nodes that contain segment replicas of the ones on the failed box. The missing segment is back online now. The "name node" notes that there are too few replicas of this segment, and initiates a transfer of this segment to one of the other search boxes (again, the same logic already exists in NDFS).


Additional steps are also needed to "populate" newly added blank boxes (e.g. when you replace a failed box, or when you want to increase the total number of search nodes), and this logic also is already present in NDFS.

Any comments or suggestions are highly appreciated...



