Luke Baker wrote:
(1) Have all the search servers send the relevant info (URL, content md5, etc.) to a single location (NDFS?).
(2) Find duplicates, and decide which to delete, on one server/process.
(3) Then tell the search servers what to delete from their indexes.
I'm just throwing this out, and I'm not sure about the performance/scalability of it or even if there is a worthwhile advantage to doing it this way.

Here's the way I've imagined doing it.

1. Assume that segments are smaller than what a single search node will search, e.g., 1M documents/segment, 10M documents/search node.

2. Segments are stored in an NDFS.

3. Segment indexes are merged for every N segments. These merged indexes are stored in NDFS. Individual segment indexes may be discarded. (Also assume we have a read-only Lucene Directory implementation that can read indexes stored in NDFS.)

4. Indexes are copied to search servers, one index per search server.

5. As segments age, and new segments are added, search servers are updated incrementally.

For example, with ten segments per index, each time ten new segments are added one would:
a. Merge the 10 new segments' indexes into a single index.
b. Remove the oldest 10 segments and their indexes.
c. Perform duplicate detection over all indexes.
d. Update the search node that was searching the 10 oldest segments with the 10 newest segments and their index, copying from NDFS to the local filesystem on the search node.
e. Update all other search node indexes with a new .del file.
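
To make the rotation in steps a, b and d concrete, here is a minimal sketch in Python. It only models the bookkeeping: which node holds which segments. The names (rolling_update, the node and segment ids) are made up for illustration and are not Nutch code; the merge itself, the copy from NDFS, and steps c and e are left out.

```python
from collections import deque

SEGMENTS_PER_INDEX = 10  # "ten segments per index", as in the example above


def rolling_update(nodes, new_segments):
    """Rotate the newest merged index onto the node holding the oldest one.

    nodes: deque of (node_name, [segment ids]), ordered oldest index first.
    new_segments: ids of the newly added segments (hypothetical ids).
    Returns the name of the node that was refreshed.
    """
    if len(new_segments) != SEGMENTS_PER_INDEX:
        raise ValueError("expected exactly %d new segments" % SEGMENTS_PER_INDEX)
    # Steps a and b: the merged index of the new segments replaces the
    # oldest index, so the oldest entry is dropped from the front.
    node_name, _oldest_segments = nodes.popleft()
    # Step d: the node that served the oldest segments now serves the newest.
    nodes.append((node_name, list(new_segments)))
    return node_name
```

With two nodes holding segments 0-9 and 10-19, adding segments 20-29 refreshes the first node and leaves the second one untouched apart from its new .del file.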


The dedup command is fast, so a 100M page index might be updated hourly, if so desired.
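
A rough sketch of what duplicate detection over several merged indexes could look like, again in Python and again hypothetical: find_deletions is not the existing DeleteDuplicate tool, and the rule for which copy survives (the one in the newest index) is my assumption for the example, not something settled above. Documents are matched on content md5 only.

```python
def find_deletions(indexes):
    """Compute, per index, which documents to mark as deleted.

    indexes: list of (index_name, [(doc_id, content_md5), ...]),
             ordered newest index first (assumption: newest copy is kept).
    Returns {index_name: [doc_id, ...]}.
    """
    seen = set()  # content hashes that already have a surviving copy
    deletions = {name: [] for name, _docs in indexes}
    for name, docs in indexes:
        for doc_id, md5 in docs:
            if md5 in seen:
                # A copy was already kept, so this one is a duplicate.
                deletions[name].append(doc_id)
            else:
                seen.add(md5)
    return deletions
```

Each per-index list is roughly what would go into that search node's new .del file in step e. This is a single pass over (doc id, md5) pairs with one set lookup per document.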

If a search server has a hardware failure then it can be reconstituted from the copy of its data in NDFS.

The parts of this strategy which are not yet implemented are:
1. NDFS-based read-only Lucene Directory (easy).
2. Version of DeleteDuplicate tool which works over merged indexes (easy).


Does this make sense?

Doug




_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
