Actually no.  Let's say you have 10 machines and hence 10 search 
servers.  You would run through 10 iterations of fetch-index-deploy, one 
to each machine.  Let's say you have 3 million pages per machine, so the 
whole system could support a 30 million page index.

Once you deploy to machine 10 you would want to start over, as you don't 
have any more space (machines, etc.).  So you would reset the crawldb 
(a special job that simply makes sure all pages are available for 
fetching and are not filtered out by their next fetch date).  Then you 
would run the next generate with topN, which would grab the next top 3 
million urls to be fetched again.  This fetch-index-deploy cycle would 
then replace (not overwrite) the deployment on search server 1, then 
2, 3, ... as you do more cycles.  This way the best urls continually 
rise to the top.

One point: there is no concept of depth, only of the top urls to fetch. 
With each cycle we update a single master crawldb, so the top urls will 
continually change.  But we are not fetching level by level as in the 
whole-web crawl tutorial.  While going through the cycle we don't reset 
the crawldb, so any pages fetched during the run across the machines 
won't get fetched again until we reset the crawldb, once all machines 
have been deployed and the whole cycle starts over.
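
Put another way, the outer loop just keeps cycling through the search 
servers and only resets the crawldb once every one of them has been 
refreshed.  Continuing the sketch above (the hostnames are hypothetical 
and reset_crawldb() is only a stand-in for that special reset job, not a 
stock bin/nutch command):

SEARCH_SERVERS = ["search%d" % i for i in range(1, 11)]   # hypothetical hostnames

def reset_crawldb(crawldb):
    # Stand-in for the "special job" that marks every page as fetchable
    # again (clears the next-fetch-date filtering); not a stock command.
    pass

while True:                          # continuous; runs until you stop it
    for server in SEARCH_SERVERS:
        one_pass(server)             # the deploy replaces whatever this server held
    reset_crawldb(CRAWLDB)           # only after all ten servers have been refreshed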

And yes, you may have some duplicates across your indexes, but this is 
taken care of in the search itself (there is a dedupField option in 
NutchBean).  Of the duplicates, the one with the best score (most 
relevant) should be returned.

This whole process is continuous and would just keep running until you 
tell it to stop.  Search would never be fully down, as only a single 
search server is down at any one time, and only for a few seconds while 
the database files are replaced.  And you would continually get the best 
urls in your index for the space you have.  I imagine this is very 
similar to how the Google dance works.
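
The swap on a single search server can be as simple as an rsync plus a 
restart of the searcher, something like this (the remote paths, the 
hostnames, and the restart command are all assumptions):

import subprocess

def deploy_to(server, index, segment):
    # Push the fresh index and segment to one search server, then bounce
    # its searcher.  Remote layout and restart script are assumptions; the
    # point is just that only this one server is briefly down while its
    # files are being replaced.
    dest = "%s:/opt/nutch/crawl/" % server
    subprocess.check_call(["rsync", "-a", "--delete", index, segment, dest])
    subprocess.check_call(["ssh", server, "/opt/nutch/bin/restart-searcher"])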

Dennis Kubes

charlie w wrote:
> On 8/1/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> I am currently writing a python script to automate this whole process
>> from inject to pushing out to search servers.  It should be done in a
>> day or two and I will post it on the wiki.
> 
> 
> I'm very much looking forward to this.  Reading the code always helps make
> it concrete to me.
> 
>> You can do a dedup of results on the search itself.  So yes there are
>> duplicates in the different index segments, but you will always be
>> returning the "best" pages to the user.
> 
> 
> I get it; so dedup based on the timestamp of each version of the document
> with a particular URL that was a hit.
> 
>>> It also seems that I must be missing something regarding new pages.
>>> If, as in step 9, you are replacing the index on a search server,
>>> wouldn't you possibly create the effect of removing documents from the
>>> index?  Say you have the same 2 search servers, but do 10 iterations of
>>> fetching as a "depth" of crawl.  Wouldn't you be replacing the documents
>>> in search server 1 several times over the course of those 10 iterations?
>> No because you are updating a single master crawldb and on the next
>> iteration it wouldn't grab the same pages, it would grab the next best n
>> pages.
> 
> 
> I had the impression you were overwriting the index on the search servers
> with the segment and index from the new iteration of fetching.  Meaning in
> my 2 search server example, iteration 3 of fetching would overwrite
> the index built by iteration 1 of fetching (they'd both wind up on search
> server 1).  But instead, you're actually merging the results of iteration 3
> into the search server's existing index from iteration 1, rather than
> replacing the entire index?
> 
> - C
> 
