I thought it was along those lines. Do you know if there is any way I can
influence the fetch so that I get an even spread of URLs without setting
topN to a high number (e.g. 10,000,000)? A value that high causes Java to
consume 100% CPU.

Thanks.


Briggs wrote:
> 
> Well, the quick/simple explanation is:
> 
> If you have 5 URLs with their associated Nutch scores:
> 
> http://a.com/something1 = 5.0
> http://b.com/something2 = 4.0
> http://c.com/something3 = 3.0
> http://d.com/something4 = 2.0
> http://e.com/something5 = 1.0
> 
> If you then set Nutch to crawl with topN = 3, a, b and c will be fetched
> and d and e will not.  It just means "give me the 3 best-ranking URLs"
> from the current crawl database.
> 
> On 6/8/07, monkeynuts84 <[EMAIL PROTECTED]> wrote:
>>
>> Can someone give me an explanation of what topN does? I've seen various
>> pieces of info but some of them seem to be conflicting. I've noticed in
>> my crawls that certain sites are crawled more than others in each
>> iteration of a fetch. Is this caused by topN?
>>
>> Thanks.
>> --
>> View this message in context:
>> http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11033441
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> "Conscious decisions by conscious minds are what make reality real"
> 
> 
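
The quoted explanation can be sketched in a few lines of Python. This is
only an illustration of "take the N best-scoring URLs", assuming the scores
are held in a plain dictionary; it is not Nutch's actual generator code, and
the names select_top_n and crawl_db are made up for the example.

```python
from typing import Dict, List


def select_top_n(scores: Dict[str, float], top_n: int) -> List[str]:
    """Return the top_n URLs with the highest scores, best first.

    Hypothetical helper for illustration only; Nutch does not expose
    a function with this name.
    """
    # Sort (url, score) pairs by score, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only the URLs of the first top_n entries.
    return [url for url, _ in ranked[:top_n]]


# The five example URLs and scores from the quoted message.
crawl_db = {
    "http://a.com/something1": 5.0,
    "http://b.com/something2": 4.0,
    "http://c.com/something3": 3.0,
    "http://d.com/something4": 2.0,
    "http://e.com/something5": 1.0,
}

selected = select_top_n(crawl_db, 3)
skipped = [url for url in crawl_db if url not in selected]

print("fetched:", selected)
print("skipped:", skipped)
```

With topN = 3 the a, b and c URLs are selected and d and e are left out,
as in the example above. If the best scores all belong to a few hosts, a
selection like this would also favour those hosts.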

-- 
View this message in context: 
http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11035013
Sent from the Nutch - User mailing list archive at Nabble.com.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general