I thought it was along those lines. Do you know if there is any way I can influence the fetch so that I get an even spread of URLs without setting topN to a very high number (e.g. 10,000,000)? Setting it that high causes Java to consume 100% CPU.
Thanks.

Briggs wrote:
>
> Well, the quick/simple explanation is:
>
> If you have 5 URLs with their associated Nutch scores:
>
> http://a.com/something1 = 5.0
> http://b.com/something2 = 4.0
> http://c.com/something3 = 3.0
> http://d.com/something4 = 2.0
> http://e.com/something5 = 1.0
>
> and you set Nutch to crawl with topN = 3, then a, b and c will be fetched
> and d and e will not. It just means "give me the 3 best ranking URLs"
> from the current crawl database.
>
> On 6/8/07, monkeynuts84 <[EMAIL PROTECTED]> wrote:
>>
>> Can someone give me an explanation of what topN does? I've seen various
>> pieces of info but some of them seem to be conflicting. I've noticed in
>> my crawls that certain sites are crawled more than others in each
>> iteration of a fetch. Is this caused by topN?
>>
>> Thanks.
>> --
>> View this message in context:
>> http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11033441
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> "Conscious decisions by conscious minds are what make reality real"

--
View this message in context: http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11035013

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
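
To make the topN explanation in the quoted reply concrete, here is a toy Python sketch of "give me the N best ranking URLs". It is only a model of the idea under stated assumptions, not Nutch's actual generator code: the function name `select_top_n` and the plain dictionary standing in for the crawl database are made up for this example.

```python
def select_top_n(scores, top_n):
    """Return the top_n highest-scoring URLs, best first.

    `scores` is a hypothetical stand-in for the crawl database:
    a dict mapping URL -> score.
    """
    # Rank URLs by score, highest first, then keep only the first top_n.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# The five example URLs and scores from the explanation.
crawl_db = {
    "http://a.com/something1": 5.0,
    "http://b.com/something2": 4.0,
    "http://c.com/something3": 3.0,
    "http://d.com/something4": 2.0,
    "http://e.com/something5": 1.0,
}

selected = select_top_n(crawl_db, 3)
for url in selected:
    print(url)
```

With topN = 3 the a, b and c URLs are selected and d and e are left out. Because the cut is made purely on score in this model, a site whose pages all score highly would fill most of the slots, which would be consistent with some sites being fetched more than others in a given iteration.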
