Re: Crawl the Internet - Limit the fetchlist of unfetched urls

Dennis Kubes Sat, 10 Jan 2009 07:46:02 -0800


Markus Thomas wrote:

Hello everyone,
first of all, I am new to nutch. I installed nutch on my internet server and tried to start crawling the internet. I understood that there are two ways to generate a fetchlist. First, using the parameter -topN to generate a limited list of the top-rated domains. Second, without the topN parameter it generates a fetchlist of all unfetched urls. That's what I want to do now, but I don't want to fetch ALL uncrawled domains at a time. So is there a way to crawl unfetched urls, but limit that to 1000 urls, or something similar?

Yes, first you would need to inject a list of urls. Search the nutch list or take a look at the wiki for injecting the DMOZ database. That will give you a starting point.

All urls start off with the same score. Using topN once a list is injected will limit the fetchlist to only X number of urls. So you would go through a process of inject once, (generate, fetch, update crawldb), loop on the generate-update cycle for x number of shards, then either merge segments and index, or index and merge indexes, or deploy out shards to individual servers. Rinse, lather, repeat: start the whole process over again from generate.
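
A rough sketch of that cycle as a shell function, assuming the command names from the Nutch tutorial of this era (inject, generate -topN, fetch, updatedb). The function name run_crawl_rounds, the NUTCH variable and the directory layout are made up for illustration, and picking the newest segment by sorting the directory names assumes segments are named by timestamp. Merging and indexing are left out.

```shell
#!/bin/sh
# Sketch only: inject once, then loop (generate, fetch, updatedb).
# NUTCH and run_crawl_rounds are hypothetical names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"

# usage: run_crawl_rounds <crawl_dir> <seed_dir> <rounds> <top_n>
run_crawl_rounds() {
    crawl_dir="$1"
    seed_dir="$2"
    rounds="$3"
    top_n="$4"

    # Inject the seed urls into the crawldb once.
    "$NUTCH" inject "$crawl_dir/crawldb" "$seed_dir" || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        # Generate a fetchlist limited to top_n urls.
        "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" \
            -topN "$top_n" || return 1

        # Assumption: segment directories sort by creation time,
        # so the last one is the segment that was just generated.
        segment=$(ls -d "$crawl_dir"/segments/* | sort | tail -n 1)

        # Fetch that segment, then fold the results back into the crawldb
        # so the next generate sees the newly discovered urls.
        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1

        i=$((i + 1))
    done
}
```

Something like `run_crawl_rounds crawl urls 5 1000` would then do five rounds of at most 1000 urls each, with `urls` being whatever directory holds your seed list.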


Dennis



Thank you and best regards,
Markus Thomas
