Markus Thomas wrote:
Hello everyone,

first of all, I am new to Nutch. I installed Nutch on my internet server and tried to start crawling the internet. I understood that there are two ways to generate a fetchlist. First, using the parameter -topN generates a limited list of the top-rated domains. Second, without the -topN parameter it generates a fetchlist of all unfetched URLs. That's what I want to do now, but I don't want to fetch ALL uncrawled domains at once. So is there a way to crawl unfetched URLs but limit that to 1000 URLs, or something similar?

Yes. First you need to inject a list of URLs. Search the Nutch mailing list or take a look at the wiki for how to inject the DMOZ database. That will give you a starting point.

All URLs start off with the same score. Once a list is injected, generating with -topN limits each fetchlist to the top N URLs, so -topN 1000 gives you the 1000-URL limit you asked about. The process is: inject once, then loop over the generate, fetch, update crawldb cycle, once for each shard you want. After that, either merge the segments and index, or index and merge the indexes, or deploy the shards out to individual servers. Lather, rinse, repeat: start the whole process over again from generate.
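As a rough illustration of that loop, here is a minimal Python sketch that drives the Nutch command line. It assumes the launcher is at bin/nutch and that the inject, generate, fetch, and updatedb commands take the arguments shown, as in the Nutch tutorial. The crawl/crawldb, crawl/segments, and urls paths and the helper names are made up for the example. Depending on your Nutch version, a separate parse step may be needed between fetch and updatedb.

```python
import os
import subprocess
from typing import List

NUTCH = "bin/nutch"          # assumed location of the Nutch launcher script
CRAWLDB = "crawl/crawldb"    # hypothetical crawldb path
SEGMENTS = "crawl/segments"  # hypothetical segments directory


def generate_cmd(top_n: int) -> List[str]:
    # -topN caps the fetchlist at the top_n highest-scoring URLs due for fetching.
    return [NUTCH, "generate", CRAWLDB, SEGMENTS, "-topN", str(top_n)]


def newest_segment(segments_dir: str) -> str:
    # Segment directories are named by timestamp, so the last name in
    # sorted order is the one the most recent generate created.
    names = sorted(
        d for d in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, d))
    )
    return os.path.join(segments_dir, names[-1])


def crawl_round(top_n: int) -> None:
    # One generate, fetch, update crawldb cycle.
    subprocess.run(generate_cmd(top_n), check=True)
    segment = newest_segment(SEGMENTS)
    subprocess.run([NUTCH, "fetch", segment], check=True)
    subprocess.run([NUTCH, "updatedb", CRAWLDB, segment], check=True)


def main(rounds: int = 3, top_n: int = 1000) -> None:
    # Inject the seed list once ("urls" is a hypothetical seed directory),
    # then repeat the cycle; each round produces one new segment.
    subprocess.run([NUTCH, "inject", CRAWLDB, "urls"], check=True)
    for _ in range(rounds):
        crawl_round(top_n)
```

Calling main() from the Nutch install directory would inject the seeds and run three rounds of at most 1000 URLs each. Merging and indexing the resulting segments is left out here.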

Dennis



Thank you and best regards,
Markus Thomas
