Markus Thomas wrote:
Hello everyone,
First of all, I am new to Nutch. I installed Nutch on my internet server
and tried to start crawling the internet.
I understood that there are two ways to generate a fetchlist.
First, using the -topN parameter to generate a limited list of the top
rated domains. Second, without the -topN parameter, it generates a
fetchlist of all unfetched URLs. That's what I want to do now, but I
don't want to fetch ALL uncrawled domains at once.
So is there a way to crawl unfetched URLs, but limit that to
1000 URLs, or something similar?
Yes. First you would need to inject a list of URLs. Search the Nutch
list or take a look at the wiki for injecting the DMOZ database. That
will give you a starting point.
All URLs start off with the same score. Using -topN once a list is
injected will limit each generated fetchlist to at most X URLs. So you
would go through this process: inject once, then run generate, fetch, and
update crawldb, looping on that generate-to-update cycle for X number of
shards. Then either merge segments and index, or index and merge indexes,
or deploy the shards out to individual servers. Lather, rinse, repeat:
start the whole process over again from generate.
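As a rough sketch of that cycle (not a definitive script), the Python
below only builds and prints the command lines for one inject followed by
a few generate/fetch/updatedb rounds capped at 1000 URLs each. The paths
crawl/crawldb, crawl/segments and urls, the helper name crawl_commands,
and the <segment-N> placeholders are assumptions for illustration.
generate creates a new timestamped segment directory on each run, so a
real script would have to look up the newest segment instead of using a
placeholder. Depending on the Nutch version, a separate parse step may
also be needed between fetch and updatedb.

```python
from typing import List

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def crawl_commands(crawldb: str, segments_dir: str, seed_dir: str,
                   top_n: int, rounds: int) -> List[List[str]]:
    """Build the command lines: inject once, then loop generate/fetch/updatedb."""
    # Inject the seed URLs into the crawldb a single time.
    commands = [[NUTCH, "inject", crawldb, seed_dir]]
    for i in range(1, rounds + 1):
        # Placeholder only: the real segment name is a timestamp chosen by generate.
        segment = f"{segments_dir}/<segment-{i}>"
        # -topN caps the fetchlist for this round at top_n URLs.
        commands.append([NUTCH, "generate", crawldb, segments_dir,
                         "-topN", str(top_n)])
        commands.append([NUTCH, "fetch", segment])
        # Fold the fetch results back into the crawldb before the next round.
        commands.append([NUTCH, "updatedb", crawldb, segment])
    return commands


if __name__ == "__main__":
    for cmd in crawl_commands("crawl/crawldb", "crawl/segments", "urls", 1000, 3):
        print(" ".join(cmd))
```

Each round fetches at most 1000 of the still unfetched URLs, so repeating
the loop works through the crawldb in bounded batches.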
Dennis
Thank you and best regards,
Markus Thomas