Hi,
I wanted to try out Nutch and understand how to set up whole-Internet
crawling. The tutorial for whole-web crawling was easy to follow,
but I have some questions:
1. I have read that by default Nutch will recrawl URLs every
30 days. I said "Nutch", but I don't actually know what triggers
the recrawl. The fetcher stops as soon as all fetcher threads are
done, and the tutorial advises performing separate steps for
whole-web crawling: inject, generate, fetch, index.
Which command (or component) creates a thread that stays alive
and triggers the recrawl?
2. How are newly discovered URLs crawled?
3. How can I run Nutch crawler on multiple machines?
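For reference, this is roughly the cycle of commands I followed from the tutorial. The exact arguments depend on the Nutch version and on my local directory layout, so I've left them as placeholders:

```shell
# Sketch of one whole-web crawl cycle (arguments omitted; they vary by version):
bin/nutch inject ...      # seed the crawl db with the initial URL list
bin/nutch generate ...    # create a fetch list of URLs due for fetching
bin/nutch fetch ...       # run the fetcher threads over the generated segment
bin/nutch updatedb ...    # fold newly discovered links back into the crawl db
bin/nutch index ...       # index the fetched segment
```

Nothing in this sequence stays resident afterwards, which is what prompted question 1.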
I would appreciate your help!
Thanks,
Daniel
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general