Nutch behaves ... So by default it will not fetch more than 1 URL every 5s (the delay is configurable) from a given host (identified by name or by IP, depending on the Nutch conf file). So you will actually find the opposite: it is very slow on a single site. Speed comes when you fetch several sites in parallel.
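
For illustration, here is a minimal sketch of how that delay could be raised in conf/nutch-site.xml. It assumes the stock property name fetcher.server.delay from nutch-default.xml; the value 10.0 is only an example, not a recommendation, and the by-name or by-IP choice is controlled by a separate property whose name depends on the Nutch version, so it is left out here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Seconds the fetcher waits between successive requests to the same
       server. The default is 5.0; 10.0 is an illustrative, gentler value. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>10.0</value>
  </property>
</configuration>
```

Anything set in nutch-site.xml overrides the value in nutch-default.xml, so the defaults file itself can stay untouched.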

2009/12/4, Jesse Hires <[email protected]>:
> use the -topN flag to only grab a small number of URLs.
> I believe there is also a setting you can put in nutch-site.xml that
> can be used to slow down how many URLs you grab over time.
>
> Jesse
>
> int GetRandomNumber()
> {
>    return 4; // Chosen by fair roll of dice
>              // Guaranteed to be random
> } // xkcd.com
>
>
>
> On Fri, Dec 4, 2009 at 4:10 AM, Mr Hadoop <[email protected]> wrote:
>
>> I am just starting to learn nutch. One question I wanted to know is:
>> can nutch pause, stop and start indexing a site on an incremental
>> daily basis?
>> My concern with nutch is that nutch behaving like a hog and crawling
>> everything with huge bandwidth consumption and pissing off the many site
>> owners.
>>
>> Can some experts shed some light on this?
>>
>

--
-MilleBii-
