Hi,

 I tried both -numFetchers 1 and -numFetchers 4, and both times I got 2 
sequential fetches that each lasted 13 minutes.

 Thanks.


----- Original Message -----
From: Julien Nioche
Sent: 28.09.11 17:16
To: [email protected]
Subject: Re: Fetch performance

 Hi,

 Check the value of the parameter '-numFetchers' when calling generate. I
 guess you are using a value of 2 in non-distributed mode, i.e. the two parts
 are fetched in sequential order.

 I'd strongly advise moving to a more recent version of Nutch if you can.
 A considerable number of improvements have been added since 1.0.

 Julien

 On 28 September 2011 15:50, Danicela nutch <[email protected]> wrote:

 > Hi,
 >
 > My config is:
 >
 > Nutch 1.0.
 > generate.max.per.host = 130
 > fetcher.server.delay = 5
 > fetcher.threads.fetch = 50
 > number of hosts in seeds = 30
 >
 > If the fetch was effective, we would get 130 * 6 (5+1 imprecision) seconds
 > = 13 min for a fetch.
 >
 > According to the results, a fetch lasts 26 minutes.
 >
 > When I analyse hadoop.log, I notice that some sites are fetched during
 > the first 13 minutes, and the other sites, which weren't fetched before the
 > 13th minute, begin to be fetched after the 13th minute. These sites are
 > fetched until the 26th minute.
 >
 > I can conclude that the fetch takes twice as long as it should,
 > because some of the sites are fetched only after the others. (Some STATS are
 > produced between the 2 steps.)
 >
 > How can we prevent this split? I mean, how can we force all sites to be
 > fetched from the beginning?
 >
 > Thanks in advance for helping.

 --
 Open Source Solutions for Text Engineering
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
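
The arithmetic discussed in this thread can be checked with a short Python sketch. It is a rough model only, not Nutch code: it assumes the slowest host dominates each fetch part, that every request costs the configured delay plus about 1 second of overhead (the "5+1 imprecision" above), and that fetch parts run one after another in non-distributed mode. The function name and its parameters are hypothetical.

```python
def expected_fetch_minutes(max_per_host, delay_s, overhead_s=1.0, sequential_parts=1):
    """Estimate total fetch time in minutes under a simple politeness model.

    Assumption: one host with `max_per_host` URLs is fetched strictly one
    URL at a time, each costing `delay_s + overhead_s` seconds, and
    `sequential_parts` fetch parts run back to back.
    """
    seconds_per_part = max_per_host * (delay_s + overhead_s)
    return sequential_parts * seconds_per_part / 60.0


# Values taken from the config quoted in the thread.
one_part = expected_fetch_minutes(130, 5)
two_parts = expected_fetch_minutes(130, 5, sequential_parts=2)
print(one_part, two_parts)
```

With generate.max.per.host = 130 and fetcher.server.delay = 5, one part comes to 13 minutes and two sequential parts to 26 minutes, which matches the durations reported above.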
