Hey Andrei,

Thanks a lot for the reply. That clears up a major doubt in my mind.

FYI, I experimented with crawling on a single machine using Hadoop DFS and MapReduce. The largest experiment crawled around 300K pages from a few thousand hosts. I could push the crawler to around 27 pages/sec with 2000 threads. When I increased the number of threads beyond 3000, the jobs started failing.

I am now going to run a larger experiment on 3-4 machines and will report the performance once I am done. In this case, since I know the optimal number of threads on one machine is 2000, should I scale the number of threads linearly, to say 6000 for 3 machines, or will increasing the number of map/reduce tasks linearly take care of the scaling?

Thanks,

-vishal.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, August 25, 2006 5:46 PM
To: [email protected]
Subject: Re: -numFetchers in generate command

Vishal Shah wrote:
> Hi Andrei,
>
> I am running some experiments to figure out what numThreads param to
> use while fetching on my machine. I made the mistake of putting the # of
> map/reduce tasks in hadoop-site.xml and not in mapred-default.xml,
> however I can clearly see a change in performance for different numbers
> of threads (I tested using 5 different options, ranging from 10 to
> 2000).
>
> I was wondering why I am seeing these performance changes even though
> the number of reduce parts is only 2 for all the experiments. Also, how
> is the number of fetcher threads param used during generate related to
> the numthreads param used during fetch?
>

Well, you will always run as many fetching (map) tasks as the number of parts you created when running Generator's reduce phase. Now, each fetching task can run multiple fetching threads in parallel ... so, as you increase the number of threads your fetching performance will likely increase (unless you face some other limits, like the blocked addresses and your bandwidth limits).

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
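
A rough sketch of the arithmetic behind the scaling question in this thread, in Python. It follows Andrzej's description (one fetching map task per generated part, several fetcher threads inside each task) and additionally assumes, which the thread does not state, that fetch tasks are spread evenly across machines and that the thread setting applies per task. The names FetchPlan, threads_per_task_for_budget and pages_per_sec_per_thread are made up for illustration; they are not part of Nutch or Hadoop.

```python
from dataclasses import dataclass


@dataclass
class FetchPlan:
    """Hypothetical description of a fetch job layout (illustration only)."""
    machines: int            # nodes running fetch tasks
    tasks_per_machine: int   # fetch (map) tasks running at once on each node
    threads_per_task: int    # fetcher threads started inside each task

    def threads_per_machine(self) -> int:
        # Assumes tasks are evenly distributed across machines.
        return self.tasks_per_machine * self.threads_per_task

    def total_threads(self) -> int:
        return self.machines * self.threads_per_machine()


def threads_per_task_for_budget(per_machine_budget: int, tasks_per_machine: int) -> int:
    """Largest per-task thread count that keeps one machine within its budget."""
    if tasks_per_machine < 1:
        raise ValueError("tasks_per_machine must be at least 1")
    return per_machine_budget // tasks_per_machine


def pages_per_sec_per_thread(pages_per_sec: float, threads: int) -> float:
    """Average throughput contributed by a single fetcher thread."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return pages_per_sec / threads


if __name__ == "__main__":
    # Three machines, one fetch task each, same per-task thread count as
    # the single-machine run reported above.
    plan = FetchPlan(machines=3, tasks_per_machine=1, threads_per_task=2000)
    print(plan.threads_per_machine())  # 2000
    print(plan.total_threads())        # 6000

    # If two tasks share a machine, halve the per-task setting to stay
    # within the same 2000-thread budget per machine.
    print(threads_per_task_for_budget(2000, 2))  # 1000

    # Reported single-machine figures: about 27 pages/sec with 2000 threads.
    print(pages_per_sec_per_thread(27, 2000))
```

Under these assumptions, keeping the per-task thread count fixed and adding one task per machine already scales the total thread count linearly, while the load on each machine stays at the level that worked in the single-machine run. Whether throughput scales the same way depends on the other limits Andrzej mentions, such as blocked addresses and bandwidth.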
