Hey Andrei,

  Thanks a lot for the reply. That clears up a major doubt in my mind.
FYI, I experimented with crawling on a single machine using Hadoop DFS
and MapReduce. The largest experiment crawled around 300K pages from a
few thousand hosts. I could push the crawler to around 27 pages/sec
with 2000 threads. When I increased the number of threads to more than
3000, the jobs started failing.

I am now going to run a larger experiment on 3-4 machines and will
report the performance once I am done. Since I know the optimal number
of threads on 1 machine is 2000, should I scale the number of threads
linearly, to say 6000 for 3 machines, or will increasing the number of
map/reduce tasks linearly take care of the scaling?

Thanks,

-vishal.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 25, 2006 5:46 PM
To: [email protected]
Subject: Re: -numFetchers in generate command

Vishal Shah wrote:
> Hi Andrei,
>
>    I am running some experiments to figure out what numThreads param to
> use while fetching on my machine. I made the mistake of putting the # of
> map/reduce tasks in hadoop-site.xml and not in mapred-default.xml,
> however I can clearly see a change in performance for different numbers
> of threads (I tested using 5 different options, ranging from 10 to
> 2000).
>
>   I was wondering why I am seeing these performance changes even though
> the number of reduce parts is only 2 for all the experiments. Also, how
> is the number of fetcher threads param used during generate related to
> the numthreads param used during fetch?
>   

Well, you will always run as many fetching (map) tasks as the parts you
created when running Generator's reduce phase. Now, each fetching task 
can run multiple fetching threads in parallel ... so, as you increase 
the number of threads your fetching performance will likely increase 
(unless you face some other limits, like the blocked addresses and your 
bandwidth limits).
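
As a rough illustration of that relationship, here is a minimal sketch of
the arithmetic only. It is not Nutch code: the function name and the
example numbers are hypothetical, and it assumes every fetching task
runs at the same time, which depends on how many task slots the cluster
has.

```python
def max_concurrent_fetch_threads(num_parts: int, threads_per_task: int) -> int:
    """Upper bound on fetcher threads running at once across the whole job.

    num_parts        -- parts produced by the Generator's reduce phase,
                        i.e. the number of fetching (map) tasks
    threads_per_task -- fetching threads started inside each task

    Assumption: all fetching tasks are scheduled simultaneously. If the
    cluster has fewer free task slots than parts, the real number is lower.
    """
    if num_parts < 1 or threads_per_task < 1:
        raise ValueError("both arguments must be positive integers")
    return num_parts * threads_per_task


# Hypothetical example: 2 parts, 100 threads in each fetching task.
print(max_concurrent_fetch_threads(2, 100))  # prints 200
```

The thread count is only a ceiling on parallelism; the blocked addresses
and bandwidth limits mentioned above can keep the real throughput well
below what the number of threads suggests.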

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
