[Nutch-dev] RE: fetcher question: why multithreaded?

Peter Veentjer - Anchor Men Mon, 05 Sep 2005 03:43:06 -0700

 
>>I'm currently fetching with 35 threads. The CPU load is about 5-10%
>>(P4 3.0 HT).
>>Parsing obviously isn't using many resources.
-------------------------------------------------------------------
What kind of network connection do you have?
If the network connection is the bottleneck, then it won't matter.
But if you have a 100 Mbit line, I think you could use more than 35
threads, and then context switching could be a big price to pay.
With Java NIO it is possible to let a single thread read from many
sockets without blocking, so no time is lost to context switching.
This approach is generally used in larger systems, so I'm wondering
why Nutch isn't using it.
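
To make the single-threaded idea concrete, here is a minimal sketch of
one thread reading from several sockets through a java.nio Selector.
It is not Nutch code: the class name SelectorReadSketch and the method
readAll are made up for illustration. It only reads until the remote
side closes the connection; a real fetcher would also have to send an
HTTP request and handle timeouts, redirects, and politeness rules.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration, not part of Nutch.
public class SelectorReadSketch {

    /**
     * Reads each address until the remote side closes the connection,
     * using one thread and one Selector. Results are in input order.
     */
    public static List<byte[]> readAll(List<InetSocketAddress> addresses) throws IOException {
        List<ByteArrayOutputStream> sinks = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            for (InetSocketAddress address : addresses) {
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                sinks.add(sink);
                SocketChannel channel = SocketChannel.open();
                channel.configureBlocking(false);
                // A non-blocking connect usually completes later (OP_CONNECT),
                // but it may also succeed immediately.
                boolean connected = channel.connect(address);
                channel.register(selector,
                        connected ? SelectionKey.OP_READ : SelectionKey.OP_CONNECT, sink);
            }
            ByteBuffer scratch = ByteBuffer.allocate(8192);
            int open = addresses.size();
            while (open > 0) {
                selector.select(); // blocks until at least one channel is ready
                Iterator<SelectionKey> ready = selector.selectedKeys().iterator();
                while (ready.hasNext()) {
                    SelectionKey key = ready.next();
                    ready.remove();
                    SocketChannel channel = (SocketChannel) key.channel();
                    ByteArrayOutputStream sink = (ByteArrayOutputStream) key.attachment();
                    try {
                        if (key.isConnectable()) {
                            if (channel.finishConnect()) {
                                key.interestOps(SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            scratch.clear();
                            int n = channel.read(scratch);
                            if (n < 0) { // end of stream
                                channel.close();
                                open--;
                            } else {
                                sink.write(scratch.array(), 0, n);
                            }
                        }
                    } catch (IOException e) {
                        channel.close(); // give up on this connection only
                        open--;
                    }
                }
            }
        }
        List<byte[]> results = new ArrayList<>();
        for (ByteArrayOutputStream sink : sinks) {
            results.add(sink.toByteArray());
        }
        return results;
    }
}
```

The thread only wakes up when some socket has data, so the number of
open connections is no longer tied to the number of threads. Parsing
would then have to happen elsewhere, since any slow work inside the
select loop stalls every connection.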


>>Removing parsing also would not speed up the fetching process.
>>If parsing (while fetching) is removed (with a command line argument),

>>I'll probably tune the fetcher down to 30 threads and have the same 
>>overall fetching speed.
---------------------------------------------------------------------
The same amount of work has to be done, so it is logical that moving
the parsing part wouldn't give a performance boost. But it is more
difficult to use parts of Nutch in a different context if components
have a lot of responsibilities.

We want to use Lucene (and it would be nice if we could use large
parts of Nutch) in a search system that has to be very scalable.
That is why I need components that each do a single thing and into
which many dependencies can be injected.
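
As a rough sketch of what I mean by single-purpose components with
injected dependencies: the names below (PipelineSketch, PageFetcher,
PageParser, FetchAndParse) are hypothetical and do not correspond to
existing Nutch classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Nutch.
public class PipelineSketch {

    /** Retrieves raw content for a URL and does nothing else. */
    public interface PageFetcher {
        String fetch(String url);
    }

    /** Turns raw content into extracted text and does nothing else. */
    public interface PageParser {
        String parse(String rawContent);
    }

    /** Wires the two steps together; both collaborators are injected. */
    public static final class FetchAndParse {
        private final PageFetcher fetcher;
        private final PageParser parser;

        public FetchAndParse(PageFetcher fetcher, PageParser parser) {
            this.fetcher = fetcher;
            this.parser = parser;
        }

        /** Fetches and parses each URL, returning results in input order. */
        public List<String> run(List<String> urls) {
            List<String> parsed = new ArrayList<>();
            for (String url : urls) {
                parsed.add(parser.parse(fetcher.fetch(url)));
            }
            return parsed;
        }
    }
}
```

Because the fetcher and the parser are separate interfaces, either one
can be replaced or reused in another system without touching the other.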




-----Original Message-----
From: Peter Veentjer - Anchor Men [mailto:[EMAIL PROTECTED]
Sent: Monday, September 05, 2005 6:01 AM
To: [email protected]
Subject: fetcher question: why multithreaded?

Hi,
 
I'm looking at the code of the fetcher and have the following question:
why does the fetcher do more than fetching? Wouldn't it be better to
move the page parsing to another component and let the fetcher only
fetch (so that the fetch threads only do fetching)?
 
Another problem with this threaded approach is that you need a lot of
threads, because a single thread is responsible both for retrieving
data and for parsing it. If you remove the parsing part, a thread would
only be responsible for fetching. That makes it possible to use a
single thread in the Fetcher that gathers data from a lot of sockets,
which reduces context-switching overhead. This is a technique widely
used in search engines, and I'm curious why Nutch takes a different
approach.
 
 
 
 

Kind regards,

Peter Veentjer
Anchor Men Interactive Solutions - clear in business internet
solutions

Praediniussingel 41
9711 AE Groningen

T: 050-3115222
F: 050-5891696
E: [EMAIL PROTECTED]
I: www.anchormen.nl

 




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
