Adding some comments to the email below, but here on nutch-dev.

Basically, my feeling is that this inefficiency will show up whenever 
fetchlists (and their parts) are not well balanced.
Concretely, whichever task gets stuck fetching from a slow server that has a 
lot of its pages in the fetchlist will prolong the whole fetch job.  A slow 
server with lots of pages is a bad combination, and I see that a lot.  Perhaps 
it's the nature of my crawl - it is constrained, not web-wide, with the number 
of distinct hosts around 15-20K?

* Example fetchlist part:
slow.com/1
fast.com/1
ok.com/1
slow.com/2
fast.com/2
ok.com/2
ok.com/3
slow.com/3
slow.com/4
slow.com/5
slow.com/6

* The above fetchlist part will take a lot longer than this one:
speedy.com/1
speedy.com/2
speedy.com/3
speedy.com/4
superspeedy.com/1
ok2.com/1
ok2.com/2
speedy.com/5
speedy.com/6
speedy.com/7
ok2.com/3
speedy.com/8

The task processing the first set of URLs will be slower because it got the 
slow server slow.com, and slow.com happens to have a lot of pages in that 
fetchlist part.  The task processing the second set of URLs will be quick, 
since all of its servers are pretty fast.
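To put rough numbers on the imbalance: a polite fetcher keeps one queue per host (one connection, fixed delay between requests), so a task can never finish faster than its most "expensive" host - roughly URL count times (politeness delay + response time).  A minimal back-of-the-envelope sketch, where the 5s delay and the 30s slow-server response time are illustrative assumptions, not measured values:

```python
from collections import Counter

# Hypothetical numbers: a 5s per-host politeness delay, plus an assumed
# average response time per server class (30s for slow.com, 0.5s elsewhere).
DELAY = 5.0
LATENCY = {"slow.com": 30.0}
DEFAULT_LATENCY = 0.5

def lower_bound_seconds(fetchlist):
    """With one polite queue per host, a task's runtime is bounded below
    by its busiest host: count * (delay + response time)."""
    per_host = Counter(url.split("/")[0] for url in fetchlist)
    return max(n * (DELAY + LATENCY.get(host, DEFAULT_LATENCY))
               for host, n in per_host.items())

part1 = ["slow.com/%d" % i for i in range(1, 7)] + \
        ["fast.com/1", "fast.com/2", "ok.com/1", "ok.com/2", "ok.com/3"]
part2 = ["speedy.com/%d" % i for i in range(1, 9)] + \
        ["superspeedy.com/1", "ok2.com/1", "ok2.com/2", "ok2.com/3"]

print(lower_bound_seconds(part1))  # 6 * (5 + 30)  = 210.0
print(lower_bound_seconds(part2))  # 8 * (5 + 0.5) = 44.0
```

Even though the second part has more URLs on its busiest host, it finishes far sooner, because per-request cost dominates the bound - which matches what I'm seeing with real fetch tasks.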

Some questions:
Are there ways around this?
Are others not seeing the same behaviour?
Is this just the nature of my crawl - constrained and with only 15-20K unique 
servers?

If others are seeing this behaviour, then I'm wondering whether anyone has 
thoughts about improving this, either before or after the 1.0 release?  For 
instance, maybe things would be better with that HostDb and a Generator that 
knows not to produce fetchlists with lots of URLs from slow servers?  Or maybe 
there is a way to keep feeding Fetchers with URLs from other sites, so their 
idle threads can be kept busy instead of sitting in spinWait?
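On the Generator idea: even without a HostDb's notion of "slow", simply capping how many URLs one host contributes to a single fetchlist part would bound the damage any one slow server can do (I believe Nutch's generate.max.per.host property is along these lines).  A standalone sketch of that capping step - not Nutch's actual code, and MAX_PER_HOST is an arbitrary illustrative value:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative cap: no host may contribute more than this many URLs
# to one fetchlist part; the overflow is deferred to a later segment.
MAX_PER_HOST = 3

def cap_per_host(urls, max_per_host=MAX_PER_HOST):
    """Split urls into (kept, deferred) so that no single host has more
    than max_per_host entries in the kept list."""
    kept, deferred = [], []
    seen = defaultdict(int)
    for url in urls:
        host = urlparse(url).hostname
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
        else:
            deferred.append(url)  # picked up by a later fetchlist
    return kept, deferred

urls = ["http://slow.com/%d" % i for i in range(1, 7)] + \
       ["http://fast.com/1", "http://ok.com/1"]
kept, deferred = cap_per_host(urls)
print(len(kept), len(deferred))  # 5 kept, 3 slow.com URLs deferred
```

The trade-off is that a big slow host takes more segments to drain, but no single fetch task sits on it for hours while the rest of the cluster idles.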

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: Nutch User List <[EMAIL PROTECTED]>
> Sent: Monday, April 21, 2008 4:16:24 PM
> Subject: Fetching inefficiency
> 
> Hello,
> 
> I am wondering how others deal with the following, which I see as fetching 
> inefficiency:
> 
> When fetching, the fetchlist is broken up into multiple parts and fetchers on 
> cluster nodes start fetching.  Some fetchers end up fetching from fast 
> servers, and some from very, very slow servers.  Those fetching from slow 
> servers take a long time to complete and prolong the whole fetching process.  
> For instance, I've seen tasks from the same fetch job finish in only 1-2 
> hours, and others in 10 hours.  Those taking 10 hours were stuck fetching 
> pages from a single slow site or a handful of them.  If you have two nodes 
> doing the fetching and one is stuck with a slow server, the other one is 
> idling and wasting time.  The node stuck with the slow server is also 
> underutilized, as it's slowly fetching from only 1 server instead of many.
> 
> I imagine anyone using Nutch is seeing the same.  If not, what's the trick?
> 
> I have not tried overlapping fetching jobs yet, but I have a feeling that 
> won't help a ton, plus it could lead to two fetchers fetching from the same 
> server and being impolite - am I wrong?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
