Adding some comments to the email below, but here on nutch-dev. Basically, my feeling is that whenever fetchlists (and their parts) are not "well balanced", this inefficiency will show up. Concretely, whichever task gets stuck fetching from a slow server with a lot of its pages in the fetchlist will prolong the whole fetch job. A slow server with lots of pages is a bad combination, and I see that a lot. Perhaps it's the nature of my crawl - it is constrained, not web-wide, with the number of distinct hosts around 15-20K?

[snip]

Some questions: Are there ways around this? Are others not seeing the same behaviour? Is this just the nature of my crawl - constrained and with only 15-20K unique servers?

We often ran into the same problem while doing our vertical tech-pages crawl - a smaller number of unique hosts, and some really slow hosts dragging out the entire fetch cycle.

We added code that terminated slow fetches. After fooling around with some different approaches, I think we settled on terminating all remaining fetches once the number of active fetch threads dropped below a threshold derived from the total number of threads available. The ratio was set to 20% or so.
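Roughly, the shape of that idea as a minimal sketch (not the actual patch; FetchWatchdog, activeThreads, and abortRemainingFetches() are illustrative names):

import java.util.concurrent.atomic.AtomicInteger;

// Watchdog that cuts off the long tail of a fetch task: once fewer than
// minActiveRatio * totalThreads fetcher threads are still busy, the
// remaining fetches are terminated instead of being allowed to drag on.
public class FetchWatchdog implements Runnable {

  private final AtomicInteger activeThreads; // maintained by the fetcher threads
  private final int totalThreads;
  private final double minActiveRatio;       // e.g. 0.2 for "20% or so"

  public FetchWatchdog(AtomicInteger activeThreads, int totalThreads,
                       double minActiveRatio) {
    this.activeThreads = activeThreads;
    this.totalThreads = totalThreads;
    this.minActiveRatio = minActiveRatio;
  }

  public void run() {
    while (activeThreads.get() > 0) {
      if (activeThreads.get() < totalThreads * minActiveRatio) {
        abortRemainingFetches();
        return;
      }
      try { Thread.sleep(1000L); } catch (InterruptedException e) { return; }
    }
  }

  private void abortRemainingFetches() {
    // Interrupt the live fetcher threads; each one records its current URL
    // with a temporarily-unavailable status (see below), not a hard failure.
  }
}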

URLs whose fetches were terminated in this manner would have their status set as if the page had returned a "temporarily unavailable" HTTP response, IIRC.

This worked pretty well, though we had to hack the httpclient lib because even when you interrupted a fetch, there was some cleanup code executed during a socket close that would try to empty the stream, and for some slow servers this would still cause the fetch to hang.
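For anyone curious, with the commons-httpclient 3.x API the relevant distinction looks roughly like this (a sketch, not the actual hack): method.abort() tears down the socket without draining the remaining response body, whereas releaseConnection() on a partially-read response tries to consume the rest of the stream so the connection can be reused - which is exactly the cleanup that can hang on a stalled server.

import java.util.Timer;
import java.util.TimerTask;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class AbortingFetch {

  // Fetch a URL, but give up after deadlineMs by aborting the method.
  // abort() closes the connection without trying to empty the remaining
  // response stream, so a stalled server can't hang the close.
  public static byte[] fetch(String url, long deadlineMs) throws Exception {
    HttpClient client = new HttpClient();
    final GetMethod get = new GetMethod(url);
    Timer timer = new Timer(true); // daemon timer thread
    timer.schedule(new TimerTask() {
      public void run() {
        get.abort(); // executeMethod() below will fail with an IOException
      }
    }, deadlineMs);
    try {
      client.executeMethod(get);
      return get.getResponseBody();
    } finally {
      timer.cancel();
      get.releaseConnection(); // after abort() there is nothing left to drain
    }
  }
}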

-- Ken


If others are seeing this behaviour, then I'm wondering whether anyone has thoughts about improving it, either before or after the 1.0 release? For instance, maybe things would be better with that HostDb and a Generator that knows not to produce fetchlists with lots of URLs from slow servers (a rough sketch of that idea follows the quoted message below)? Or maybe there is a way to keep feeding Fetchers with URLs from other sites, so their idle threads can be kept busy instead of sitting in spinWait status?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: Nutch User List <[EMAIL PROTECTED]>
> Sent: Monday, April 21, 2008 4:16:24 PM
> Subject: Fetching inefficiency
>
> Hello,
>
> I am wondering how others deal with the following, which I see as fetching
> inefficiency:
>
> When fetching, the fetchlist is broken up into multiple parts and fetchers on
> cluster nodes start fetching. Some fetchers end up fetching from fast servers,
> and some from very, very slow servers. Those fetching from slow servers take a
> long time to complete and prolong the whole fetching process. For instance,
> I've seen tasks from the same fetch job finish in only 1-2 hours, and others in
> 10 hours. Those taking 10 hours were stuck fetching pages from a single or a
> handful of slow sites. If you have two nodes doing the fetching and one is
> stuck with a slow server, the other one is idling and wasting time. The node
> stuck with the slow server is also underutilized, as it's slowly fetching from
> only 1 server instead of many.
>
> I imagine anyone using Nutch is seeing the same. If not, what's the trick?
>
> I have not tried overlapping fetch jobs yet, but I have a feeling that won't
> help a ton, plus it could lead to two fetchers fetching from the same server
> and being impolite - am I wrong?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
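To make the HostDb/Generator idea a bit more concrete, here is a purely illustrative sketch - HostStats and maxUrlsForHost() are hypothetical names, and Nutch's real Generator works quite differently - of capping the per-host URL count so one slow host can't dominate a fetch task:

import java.util.HashMap;
import java.util.Map;

public class SpeedAwareGenerator {

  public static class HostStats {
    double avgFetchMs;  // observed in previous fetch cycles
    int pendingUrls;
  }

  private final Map<String, HostStats> hostDb = new HashMap<String, HostStats>();
  private final long targetTaskMs; // how long a fetch task should run, e.g. 2h

  public SpeedAwareGenerator(long targetTaskMs) {
    this.targetTaskMs = targetTaskMs;
  }

  // Cap the per-host contribution so that even a serialized, polite fetch
  // from this one host cannot run longer than the target task duration.
  public int maxUrlsForHost(String host) {
    HostStats stats = hostDb.get(host);
    if (stats == null || stats.avgFetchMs <= 0) {
      return 50; // no history yet: a conservative default
    }
    return (int) Math.max(1L, (long) (targetTaskMs / stats.avgFetchMs));
  }
}

Under that cap, a host averaging 10 seconds per page would contribute at most ~720 URLs to a 2-hour task, while fast hosts would be limited only by politeness and fetchlist size.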


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
