Adding some comments to the email below, but here on nutch-dev. Basically, my feeling is that whenever fetchlists (and their parts) are not "well balanced", this inefficiency will show up. Concretely, whichever task gets stuck fetching from a slow server with a lot of its pages in the fetchlist will prolong the whole fetch job. A slow server with lots of pages is a bad combination, and I see that a lot. Perhaps it's the nature of my crawl - it is constrained, not web-wide, with the number of distinct hosts around 15-20K?

[snip]

Some questions: Are there ways around this? Are others not seeing the same behaviour? Is this just the nature of my crawl - constrained and with only 15-20K unique servers?

We often ran into the same problem while doing our vertical tech-pages crawl - a smaller number of unique hosts, and some really slow hosts dragging out the entire fetch cycle.

We added code that terminated slow fetches. After fooling around with some different approaches, I think we settled on terminating all remaining fetches once the number of active fetch threads dropped below a threshold derived from the total number of threads available. The ratio was set to 20% or so.
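Roughly, the shape of that idea as a minimal sketch (not the actual patch; FetchWatchdog, activeThreads, and abortRemainingFetches() are illustrative names):

import java.util.concurrent.atomic.AtomicInteger;

// Watchdog that cuts off the long tail of a fetch task: once fewer than
// minActiveRatio * totalThreads fetcher threads are still busy, the
// remaining fetches are terminated instead of being allowed to drag on.
public class FetchWatchdog implements Runnable {

  private final AtomicInteger activeThreads; // maintained by the fetcher threads
  private final int totalThreads;
  private final double minActiveRatio;       // e.g. 0.2 for "20% or so"

  public FetchWatchdog(AtomicInteger activeThreads, int totalThreads,
                       double minActiveRatio) {
    this.activeThreads = activeThreads;
    this.totalThreads = totalThreads;
    this.minActiveRatio = minActiveRatio;
  }

  public void run() {
    while (activeThreads.get() > 0) {
      if (activeThreads.get() < totalThreads * minActiveRatio) {
        abortRemainingFetches();
        return;
      }
      try { Thread.sleep(1000L); } catch (InterruptedException e) { return; }
    }
  }

  private void abortRemainingFetches() {
    // Interrupt the live fetcher threads; each one records its current URL
    // with a temporarily-unavailable status (see below), not a hard failure.
  }
}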

URLs whose fetches were terminated in this manner would have their status set as if the page had returned a "temporarily unavailable" HTTP response, IIRC.

This worked pretty well, though we had to hack the httpclient lib because even when you interrupted a fetch, there was some cleanup code executed during a socket close that would try to empty the stream, and for some slow servers this would still cause the fetch to hang.
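For anyone curious, with the commons-httpclient 3.x API the relevant distinction looks roughly like this (a sketch, not the actual hack): method.abort() tears down the socket without draining the remaining response body, whereas releaseConnection() on a partially-read response tries to consume the rest of the stream so the connection can be reused - which is exactly the cleanup that can hang on a stalled server.

import java.util.Timer;
import java.util.TimerTask;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class AbortingFetch {

  // Fetch a URL, but give up after deadlineMs by aborting the method.
  // abort() closes the connection without trying to empty the remaining
  // response stream, so a stalled server can't hang the close.
  public static byte[] fetch(String url, long deadlineMs) throws Exception {
    HttpClient client = new HttpClient();
    final GetMethod get = new GetMethod(url);
    Timer timer = new Timer(true); // daemon timer thread
    timer.schedule(new TimerTask() {
      public void run() {
        get.abort(); // executeMethod() below will fail with an IOException
      }
    }, deadlineMs);
    try {
      client.executeMethod(get);
      return get.getResponseBody();
    } finally {
      timer.cancel();
      get.releaseConnection(); // after abort() there is nothing left to drain
    }
  }
}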

-- Ken


If others are seeing this behaviour, then I'm wondering whether anyone has thoughts about improving it, either before or after the 1.0 release? For instance, maybe things would be better with that HostDb and a Generator that knows not to produce fetchlists with lots of URLs from slow servers (a rough sketch of that idea follows the quoted message below)? Or maybe there is a way to keep feeding Fetchers with URLs from other sites, so their idle threads can be kept busy instead of sitting in spinWait status?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: Nutch User List <[EMAIL PROTECTED]>
> Sent: Monday, April 21, 2008 4:16:24 PM
> Subject: Fetching inefficiency
>
> Hello,
>
> I am wondering how others deal with the following, which I see as fetching
> inefficiency:
>
> When fetching, the fetchlist is broken up into multiple parts and fetchers on
> cluster nodes start fetching. Some fetchers end up fetching from fast servers,
> and some from very, very slow servers. Those fetching from slow servers take a
> long time to complete and prolong the whole fetching process. For instance,
> I've seen tasks from the same fetch job finish in only 1-2 hours, and others in
> 10 hours. Those taking 10 hours were stuck fetching pages from a single or a
> handful of slow sites. If you have two nodes doing the fetching and one is
> stuck with a slow server, the other one is idling and wasting time. The node
> stuck with the slow server is also underutilized, as it's slowly fetching from
> only 1 server instead of many.
>
> I imagine anyone using Nutch is seeing the same. If not, what's the trick?
>
> I have not tried overlapping fetch jobs yet, but I have a feeling that won't
> help a ton, plus it could lead to two fetchers fetching from the same server
> and being impolite - am I wrong?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
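To make the HostDb/Generator idea a bit more concrete, here is a purely illustrative sketch - HostStats and maxUrlsForHost() are hypothetical names, and Nutch's real Generator works quite differently - of capping the per-host URL count so one slow host can't dominate a fetch task:

import java.util.HashMap;
import java.util.Map;

public class SpeedAwareGenerator {

  public static class HostStats {
    double avgFetchMs;  // observed in previous fetch cycles
    int pendingUrls;
  }

  private final Map<String, HostStats> hostDb = new HashMap<String, HostStats>();
  private final long targetTaskMs; // how long a fetch task should run, e.g. 2h

  public SpeedAwareGenerator(long targetTaskMs) {
    this.targetTaskMs = targetTaskMs;
  }

  // Cap the per-host contribution so that even a serialized, polite fetch
  // from this one host cannot run longer than the target task duration.
  public int maxUrlsForHost(String host) {
    HostStats stats = hostDb.get(host);
    if (stats == null || stats.avgFetchMs <= 0) {
      return 50; // no history yet: a conservative default
    }
    return (int) Math.max(1L, (long) (targetTaskMs / stats.avgFetchMs));
  }
}

Under that cap, a host averaging 10 seconds per page would contribute at most ~720 URLs to a 2-hour task, while fast hosts would be limited only by politeness and fetchlist size.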


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
