Update on fetch performance of my current run: download speed has been
stable at 3.8 pages/sec, about 640 kbps. This is probably limited by my
bandwidth: regular DSL service, which promises up to 1.5 Mbps inbound but
realistically delivers only about 640 kbps.

More than 1 million pages have been fetched, but that took several days at
the current speed, which is just too slow. I'm planning to get more
bandwidth. Could someone share their experience of what stable rate
(pages/sec) can be achieved on a 3 Mbps or 10 Mbps inbound connection?
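For a back-of-the-envelope answer, the numbers above already pin down the average page size, so you can extrapolate, assuming bandwidth stays the bottleneck and politeness delays don't dominate. A rough sketch:

```python
# Rough extrapolation of fetch rate from link speed, using the average
# page size implied by the figures above (640 kbps at 3.8 pages/sec,
# i.e. ~21 KB/page). This assumes the link stays saturated, which a
# focused crawl with per-host politeness delays often won't achieve.

AVG_PAGE_KBITS = 640 / 3.8  # ~168 kbits (~21 KB) per page

def pages_per_sec(inbound_kbps):
    """Estimated pages/sec if the inbound link is fully saturated."""
    return inbound_kbps / AVG_PAGE_KBITS

for link_kbps in (1500, 3000, 10000):
    print(f"{link_kbps / 1000:.1f} Mbps -> ~{pages_per_sec(link_kbps):.0f} pages/sec")
```

So on paper a 3 Mbps line buys roughly 18 pages/sec and 10 Mbps roughly 59, but real-world numbers (see below) tend to come in well under the link capacity.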

I'm currently getting about 6 pages/second, peaking at 4.8 MBps, with a fast machine (quad Xeon 3.0 GHz, 4 GB RAM, SCSI RAID 5 disks) and a big pipe.

I'm doing just fetching & parsing, no indexing, with a 10 MB max content size. Note that the max size can matter: we'd originally set it to no limit so that we wouldn't miss any PDFs. But this caused other problems. For example, the regex-urlfilters.txt file that ships with Nutch 0.7 doesn't exclude .bz2 files, so we were chewing up lots of bandwidth, and occasionally running out of memory, trying to download them.
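For anyone hitting the same problem, the fix is two-fold: extend the URL filter to skip large archive files, and cap the fetch size. A sketch of the kind of entries involved (the extension list here is our own and illustrative, not Nutch's default):

```
# regex URL filter rule: skip archives and other large binaries
-\.(bz2|gz|tgz|tar|zip|rar|iso)$
```

and in nutch-site.xml, capping any single fetch (here at 10 MB; I believe the property is http.content.limit, but check your Nutch version's nutch-default.xml):

```xml
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
```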

I think our pages/second would be higher, but I'm also getting lots of max-retry errors. We're doing a limited-domain crawl, so this gets hit a lot, because there are typically many URLs from the same host in our fetchlist. Things improved a bit when we randomized the URL list after doing the topN prune, so that it wasn't in (essentially) alphabetical order.
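The randomization step itself is trivial. A minimal sketch (treating the fetchlist as a plain list of URLs, which is a simplification, not Nutch's on-disk format):

```python
import random

def shuffle_fetchlist(urls, seed=None):
    """Return a shuffled copy of the fetchlist, so URLs from the same
    host are spread out instead of clumped in alphabetical runs that
    starve most fetcher threads."""
    rng = random.Random(seed)
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled

fetchlist = [
    "http://a.example/page1",
    "http://a.example/page2",
    "http://a.example/page3",
    "http://b.example/page1",
]
print(shuffle_fetchlist(fetchlist, seed=42))
```

Shuffling doesn't reduce the per-host politeness delay, it just interleaves hosts so that fewer threads sit idle waiting on the same server.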

When you use the FetchListTool to emit multiple lists, it intentionally divides up the list using the MD5 value for the link, so that you get hosts scattered between the lists. But for a single list, this doesn't happen, and thus the max threads/host value winds up causing a lot of the threads to spend their time idling, if your crawl (like mine) is focused.
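What the FetchListTool does for multiple lists can be sketched roughly like this: hash each link (MD5, as the tool uses) and take the hash modulo the number of lists to pick a bucket, so host runs get scattered. The function name and list-of-strings input below are made up for illustration; Nutch's actual partitioning details may differ.

```python
import hashlib

def partition_fetchlist(urls, num_lists):
    """Split a fetchlist into num_lists sub-lists keyed on the MD5 of
    each URL, so entries from the same host end up scattered across
    the lists rather than clumped in one of them."""
    lists = [[] for _ in range(num_lists)]
    for url in urls:
        digest = hashlib.md5(url.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_lists
        lists[bucket].append(url)
    return lists

parts = partition_fetchlist(
    [f"http://example.com/page{i}" for i in range(100)], 4
)
print([len(p) for p in parts])
```

With a single list there's no such scattering, which is why a focused crawl ends up with long same-host runs and idle threads.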

Also 500 threads sounds excessive to me, but I don't really know.

-- Ken


On 9/28/05, AJ Chen <[EMAIL PROTECTED]> wrote:

 I started the crawler with about 2000 sites. The fetcher could achieve
 7 pages/sec initially, but the performance gradually dropped to about 2
 pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
 and I used 500 threads. What are the main causes of this slowing down?
 Below is some sample status output:

 050927 005952 status: segment 20050927005922, 100 pages, 3 errors,
 1784615 bytes, 14611 ms
 050927 005952 status: 6.8441586 pages/s, 954.2334 kb/s, 17846.15bytes/page
 050927 010005 status: segment 20050927005922, 200 pages, 9 errors,
 3656863 bytes, 28170 ms
 050927 010005 status: 7.0997515 pages/s, 1014.1726 kb/s, 18284.314
 bytes/page

 after some time ...
 050927 171818 status: segment 20050927070752, 101400 pages, 7201 errors,
 2593026554 bytes, 36216316 ms
 050927 171818 status: 2.799843 pages/s, 559.3617 kb/s, 25572.254bytes/page
 050927 171832 status: segment 20050927070752, 101500 pages, 7204 errors,
 2595591632 bytes, 36230516 ms
 050927 171832 status: 2.8015058 pages/s, 559.6956 kb/s, 25572.332bytes/page

 Thanks,
 AJ

--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
