Re: [Nutch-general] Parallelizing URLFiltering

Enzo Michelangeli Thu, 31 May 2007 07:59:46 -0700

----- Original Message ----- 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Thursday, May 31, 2007 2:25 PM


> Are you running jobs in the "local" mode? In distributed mode filtering is 
> naturally parallel, because you have as many concurrent lookups as there 
> are map tasks.

I'm just using the vanilla (local) configuration. The situation is so bad 
that lately I'm seeing durations like:

generate: 2h 48' (-topN 20000)
fetch:    1h 40' (200 threads)
updatedb: 2h 20'

This is because both generate and updatedb perform filtering, and both are 
single-threaded. Before I enforced filtering on updatedb, that phase lasted 
only a few minutes. But if I don't filter in updatedb, the database gets 
polluted by URLs that will never be fetched.
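
To illustrate the kind of parallelism in question, here is a minimal sketch in modern Java, not Nutch code. The class name ParallelFilterSketch, the filterAll method and the thread count are hypothetical, and the filter is assumed to follow a "return null to reject" convention. Each URL is filtered on a fixed-size thread pool, so slow lookups overlap instead of running one after another:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFilterSketch {

    // Applies `filter` to every URL on a fixed-size thread pool and returns
    // the accepted URLs in input order. Assumption: the filter returns null
    // to reject a URL, and is safe to call from several threads at once.
    public static List<String> filterAll(List<String> urls,
                                         Function<String, String> filter,
                                         int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all URLs first so the lookups run concurrently.
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> filter.apply(url);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order, dropping rejected URLs.
            List<String> accepted = new ArrayList<>();
            for (Future<String> future : pending) {
                String result = future.get();
                if (result != null) {
                    accepted.add(result);
                }
            }
            return accepted;
        } finally {
            pool.shutdown();
        }
    }
}
```

This only helps if the filter spends most of its time waiting (for example on DNS) rather than computing, and if the filter implementation is thread-safe.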

> In my experience, using multiple threads for DNS lookup doesn't help that 
> much. What helps A LOT (like several orders of magnitude) is using a local 
> DNS cache, or even two-level DNS cache (one cache per node, one cache per 
> cluster).

I do have a local cache, but the problem is especially serious with negative 
responses, which are usually not cached (despite RFC 2308).
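
One possible workaround is to remember failed lookups at the application level. The sketch below is an assumption-laden illustration, not part of Nutch: the class NegativeDnsCacheSketch, its Resolver interface and the TTL value are all hypothetical. It records when a host failed to resolve and skips further lookups for that host until the TTL expires. (Separately, the JVM's `networkaddress.cache.negative.ttl` security property controls how long `InetAddress` itself remembers failures.)

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;

public class NegativeDnsCacheSketch {

    // Hypothetical resolver abstraction so the real lookup can be swapped or stubbed.
    public interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    // Host name -> time (epoch millis) until which the failure is remembered.
    private final ConcurrentHashMap<String, Long> failedUntil = new ConcurrentHashMap<>();
    private final Resolver resolver;
    private final long negativeTtlMillis;

    public NegativeDnsCacheSketch(Resolver resolver, long negativeTtlMillis) {
        this.resolver = resolver;
        this.negativeTtlMillis = negativeTtlMillis;
    }

    // Returns the address, or null if the host failed to resolve now
    // or failed recently enough that the negative entry is still valid.
    public InetAddress lookup(String host) {
        long now = System.currentTimeMillis();
        Long expiry = failedUntil.get(host);
        if (expiry != null) {
            if (now < expiry) {
                return null; // cached negative answer, no lookup issued
            }
            failedUntil.remove(host); // entry expired, try again
        }
        try {
            return resolver.resolve(host);
        } catch (UnknownHostException e) {
            failedUntil.put(host, now + negativeTtlMillis);
            return null;
        }
    }
}
```

With the real resolver this could be constructed as `new NegativeDnsCacheSketch(InetAddress::getByName, 3_600_000L)`; the one-hour TTL is an arbitrary choice for illustration.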

Enzo


