----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Thursday, May 31, 2007 2:25 PM

> Are you running jobs in the "local" mode? In distributed mode filtering is
> naturally parallel, because you have as many concurrent lookups as there
> are map tasks.

I'm just using the vanilla (local) configuration. The situation is so bad that lately I'm seeing durations like:

generate: 2h 48' (-topN 20000)
fetch:    1h 40' (200 threads)
updatedb: 2h 20'

This is because both generate and updatedb perform filtering, and both are single-threaded. Before I enforced filtering on updatedb, that phase lasted only a few minutes. But if I don't filter in updatedb, the database gets polluted by URLs that will never be fetched.

> In my experience, using multiple threads for DNS lookup doesn't help that
> much. What helps A LOT (like several orders of magnitude) is using a local
> DNS cache, or even two-level DNS cache (one cache per node, one cache per
> cluster).

I do have a local cache, but the problem is especially serious with negative responses, which are usually not cached (despite RFC 2308).

Enzo

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
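
To illustrate the point about negative responses, here is a minimal sketch of an application-level negative cache. It is not Nutch code: the class name `NegativeDnsCache`, the `negative_ttl` parameter and the injectable `resolver` and `clock` arguments are all hypothetical, chosen only for this example. The idea is simply to remember hostnames that failed to resolve, so that repeated filtering passes do not pay for the same failed lookup again until the entry expires.

```python
import socket
import time


class NegativeDnsCache:
    """Remembers failed hostname lookups for a limited time (hypothetical helper)."""

    def __init__(self, negative_ttl=3600.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.negative_ttl = negative_ttl  # seconds to remember a failure (assumed value)
        self._resolver = resolver         # callable: hostname -> IP string, raises OSError
        self._clock = clock               # callable returning the current time in seconds
        self._failed = {}                 # lowercased hostname -> expiry time

    def resolve(self, host):
        """Return an IP address string, or None if the host does not resolve."""
        key = host.lower()
        now = self._clock()
        expires = self._failed.get(key)
        if expires is not None:
            if now < expires:
                return None               # cached negative answer, no lookup issued
            del self._failed[key]         # entry expired, try the resolver again
        try:
            return self._resolver(key)
        except OSError:                   # socket.gaierror and socket.herror are OSErrors
            self._failed[key] = now + self.negative_ttl
            return None
```

A URL filter could call `resolve()` on each hostname and drop the URL when it returns None. Successful lookups are deliberately not cached here, on the assumption that the local DNS cache already handles them.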