Enzo Michelangeli wrote:
> I'm just using the vanilla (local) configuration. The situation is so
> bad that lately I'm seeing durations like:
>
> generate: 2h 48' (-topN 20000)
> fetch: 1h 40' (200 threads)
> updatedb: 2h 20'
>
> This is because both generate and updatedb perform filtering, and are
> single-threaded. Before I enforced filtering on updatedb, that phase
> lasted only a few minutes. But if I don't filter in updatedb, the
> database gets polluted by URLs that will never be fetched.
Caching seems to be the only solution. Even if you were able to fire DNS
requests more rapidly, remote servers wouldn't be able (or wouldn't like)
to respond that quickly ...

>
>> In my experience, using multiple threads for DNS lookup doesn't help
>> that much. What helps A LOT (like several orders of magnitude) is
>> using a local DNS cache, or even a two-level DNS cache (one cache per
>> node, one cache per cluster).
>
> I do have a local cache, but the problem is especially serious with
> negative responses, which are usually not cached (despite RFC2308).

Which DNS cache implementation are you using? I've had positive
experience with the djbdns / tinydns package, with some modifications to
increase the number of concurrent requests and the cache size. This was
on Linux, though - I have no idea how to do this on Windows.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
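
For illustration only, here is a rough Python sketch of the idea discussed above: a per-process DNS cache that also remembers failed lookups, so repeated queries for dead hosts don't hit the resolver again. This is not Nutch code and not how djbdns works internally. The class name CachingResolver, its parameters, and the fixed TTL values are all made up for the example. A real cache would normally take TTLs from the DNS records themselves (RFC 2308 derives the negative TTL from the zone's SOA record), which socket.gethostbyname does not expose.

```python
import socket
import time


class CachingResolver:
    """Hypothetical in-process DNS cache that also caches failed lookups."""

    def __init__(self, lookup=socket.gethostbyname, positive_ttl=3600.0,
                 negative_ttl=300.0, clock=time.monotonic):
        # The TTL defaults are arbitrary assumptions for this sketch.
        self._lookup = lookup
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._cache = {}  # host -> (expiry time, address or None)

    def resolve(self, host):
        """Return the address for host, or None if the lookup failed."""
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[0] > now:
            # Cache hit: may be a positive or a negative (None) answer.
            return entry[1]
        try:
            address = self._lookup(host)
            ttl = self._positive_ttl
        except OSError:
            # socket.gaierror is a subclass of OSError; remember the
            # failure for a shorter time than a successful answer.
            address = None
            ttl = self._negative_ttl
        self._cache[host] = (now + ttl, address)
        return address
```

A crawler would call resolver.resolve(host) before fetching and skip the URL when it returns None. The lookup and clock arguments exist only so the cache can be exercised without real network traffic.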