Enzo Michelangeli wrote:

> I'm just using the vanilla (local) configuration. The situation is so 
> bad that lately I'm seeing durations like:
> generate: 2h 48' (-topN 20000)
> fetch:    1h 40' (200 threads)
> updatedb: 2h 20'
> This is because both generate and updatedb perform filtering and are 
> single-threaded. Before I enforced filtering on updatedb, that phase 
> lasted only a few minutes. But if I don't filter in updatedb, the database 
> gets polluted by URLs that will never be fetched.

Caching seems to be the only solution. Even if you were able to fire DNS 
requests more rapidly, remote servers wouldn't be able (or willing) to 
respond that quickly ...

>> In my experience, using multiple threads for DNS lookup doesn't help 
>> that much. What helps A LOT (like several orders of magnitude) is 
>> using a local DNS cache, or even two-level DNS cache (one cache per 
>> node, one cache per cluster).
> I do have a local cache, but the problem is especially serious with 
> negative responses, which are usually not cached (despite RFC 2308).

Which DNS cache implementation are you using? I've had positive 
experience with the djbdns / tinydns package, with some modifications to 
increase the number of concurrent requests and the cache size. This was 
on Linux, though - I have no idea how to do this on Windows.
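
To illustrate the negative-caching point, here is a minimal sketch of an 
in-process cache that remembers failed lookups as well as successful ones. 
It is not what Nutch or djbdns does internally; the class name 
CachingResolver, the injectable lookup function, and the fixed TTL values 
are all assumptions made for the example. A real resolver would take TTLs 
from the DNS records themselves (RFC 2308 derives the negative TTL from the 
SOA record), which socket.gethostbyname does not expose.

```python
import socket
import time


class CachingResolver:
    """In-process DNS cache that also remembers failed lookups.

    Hypothetical helper for illustration; TTLs are fixed assumptions,
    not values taken from the DNS responses.
    """

    def __init__(self, lookup=socket.gethostbyname,
                 positive_ttl=3600.0, negative_ttl=300.0,
                 clock=time.monotonic):
        self._lookup = lookup              # function: host -> address, raises OSError
        self._positive_ttl = positive_ttl  # seconds to keep successful answers
        self._negative_ttl = negative_ttl  # seconds to keep failures
        self._clock = clock
        self._cache = {}                   # host -> (expiry time, address or None)

    def resolve(self, host):
        """Return the address for host, or None if the lookup failed."""
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[0] > now:
            # Still fresh: serve from cache, including cached failures.
            return entry[1]
        try:
            address = self._lookup(host)
            ttl = self._positive_ttl
        except OSError:
            # socket.gaierror is a subclass of OSError; cache the failure
            # so the same dead host is not looked up again immediately.
            address = None
            ttl = self._negative_ttl
        self._cache[host] = (now + ttl, address)
        return address
```

With something like this in front of the URL filter, a host that fails to 
resolve costs one lookup per negative_ttl interval instead of one lookup 
per URL. It only helps within a single process, so it complements rather 
than replaces a per-node or per-cluster DNS cache.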

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Nutch-general mailing list
