Re: [Nutch-general] Parallelizing URLFiltering

2007-06-01 Thread Enzo Michelangeli
- Original Message - From: Dennis Kubes [EMAIL PROTECTED] Sent: Friday, June 01, 2007 12:44 PM [...] We are also using BIND and our current index is 52,519,267 pages so you should be fine with this. I think djbdns is just easier to use. Are you using any big DNS caches as backups?

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 2:25 PM Are you running jobs in the local mode? In distributed mode filtering is naturally parallel, because you have as many concurrent lookups as there are map tasks. I'm just using the

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Dennis Kubes
We set up an /etc/resolv.conf configuration as shown below. This lets us check locally first, then two of the major DNS caches on the internet, before requesting it through a local DNS caching server. The 208 addresses are OpenDNS servers and the 4.x addresses are Verizon DNS servers. All

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Andrzej Bialecki
Enzo Michelangeli wrote: I'm just using the vanilla (local) configuration. The situation is so bad that lately I'm seeing durations like: generate: 2h 48' (-topN 2) fetch: 1h 40' (200 threads) updatedb: 2h 20' This is because both generate and updatedb perform filtering, and are

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to) respond that quickly ... Then why
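
To illustrate the caching suggestion quoted above, here is a minimal sketch of an in-process lookup cache that a DNS-dependent filter could consult before hitting the resolver. It is not code from this thread or from Nutch: the class name CachingResolverSketch and its methods are hypothetical, and the actual lookup is passed in as a function so the sketch makes no assumption about which resolver is used. It assumes Java 8 or later.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch: memoizes host lookups in memory.
class CachingResolverSketch {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> lookup;
    private final AtomicInteger misses = new AtomicInteger();

    // 'lookup' is whatever actually resolves a host name (an assumption of this sketch).
    CachingResolverSketch(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    // Returns the cached answer for the host, calling the lookup only on a miss.
    // Host names are lower-cased so "Example.ORG" and "example.org" share one entry.
    // If the lookup returns null, nothing is stored and null is returned.
    String resolve(String host) {
        return cache.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> {
            misses.incrementAndGet();
            return lookup.apply(h);
        });
    }

    // Number of times the underlying lookup was actually invoked.
    int misses() {
        return misses.get();
    }
}
```

A real lookup function might wrap java.net.InetAddress.getByName(host).getHostAddress() and convert UnknownHostException into a null result. Note that the sketch never expires entries, and that a slow lookup inside computeIfAbsent can delay other threads updating nearby keys, so a production version would likely need a TTL and a different locking strategy.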

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Ken Krugler
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to) respond that quickly ... Then why is

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Dennis Kubes
Enzo Michelangeli wrote: - Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to)

[Nutch-general] Parallelizing URLFiltering

2007-05-30 Thread Enzo Michelangeli
Is there a way of parallelizing URLFiltering over multiple threads? After all, the URLFilters themselves must already be thread-safe, or else they would have problems during fetching. The reason I'm asking is that I have a custom URLFilter that needs to make calls to the DNS resolver, and
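
As a rough illustration of the question, the sketch below runs a thread-safe filter over a list of URLs with a fixed thread pool, so that slow per-URL work such as DNS lookups can overlap. It is a standalone sketch, not how Nutch's generate or updatedb steps are implemented: the class ParallelFilterSketch and its filterAll method are hypothetical, and the filter is modeled as a plain function that returns the URL to keep it or null to reject it, following the contract of URLFilter.filter(String). It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Nutch.
class ParallelFilterSketch {

    // Applies 'filter' to every URL using 'threads' worker threads.
    // The filter must be thread-safe; it returns the URL to keep it or null to drop it.
    static List<String> filterAll(List<String> urls,
                                  UnaryOperator<String> filter,
                                  int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> filter.apply(url));
            }
            List<String> kept = new ArrayList<>();
            // invokeAll waits for all tasks and returns futures in task order,
            // so the surviving URLs keep their original relative order.
            for (Future<String> result : pool.invokeAll(tasks)) {
                String url = result.get();
                if (url != null) {
                    kept.add(url);
                }
            }
            return kept;
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, filterAll(urls, u -> u.contains(".example") ? u : null, 4) would keep only URLs containing ".example", evaluating up to four URLs at a time. With a DNS-bound filter the useful thread count depends on how many concurrent queries the resolver tolerates.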