Hi folks,

A few weeks ago, I decided to create a Nutch extension that would
allow one to crawl URLs only within a certain geographic area. It
could be handy for a Canadian to build a Nutch setup that crawls all
Canadian sites, including the .com and .orgs. Or, since I'm in New
York, I'd like to search local content in the NYC area w/o needing the
disk space to crawl the entire web.

One way to do this is with an IP-to-location lookup, using something
like the MaxMind.com GeoIP database. The free version resolves to the
country level, and the paid versions resolve down to the metro area.
So I implemented a subclass of net.nutch.net.RegexURLFilter that does
this (see attached).

The result, IPRegexURLFilter, works as advertised: it filters by regex
*and* country-netblock. It's also very, very slow. The reason is quite
simple. To do an IP-to-country lookup from a URL, I first have to do a
DNS lookup on the hostname, which has high latency. So the
single-threaded sections of code that call URLFilter.filter()
implementations spend most of their time waiting for the lookup to
complete.
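
To make the bottleneck concrete, here is a minimal sketch of this kind
of filter. It is not the attached code: GeoFilterSketch, Resolver and
CountryLookup are hypothetical names, CountryLookup stands in for a
GeoIP database and is not the MaxMind API, and the "return the URL to
accept, null to reject" behaviour is assumed to match what
URLFilter.filter() expects.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Set;

public class GeoFilterSketch {

    /** Hypothetical stand-in for a GeoIP database; not the MaxMind API. */
    interface CountryLookup {
        String countryCode(InetAddress address);
    }

    /** Hypothetical hook that makes the blocking DNS call explicit. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final Resolver resolver;
    private final CountryLookup lookup;
    private final Set<String> allowedCountries;

    public GeoFilterSketch(Resolver resolver, CountryLookup lookup,
                           Set<String> allowedCountries) {
        this.resolver = resolver;
        this.lookup = lookup;
        this.allowedCountries = allowedCountries;
    }

    /** Returns the URL if its host maps to an allowed country, else null. */
    public String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null; // no hostname to resolve, so reject
            }
            // Blocking DNS lookup: this is where a single-threaded
            // caller spends most of its time.
            InetAddress address = resolver.resolve(host);
            String country = lookup.countryCode(address);
            return country != null && allowedCountries.contains(country)
                    ? url : null;
        } catch (IllegalArgumentException | UnknownHostException e) {
            return null; // malformed URL or unresolvable host
        }
    }
}
```

In real use the Resolver would be InetAddress::getByName, which is the
call that blocks on the network.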

My instincts tell me there are two ways to improve this situation:
1) Move the IP-based filter into the multithreaded parts of Fetcher,
e.g. FetcherThread
2) Or, push it all the way down to where the Fetcher does its own DNS
lookup, so we eliminate duplicate lookups for each non-filtered URL

(2) would require hooking into each Protocol implementation that deals
with hostnames, e.g. protocol-http AND protocol-ftp. That seems like a
bad idea. Considering that the JVM will cache DNS requests, perhaps
it's not worth going this far to eliminate the double-lookup.
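
On the caching point: the JVM's InetAddress cache is controlled by the
networkaddress.cache.ttl security property (seconds to keep successful
lookups). A rough sketch of setting it follows; DnsCacheConfig is a
hypothetical helper name, and the value should be set early at
startup, before the first lookup, since I would not rely on later
changes being picked up.

```java
import java.security.Security;

public class DnsCacheConfig {

    /**
     * Sets how long, in seconds, the JVM caches successful DNS lookups.
     * Call this at startup, before any InetAddress lookup happens.
     */
    public static void cacheSuccessfulLookupsFor(int seconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(seconds));
    }
}
```

With a long enough TTL, the second lookup of a hostname (the one the
protocol plugin does) should be served from the cache rather than the
network.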

So, if (1) is a better course of action, I would need to hook into
FetcherThread.run() and run a filter before the call to
protocol.getContent(url).

What's the best way to achieve this? More importantly, what's the
Nutch way? Since FetcherThread is an inner class, subclassing it isn't
the answer. A delegate of some kind seems more appropriate. Perhaps
Fetcher could gain a URLFilter ivar, which if not null, FetcherThread
calls before protocol.getContent(url)?
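
To show the shape of what I mean, here is a self-contained sketch of
the delegate idea. None of this is Nutch code: FetcherSketch,
UrlFilter and ContentSource are hypothetical stand-ins for Fetcher,
URLFilter and the protocol plugin, and the queue handling is
simplified. The only point is that the optional filter runs inside
each worker thread, right before the fetch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class FetcherSketch {

    /** Hypothetical stand-in for URLFilter: null means "reject". */
    interface UrlFilter {
        String filter(String url);
    }

    /** Hypothetical stand-in for protocol.getContent(url). */
    interface ContentSource {
        String getContent(String url);
    }

    private final Queue<String> urls;
    private final ContentSource protocol;
    private final List<String> fetched = new CopyOnWriteArrayList<>();
    private volatile UrlFilter urlFilter; // optional; null = no filtering

    public FetcherSketch(List<String> urls, ContentSource protocol) {
        this.urls = new ConcurrentLinkedQueue<>(urls);
        this.protocol = protocol;
    }

    public void setUrlFilter(UrlFilter urlFilter) {
        this.urlFilter = urlFilter;
    }

    private class FetcherThread extends Thread {
        @Override
        public void run() {
            String url;
            while ((url = urls.poll()) != null) {
                UrlFilter f = urlFilter;
                // Slow filters (e.g. DNS-based) block only this thread.
                if (f != null && f.filter(url) == null) {
                    continue; // rejected: skip the fetch entirely
                }
                fetched.add(protocol.getContent(url));
            }
        }
    }

    /** Runs the worker threads to completion and returns what was fetched. */
    public List<String> fetchAll(int threadCount) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new FetcherThread();
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while fetching", e);
            }
        }
        return fetched;
    }
}
```

Leaving the filter null keeps today's behaviour, so existing setups
would be unaffected.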

I think this would be a generally-useful extension to the crawler, and
am willing to write it & submit as a patch.

Nutch committers, what do you think?

(ps: I don't work for MaxMind, I just think their product is useful.
The DB access API and GeoIP Free DB are both GPL'd)

--Matt

Attachment: ipregexurlfilter.java
Description: Binary data
