Hi folks,

A few weeks ago, I decided to create a Nutch extension that would allow one to crawl URLs only within a certain geographic area. It could be handy for a Canadian to build a Nutch setup that crawls all Canadian sites, including the .com and .orgs. Or, since I'm in New York, I'd like to search local content in the NYC area w/o needing the disk space to crawl the entire web.
One way to do this is an IP-to-location lookup, using something like the MaxMind.com GeoIP database. The free version resolves to the country level, and the paid versions resolve down to the metro area. So I implemented a subclass of net.nutch.net.RegexURLFilter that does this (see attached).

The result, IPRegexURLFilter, works as advertised: it filters by regex *and* country-netblock. It's also very, very slow. The reason is quite simple. To do an IP-to-country lookup from a URL, I first have to do a DNS lookup on the hostname, which has high latency. So the single-threaded sections of code that call URLFilter.filter() implementations spend most of their time waiting for the lookup to complete.

My instincts tell me there are two ways to improve this situation:

1) Move the IP-based filter into the multithreaded parts of Fetcher, e.g. FetcherThread
2) Or, push it all the way down to where the Fetcher does its own DNS lookup, so we eliminate duplicate lookups for each non-filtered URL

(2) would require hooking into each Protocol implementation that deals with hostnames, e.g. protocol-http AND protocol-ftp. That seems like a bad idea. Considering that the JVM will cache DNS requests, perhaps it's not worth going this far to eliminate the double lookup.

So, if (1) is the better course of action, I would need to hook into FetcherThread.run() and run a filter before the call to protocol.getContent(url). What's the best way to achieve this? More importantly, what's the Nutch way? Since FetcherThread is an inner class, subclassing it isn't the answer. A delegate of some kind seems more appropriate. Perhaps Fetcher could gain a URLFilter ivar, which, if not null, FetcherThread calls before protocol.getContent(url)?

I think this would be a generally useful extension to the crawler, and I am willing to write it & submit it as a patch. Nutch committers, what do you think?

(ps: I don't work for MaxMind, I just think their product is useful. The DB access API and GeoIP Free DB are both GPL'd)

--Matt
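To make option (1) a bit more concrete, here is a rough, self-contained sketch of the delegate idea. It is not Nutch code: UrlFilter, CountryLookup, CountryUrlFilter and FetcherSketch are made-up stand-ins, and I'm assuming a filter() contract of "return the URL to keep it, null to drop it". The point is only that the slow lookup runs inside the worker threads, right where the real FetcherThread would call protocol.getContent(url).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in for a URL filter (assumed contract: return the URL to keep it, null to drop it).
interface UrlFilter {
    String filter(String url);
}

// Hypothetical hook for the slow part: DNS lookup plus IP-to-country resolution.
interface CountryLookup {
    String countryOf(String host) throws Exception;
}

// Keeps only URLs whose host resolves to the wanted country code.
final class CountryUrlFilter implements UrlFilter {
    private final CountryLookup lookup;
    private final String wantedCountry;

    CountryUrlFilter(CountryLookup lookup, String wantedCountry) {
        this.lookup = lookup;
        this.wantedCountry = wantedCountry;
    }

    @Override
    public String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            return wantedCountry.equals(lookup.countryOf(host)) ? url : null;
        } catch (Exception e) {
            return null; // malformed URL or failed lookup: drop it
        }
    }
}

// Toy fetcher: the optional filter runs inside the worker threads,
// so slow lookups overlap instead of queueing up in one thread.
final class FetcherSketch {
    private final UrlFilter fetchTimeFilter; // may be null, meaning "no extra filtering"
    private final int threads;

    FetcherSketch(UrlFilter fetchTimeFilter, int threads) {
        this.fetchTimeFilter = fetchTimeFilter;
        this.threads = threads;
    }

    // Returns the URLs that passed the filter, in input order.
    List<String> run(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () ->
                        fetchTimeFilter == null ? url : fetchTimeFilter.filter(url);
                pending.add(pool.submit(task));
            }
            List<String> accepted = new ArrayList<>();
            for (Future<String> f : pending) {
                String url = f.get();
                if (url != null) {
                    accepted.add(url); // the real thread would fetch the content here
                }
            }
            return accepted;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With a null filter the sketch behaves like an unfiltered fetcher, so existing setups would be unaffected. A real CountryLookup would wrap the GeoIP database call, and the JVM's DNS cache should then keep the later lookup for the fetch itself cheap.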
ipregexurlfilter.java
Description: Binary data
