Chris Schneider wrote:
My experience recently seeing attempted fetches of many ingrida.be URLs
made me question the Nutch 0.8 algorithm for partitioning URLs among
TaskTrackers (and their child processes). As I understand it, Nutch
doesn't worry about two lexically distinct domains (e.g.,
inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being fetched
simultaneously, even though they might actually resolve to the same IP
address (66.154.11.25 in this case).
That is correct: Nutch 0.8 currently treats each lexically distinct
host name as a separate domain, even when several of them resolve to
the same IP address. IP-based partitioning is possible: one would
merely need to change PartitionUrlByHost.java to hash the IP address of
the host instead of its name. If the DNS lookups make this too slow, we
could cache the IP address in the CrawlDatum, which is available when we
are performing this partitioning. But one should probably run a caching
DNS server when fetching anyway, so hopefully that would not be required.
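
To illustrate the idea outside of Nutch, here is a minimal standalone
sketch of partitioning by resolved IP with a small in-process cache. It
is not the patch and does not use the CrawlDatum: the class
IpPartitionSketch, its Resolver hook, and the getPartition signature are
all hypothetical names made up for this example, and the in-memory map
is just one possible stand-in for a caching DNS server.

```java
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not a Nutch class): partitions URLs by resolved IP. */
class IpPartitionSketch {

  /** Lookup hook (an assumption of this sketch) so DNS can be stubbed out. */
  interface Resolver {
    String resolve(String host) throws UnknownHostException;
  }

  private final Resolver resolver;
  private final int seed;
  // host name -> IP string; unresolvable hosts are cached as themselves
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  IpPartitionSketch(Resolver resolver, int seed) {
    this.resolver = resolver;
    this.seed = seed;
  }

  /** A resolver that performs a real lookup through java.net.InetAddress. */
  static Resolver dns() {
    return host -> InetAddress.getByName(host).getHostAddress();
  }

  /** Returns a partition in [0, numPartitions) for the given URL. */
  int getPartition(String urlString, int numPartitions) {
    String key;
    try {
      String host = new URL(urlString).getHost();
      // resolve each distinct host name at most once per instance
      key = cache.computeIfAbsent(host, this::lookup);
    } catch (MalformedURLException e) {
      key = urlString; // unparseable URL: fall back to hashing the raw string
    }
    // mix in the seed so hosts land in different partitions on different runs
    int hashCode = key.hashCode() ^ seed;
    return (hashCode & Integer.MAX_VALUE) % numPartitions;
  }

  private String lookup(String host) {
    try {
      return resolver.resolve(host);
    } catch (UnknownHostException e) {
      return host; // no IP found: fall back to the host name
    }
  }
}
```

With a resolver that maps both inherit-the-wind.ingrida.be and
clancy-brown.ingrida.be to 66.154.11.25, the two URLs get the same
partition number, so they would be handled together rather than fetched
simultaneously from different partitions.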
I've attached a patch. Tell me if it works and if it noticeably slows
fetching for you.
Doug
Index: src/java/org/apache/nutch/crawl/PartitionUrlByHost.java
===================================================================
--- src/java/org/apache/nutch/crawl/PartitionUrlByHost.java (revision 379848)
+++ src/java/org/apache/nutch/crawl/PartitionUrlByHost.java (working copy)
@@ -17,6 +17,8 @@
package org.apache.nutch.crawl;
import java.net.URL;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
import java.net.MalformedURLException;
import org.apache.hadoop.io.*;
@@ -41,8 +43,22 @@
url = new URL(urlString);
} catch (MalformedURLException e) {
}
- int hashCode = (url==null ? urlString : url.getHost()).hashCode();
+ int hashCode;
+
+ if (url == null) {
+ hashCode = urlString.hashCode();
+ } else {
+ String host = url.getHost();
+ try {
+ InetAddress addr = InetAddress.getByName(host);
+ hashCode = addr.hashCode();
+ } catch (UnknownHostException e) {
+ Generator.LOG.info("Couldn't find IP for host: " + host);
+ hashCode = host.hashCode();
+ }
+ }
+
// make hosts wind up in different partitions on different runs
hashCode ^= seed;
@@ -50,5 +66,3 @@
}
}
-
-