[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646504#comment-14646504 ]
Markus Jelsma commented on NUTCH-2069:
--------------------------------------

Fine with the feature, but there's a lot of clutter in the patch. Are you not happy with the code restyle Lewis did? I am not sure I see a lot of use for, e.g.:

{code}
@@ -678,19 +697,18 @@
   // Check whether we'll follow external outlinks
   if (outlinksIgnoreExternal) {
-    if (!URLUtil.getHost(url.toString()).equals(
-        URLUtil.getHost(followUrl))) {
+    if (!URLUtil.getHost(url.toString())
+        .equals(URLUtil.getHost(followUrl))) {
       continue;
     }
   }
 
-  reporter
-      .incrCounter("FetcherOutlinks", "outlinks_following", 1);
+  reporter.incrCounter("FetcherOutlinks", "outlinks_following", 1);
 
   // Create new FetchItem with depth incremented
   FetchItem fit = FetchItem.create(new Text(followUrl),
-      new CrawlDatum(CrawlDatum.STATUS_LINKED, interval),
-      queueMode, outlinkDepth + 1);
+      new CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode,
+      outlinkDepth + 1);
   ((FetchItemQueues) fetchQueues).addFetchItem(fit);
 
   outlinkCounter++;
{code}

And besides, this would force me to completely rewrite some patches as well, which I already had to do because of the code style change ;)

> Ignore external links based on domain
> -------------------------------------
>
>                 Key: NUTCH-2069
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2069
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, parser
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>             Fix For: 1.11
>
>         Attachments: NUTCH-2069.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of
> restricting the crawl based on the hostname. This adds a new parameter
> 'db.ignore.external.links.domain' to do the same based on the domain.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)