[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018413#comment-15018413 ]

Hudson commented on NUTCH-2069:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3313 (See [https://builds.apache.org/job/Nutch-trunk/3313/])
NUTCH-2069 (jnioche: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1715386])
* trunk/CHANGES.txt
* trunk/conf/nutch-default.xml
* trunk/src/java/org/apache/nutch/fetcher/FetcherThread.java
* trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

> Ignore external links based on domain
> -------------------------------------
>
>                 Key: NUTCH-2069
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2069
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, parser
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>             Fix For: 1.11
>
>         Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of
> restricting the crawl based on the hostname. This adds a new parameter
> 'db.ignore.external.links.domain' to do the same based on the domain.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018232#comment-15018232 ]

Julien Nioche commented on NUTCH-2069:
--------------------------------------

No probs. It would be good to find a way to format the code based on the Eclipse XML config with an Ant task. There is a way to do it with Maven, but I haven't seen one for Ant.
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15015387#comment-15015387 ]

Lewis John McGibbney commented on NUTCH-2069:
---------------------------------------------

+1 for the patch. Sorry about the formatting, folks. We can run the code formatter over trunk before we release ;)
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013616#comment-15013616 ]

Markus Jelsma commented on NUTCH-2069:
--------------------------------------

Ah, I see it now indeed. +1 for this patch.
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013545#comment-15013545 ]

Julien Nioche commented on NUTCH-2069:
--------------------------------------

> I propose the modes to be named just 'host' and 'domain', as they are
> elsewhere.

Not really, see fetcher.queue.mode and partition.url.mode
[https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L723]

This issue is not about fixing existing discrepancies; that should be addressed separately.

As for mixing bydomain and byDomain, we do that only when comparing the strings:

{code}
if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode))
{code}

Changing this to "byDomain" won't make any difference, because the comparison is case-insensitive, but feel free to change it if you feel strongly about it.
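For readers following the thread, here is a minimal Java sketch of why the spelling of the literal does not matter in the comparison quoted in the comment above. The class `ModeComparison` and the method `isDomainMode` are hypothetical names used only for illustration; only the literal "bydomain" and the variable name `ignoreExternalLinksMode` come from the comment.

```java
// Hypothetical demo class; only the comparison itself is taken from the thread.
class ModeComparison {

  // Returns true when the configured mode selects domain-based filtering.
  static boolean isDomainMode(String ignoreExternalLinksMode) {
    // The literal is on the left, so a null mode yields false
    // instead of throwing a NullPointerException.
    return "bydomain".equalsIgnoreCase(ignoreExternalLinksMode);
  }

  public static void main(String[] args) {
    String[] modes = {"bydomain", "byDomain", "BYDOMAIN", "byHost", null};
    for (String mode : modes) {
      System.out.println(mode + " -> " + isDomainMode(mode));
    }
  }
}
```

Any capitalization of the configured value matches, so the only place the two spellings could diverge is in documentation or default values, not in behavior.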
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013513#comment-15013513 ]

Markus Jelsma commented on NUTCH-2069:
--------------------------------------

Hi - looks good. One suggestion though. The patch mixes up bydomain and byDomain, so this won't work well. I propose the modes to be named just 'host' and 'domain', as they are elsewhere.

M.
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012193#comment-15012193 ]

Markus Jelsma commented on NUTCH-2069:
--------------------------------------

Hi J - I agree with the mode! Have it defaulted so that it never breaks older instances and doesn't allow excluding both. Your follow-up patch is probably spot on; have you got one? It can still make it into 1.11!

M.
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467 ]

Julien Nioche commented on NUTCH-2069:
--------------------------------------

Hi [~wastl-nagel] and [~markus17].

BTW, I did not mean to be short in my previous message, but I was typing from my phone ;-) I know the difficulties of enforcing the code formatting systematically, but I thought I might as well fix it while I was working on that part of the code. Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` + `db.ignore.external.links.mode`. The latter can be "host" or "domain", similar to other properties (partition.url.mode, generator.count.mode, fetcher.queue.mode). That would be extensible and can make the code leaner.

Yes, that would be more elegant.

I am on vacation for the next few weeks as of today. I will update the code based on your suggestion when I am back, unless one of you beats me to it of course.

J.
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646793#comment-14646793 ]

Sebastian Nagel commented on NUTCH-2069:
----------------------------------------

You're right, Julien. The code in FetcherThread does not follow the style. The code formatting patch (NUTCH-865) is now 80 issues back in the history, Fetcher has been refactored meanwhile, and not all commits are following the style. It's often hard to resist :) and not to correct the style, re-organize imports, etc., so that patches are lean and easy to review.

But back to the main topic: +1, so far.

One point: 'db.ignore.external.links' and the new 'db.ignore.external.links.domain' are mutually exclusive; "external" is defined either by host or by domain. This should show up in the code:

{code}
if (ignoreExternalLinks) {
  ...
} else if (ignoreLinksOutsideDomain) {
  ...
}
{code}

Or we could define this as two properties `db.ignore.external.links` + `db.ignore.external.links.mode`. The latter can be "host" or "domain", similar to other properties (partition.url.mode, generator.count.mode, fetcher.queue.mode). That would be extensible and can make the code leaner.

Btw., good idea to add the formatter to 1.x as well, and if possible automatically add it to the Eclipse project created by "ant eclipse".
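As an illustration of the two-property idea in the comment above (one boolean switch plus a mode of "host" or "domain"), here is a minimal, self-contained Java sketch. It is not the Nutch implementation: `ExternalLinkCheck`, `hostOf`, `naiveDomain` and `isIgnored` are hypothetical names, and the domain lookup is a deliberately naive stand-in that keeps the last two host labels, which is wrong for suffixes such as "co.uk".

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: one boolean switch plus a mode string.
class ExternalLinkCheck {

  // Naive stand-in for a real domain lookup: keeps the last two labels of the host.
  // Assumption for illustration only; a real crawler would consult a public suffix list.
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  // Lower-cased host of a URL, or an empty string when the URL has no host.
  static String hostOf(String url) {
    String host = URI.create(url).getHost();
    return host == null ? "" : host.toLowerCase(Locale.ROOT);
  }

  // Returns true when the outlink should be dropped as "external".
  static boolean isIgnored(String fromUrl, String toUrl,
      boolean ignoreExternal, String mode) {
    if (!ignoreExternal) {
      return false; // filtering is switched off entirely
    }
    String from = hostOf(fromUrl);
    String to = hostOf(toUrl);
    if ("domain".equalsIgnoreCase(mode)) {
      // Domain mode: compare the (naively derived) domains instead of full hosts.
      from = naiveDomain(from);
      to = naiveDomain(to);
    }
    return !from.equals(to);
  }

  public static void main(String[] args) {
    String page = "http://www.example.com/index.html";
    String link = "http://blog.example.com/post";
    System.out.println("host mode ignores link:   " + isIgnored(page, link, true, "host"));
    System.out.println("domain mode ignores link: " + isIgnored(page, link, true, "domain"));
  }
}
```

With a single mode value the two notions of "external" cannot both be active at once, which is the mutual exclusivity the comment asks for; any value other than "domain" falls back to host comparison in this sketch.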
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646543#comment-14646543 ]

Julien Nioche commented on NUTCH-2069:
--------------------------------------

What code restyle? I applied the formatting rules from 2.x as expected. They should be copied to trunk, BTW. Looks like Lewis did not use them.
[jira] [Commented] (NUTCH-2069) Ignore external links based on domain
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646504#comment-14646504 ]

Markus Jelsma commented on NUTCH-2069:
--------------------------------------

Fine with the feature, but there's a lot of clutter in the patch. Are you not happy with the code restyle Lewis did? I am not sure I see a lot of use for e.g.

{code}
@@ -678,19 +697,18 @@
             // Check whether we'll follow external outlinks
             if (outlinksIgnoreExternal) {
-              if (!URLUtil.getHost(url.toString()).equals(
-                  URLUtil.getHost(followUrl))) {
+              if (!URLUtil.getHost(url.toString())
+                  .equals(URLUtil.getHost(followUrl))) {
                 continue;
               }
             }
-            reporter
-                .incrCounter("FetcherOutlinks", "outlinks_following", 1);
+            reporter.incrCounter("FetcherOutlinks", "outlinks_following", 1);
             // Create new FetchItem with depth incremented
             FetchItem fit = FetchItem.create(new Text(followUrl),
-                new CrawlDatum(CrawlDatum.STATUS_LINKED, interval),
-                queueMode, outlinkDepth + 1);
+                new CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode,
+                outlinkDepth + 1);
             ((FetchItemQueues) fetchQueues).addFetchItem(fit);
             outlinkCounter++;
{code}

And besides, this would force me to completely rewrite some patches as well, which I already had to do because of the code style change ;)